By Glenn Paulley on October 13th, 2011
Yesterday afternoon I listened to Alon Halevy of Google Research present a lecture at the Sybase-sponsored Database Seminar Series at the University of Waterloo. Alon’s talk was entitled “Bringing (Web) Databases to the Masses” – here is Alon’s abstract:
The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that can encourage publishing more data sets from governments and other public organizations and support new data management opportunities, such as effective crisis response, data journalism and crowd-sourcing data sets. To enable such wide-spread dissemination and use of structured data on the Web, we need to create a ecosystem that makes it easier for users to discover, manage, visualize and publish structured data on the Web.
I will describe some of the efforts we are conducting at Google towards this goal and the technical challenges they raise. In particular, I will describe Google Fusion Tables, a service that makes it easy for users to contribute data and visualizations to the Web, and the WebTables Project that attempts to discover high-quality tables on the Web and provide effective search over the resulting collection of 200 million tables.
Alon gave a super-interesting talk, motivating the University of Waterloo students in the audience to dream about what the web could be. Google Fusion Tables and WebTables are steps in that direction. My notes from Alon’s talk follow:
Structured data and the Web
- there is a huge amount of structured data on the web: reference data, hobbies, products, etc. Currently we are primarily reacting to this existing data.
- but the web could be a platform for getting more data out: government data, crime data, data regarding water conditions, the list is (nearly) boundless. Hence we could be more proactive in our use of the web.
- but the web could also enable new kinds of data collection and management: collaboration, crowd-sourcing, real-time data, crisis response, and could then be used to invent a much brighter future
There is currently no single system that will achieve all these goals. Key processes that are required include data cleaning, querying, sharing, integrating, and visualizing. In particular, data visualization is key: data volumes on the web are such that visualization techniques are necessary to be able to synthesize it. Synthesis should be easy – and visualization is a prerequisite. With visualization, one could really tell a story by using a sequence of visualizations.
Challenges
The challenges in making the web a better place for information sharing and collaboration include:
- finding high quality structured data
- getting data out of silos
- extracting data from web pages
- databases are hard to use
- integrating heterogeneous data
- publishing data is cumbersome
Two projects at Google are in progress that go towards solving these problems:
- Google Fusion Tables – a database management service for the web; and
- Google WebTables
Google Fusion Tables
The first goal of Google Fusion Tables is to develop an easy-to-use DBMS that is integrated with the web. Google Fusion Tables key features are:
- easy upload (CSV, KML, spreadsheets)
- sharing (even outside your company)
- visualizations front and center
- easy publishing
and, perhaps most importantly, for all of these services to available with zero administration (no DBA).
The second goal of Google Fusion Tables is to provide an integrated, poly-structured data cloud – to enable the discovery of others’ data and combine it with your own. Hence users can create tables, then create visualizations and share them. Fusion Tables has an API so application developers can develop applications that populate Fusion Tables, synthesize that data with other shared data on the web, and then visualize the result. From Alon’s talk, there is a significant thrust towards data visualization – the predictable problem is that with massive amounts of data, you must employ visualization techniques in order to analyze it.
It turns out that as part of the Fusion Tables initiative, it sort of “happened” that Google built a GIS “in the cloud” as part of the infrastructure. Challenges included “trickling”: showing only a small number of geometry or geographic features, i.e. points or polygons, from a large data set. To automatically render such data, the software needs to thin polygons, clip them to the window, and style features on the fly – and do it all in under 100ms.
Google WebTables
In his talk, Alon indicated that a 3-yr-old Google study found approximately 14 billion English HTML “tables” (i.e. English-language web content with TABLE tags.) Of these 14 billion, over 99% are “uninteresting”. Part of the problem is that there is a brittle relationship between table values and their semantics; sometimes the semantics are hidden and complicated, and on the web one does not readily have a domain model or context to exploit to analyze data semantics, particularly when you introduce cultural differences and differences in language.
In a previously-published paper (WebTables: Exploring the Relational Web (VLDB 2008)), Google researchers found that of this corpus of 14B raw “tables”, they estimated that 154 million were “good” relations (i.e. they could be used for structured data analysis). WebTables is designed to recover “good” relations from a crawl and enable search – which Alon referred to as discovering a (structured) needle in an (unstructured) haystack. Once these are discovered, one can then build novel applications using that data. To capture semantic meaning, such applications can again rely on the web, which has the same scope of semantic meanings that match the content of the tables themselves, and thus one can use the web to develop ontologies to categorize and analyze the original data.
So of the original 14B “tables”, 154 million are “good” tables, with 2.6 million distinct schemas, and 5.4 million total attributes. With the semantic information from the web, one can join these schemas by similarity in attribute names through synonym discovery.
Conclusions
During the talk, Alon showed a variety of examples that illustrated how one can use Google Fusion Tables to combine data sets and create information from their combination. One example Alon showed is a map that combines earthquake instances since 1973 with current nuclear installations, which conveys a lot of meaningful information very quickly.
A clear thrust of this research are two requirements: ease of use and zero administration. Ideally, creating information should require the only the minimal computing knowledge to accomplish the task. Google is certainly pushing the envelope in that direction. In this concluding remarks, Alon hinted that we can expect more developments in this area over the next while, as Google refines better techniques to discover structured data on the web and unify that data with information from other sources.
Various academic papers on WebTables and Google Fusion Tables have been published in recent conferences:
WebTables: VLDB 2008, VLDB 2009, VLDB 2011
Fusion Tables: SIGMOD 2010, SOCC 2010 – and see http://google.com/fusiontables
Communications of the ACM, February 2011
Posted in: Cloud computing · Computer Science education
By Glenn Paulley on October 10th, 2011
Following six successful, previous workshops, the IEEE Data Engineering Workgroup on Self-Managing Database Systems has announced the 7th International Workshop on Self-Managing Database Systems, which will be co-located with the 2012 IEEE Conference on Data Engineering which will be held in Washington, DC from April 1-5, 2012.
The Workshop is being organized by Alejandro Buchmann of Technische Universität Darmstadt (Germany) and Malu Castellanos of HP-Labs. The PC includes Ken Salem and Ashraf Aboulnaga from the University of Waterloo, Shivnath Babu from Duke University, Guy Lohman and Sam Lightstone from IBM, Natassa Ailamaki from EPF Lausanne, and our own Ani Nica from Sybase Waterloo’s SQL Anywhere engineering team.
Autonomic, or self-managing, systems are a promising approach to achieve the goal of systems that are easier to use and maintain in the face of growing system complexity. A system is considered to be autonomic if it is self-configuring, self-optimizing, self-healing and/or self-protecting. The aim of the SMDB workshop is to provide a forum for researchers from both industry and academia to present and discuss ideas and experiences related to self-management and self-organization in all areas of Information Management (IM) in general. SMDB targets not only classical databases but also the new generation of storage engines such as column stores, key-value stores, and in-memory databases. Beyond databases, SMDB aims to cover autonomic aspects of data-intensive systems represented by large-scale map-reduce (e.g., Hadoop) and cloud environments, where much work on self-management is needed. Last but not least, SMDB seeks to expand its horizons to include self-management of non-traditional, new areas of IM such as social networks, distributed gaming, and peer-to-peer systems.
A slight change in emphasis is proposed for this coming workshop:
For the 2012 SMDB workshop, we want to continue to attract researchers from both the core database and other communities, such as the adaptive and event-based systems communities as enabling technologies for self-managing systems, and data-intensive internet-scale distributed systems, which should benefit from research results in SMDBs.
The plethora of new distributed, web-scale database systems – including the soon-to-be launched Beta of SQL Anywhere On Demand Edition – that have been developed over the past two to three years, and the research behind them, should make for a very interesting workshop program.
Important Dates
- Paper submission deadline: November 14, 2011
- Notification of acceptance: December 7, 2011
- Camera-ready copies due: December 21, 2011
- Workshop date: April 1, 2012
Posted in: Self-managing database systems
By Glenn Paulley on September 9th, 2011
The next version of the ISO SQL Standard, SQL/2011, is due to go to FDIS (Final Draft International Standard) ballot for approval by ISO member countries in the next month or so. The international committee that manages the SQL standard’s development is ISO/IEC JTC 1/SC 32 Data Management and Interchange/WG 3 – Database Languages. In 2007, WG3 decided to freeze four of the nine parts of the SQL standard (such as SQL/CLI). The remaining five parts, namely SQL/Framework, SQL/Foundation, SQL/PSM, SQL/Schemata, and SQL/XML, all contain revisions that will be part of the SQL/2011 release.
ISO approval of SQL/2011 as an international standard (IS) is anticipated in Q4 2011 or Q1 2012.
Briefly, the enhancements made to SQL/2011 include:
- various extensions to
WINDOW queries and window aggregate functions, some of which are already present in current versions of SQL Anywhere. These include the following optional language features:
- T614 – the
NTILE window aggregate function
- T615 – the
LEAD and LAG window aggregate functions, along with feature T616, which is the NULL treatment option for the LEAD and LAG functions.
- T617 – the
FIRST_VALUE and LAST_VALUE window aggregate functions, already supported in SQL Anywhere.
- T618 – the
NTH_VALUE window aggregate function
- T619 – support for nested window functions
- T620 – support for the
GROUPS option in the WINDOW clause.
- Optional SQL language feature T495, the ability to use the output of an
INSERT, UPDATE, DELETE, or MERGE statement as a derived table in an SQL query (or other SQL DML statement) – already supported by SQL Anywhere.
- the ability to use named parameters in a
CALL statement (language feature T521), with or without default parameters (language feature T522). Both of these features are already supported by SQL Anywhere, though using slightly different syntax to denote a named parameter (“=” in SQL Anywhere, “=>” in the SQL standard).
- Optional SQL language feature F492, the ability to defer or not enforce uniqueness and foreign key constraints.
- various language features (F860 through F867) to better support the pagination of result sets, required by RESTFUL application paradigms including web-based applications developed using NHibernate or Hibernate. Pagination uses a
FETCH clause as part of a SELECT statement, rather than the TOP/START and LIMIT/OFFSET syntax variants supported by SQL Anywhere in existing releases.
- preliminary support for temporal tables, including system-time temporal tables (feature T180) and application-time temporal tables (feature T181). Comprehensive temporal query support is not included in SQL/2011.
- a wide array of other minor language extensions and a multitude of bug fixes that clarifies existing SQL semantics from SQL/2008.
Posted in: SQL Standard
By Glenn Paulley on August 15th, 2011
Attendees can now register for the 2011 Ontario Celebration of Women in Computing (ONCWIC) conference that will be held at the University of Toronto on 21-22 October 2011.
The Ontario Regional conference is modeled after the well-known Grace Hopper Celebration (GHC) held annually in the USA. The goal of ONCWIC is to provide students, faculty and professional women in Ontario opportunities for networking, sharing and mentoring, similar to those provided by the GHC. It is hoped that ONCWIC will help build the computing community within Ontario, with significant benefits to the attendees and their associated organizations.
The ONCWIC program focuses on research, academic and industry opportunities and networking. A combination of plenary presentations delivered by ICT leaders, plus interactive group discussions, poster sessions and great social events make this event a must for women involved in computing in both industry and academia.
This year’s conference follows a successful and well-received event held at Queen’s University in Kingston in October 2010.
Primary sponsors of this year’s event include Google, IBM Canada Limited, Research in Motion, Royal Bank of Canada, NSERC, MITACS, and the Computer Science department at the University of Toronto.
Thanks to Wendy Powley of Queen’s for sending this my way.
Posted in: Computer Science education
By Glenn Paulley on August 10th, 2011
The 3.2.0 GA release of the NHibernate object-relational mapping toolkit was recently released on July 30. Thanks to the efforts of Julian Maughan, the base NHibernate distribution contains two SQL Anywhere dialects, named SybaseSQLAnywhere10Dialect.cs and SybaseSQLAnywhere11Dialect.cs, that can be used largely unaltered with SQL Anywhere versions 10 and 11. However, some SQL Anywhere software configuration changes in the 12.0.1 release, and the lack of an NHibernate dialect for SQL Anywhere 12 in the NHibernate 3.2.0 distribution, has led me to re-visit NHibernate support for SQL Anywhere 12.0.1.
In January 2010 I had previously posted an NHibernate dialect for SQL Anywhere 12.0.0 for the NHibernate 2.1.0 release. Version 3 of NHibernate has brought about several changes, including the requirement for Microsoft Visual Studio 2010 if you want to be able to build the distribution yourself – and you’ll need to do so in order to have the best software configuration to work with SQL Anywhere 12.0.1.
The executive summary is as follows:
- The SQL Anywhere 12 dialect I posted in January 2010 is largely unaltered but for two minor changes. The first is that its class name, SQLAnywhere12Dialect, has been changed to SybaseSQLAnywhere12Dialect, in keeping with the changes to the class names for the dialects for SQL Anywhere 11 and 12 that Julian made for the NHibernate 3.1.0 distribution. The second change is for the dialect to refer to the metadata class SybaseSQLAnywhere11DataBaseMetaData – see my next point.
- The new metadata class SybaseSQLAnywhere11DataBaseMetaData in the /NHibernate/src/NHibernate/Dialect/Schema directory properly supports the GetReservedWords() API, a problem previously reported by Paul Gibson.
- A new driver in /Nhibernate/src/NHibernate/Driver, named SybaseSQLAnywhere12Driver, is required to support the changes in SQL Anywhere 12.0.1 to the ADO.NET library names for the SQL Anywhere ADO.NET provider.
Included here is a .zip archive of the new source files, along with an instance of HelloNHibernate that you can use for sanity testing with a recompiled NHibernate 3.2.0 DLL. Note that you’ll need to copy the SQL Anywhere .NET provider DLL, iAnywhere.Data.SQLAnywhere.v4.0.dll, into the HelloNHibernate\bin\Debug directory in order to use the included Visual Studio solution. You’ll also require a SQL Anywhere 12.0.1 server to start the included database, helloseq.db, which contains an Employee table with a sequence as its primary key.
SybaseSQLAnywhere12Dialect.cs
The newly-named SybaseSQLAnywhere12Dialect class builds on the SybaseSQLAnywhere11Dialect class included with the NHibernate 3.2.0 distribution. In the main, the SybaseSQLAnywhere12Dialect class provides support for ISO standard SQL sequences, which are now supported in SQL Anywhere 12. There are two other minor changes, one is support for the TIMESTAMP WITH TIME ZONE data type (also known as DATETIMEOFFSET) and the other, as mentioned above, is to use the repaired metadata class SybaseSQLAnywhere11DataBaseMetaData. Here is the code for SybaseSQLAnywhere12Dialect.cs, which should be placed in the NHibernate/src/NHibernate/Dialect directory:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
| using System;
using System.Collections;
using System.Data;
using System.Data.Common;
using System.Text.RegularExpressions;
using NHibernate.Dialect.Function;
using NHibernate.Dialect.Lock;
using NHibernate.Dialect.Schema;
using NHibernate.Engine;
using NHibernate.Exceptions;
using NHibernate.Mapping;
using NHibernate.SqlCommand;
using NHibernate.Type;
using NHibernate.Util;
using Environment = NHibernate.Cfg.Environment;
namespace NHibernate.Dialect
{
/// <summary>
/// SQL Dialect for SQL Anywhere 12 - for the NHibernate 3.2.0 distribution
/// Copyright (C) 2011 Glenn Paulley
/// Contact: http://iablog.sybase.com/paulley
///
/// This NHibernate dialect for SQL Anywhere 12 is a contribution to the NHibernate
/// open-source project. It is intended to be included in the NHibernate
/// distribution and is licensed under LGPL.
///
/// This library is free software; you can redistribute it and/or
/// modify it under the terms of the GNU Lesser General Public
/// License as published by the Free Software Foundation; either
/// version 2.1 of the License, or (at your option) any later version.
///
/// This library is distributed in the hope that it will be useful,
/// but WITHOUT ANY WARRANTY; without even the implied warranty of
/// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
/// Lesser General Public License for more details.
///
/// You should have received a copy of the GNU Lesser General Public
/// License along with this library; if not, write to the Free Software
/// Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
///
/// </summary>
///
/// <remarks> The SybaseSQLAnywhere12Dialect
/// uses the SybaseSQLAnywhere11Dialect as its base class.
/// SybaseSQLAnywhere12Dialect includes support for ISO SQL standard
/// sequences, which are defined in the catalog table <tt>SYSSEQUENCE</tt>.
/// The dialect uses the SybaseSQLAnywhere11MetaData class for metadata API
/// calls, which correctly supports reserved words defined by SQL Anywhere.
/// </remarks>
public class SybaseSQLAnywhere12Dialect : SybaseSQLAnywhere11Dialect
{
/// <summary></summary>
public SybaseSQLAnywhere12Dialect()
{
RegisterDateTimeTypeMappings();
RegisterKeywords();
}
new protected void RegisterKeywords()
{
RegisterKeyword( "NEAR" );
RegisterKeyword( "LIMIT" );
RegisterKeyword( "OFFSET" );
RegisterKeyword( "DATETIMEOFFSET" );
}
new protected void RegisterDateTimeTypeMappings()
{
RegisterColumnType(DbType.DateTimeOffset, "DATETIMEOFFSET");
}
// DDL support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/// <summary>
/// SQL Anywhere supports <tt>SEQUENCES</tt> using a primarily SQL Standard
/// syntax. Sequence values can be queried using the <tt>.CURRVAL</tt> identifier, and the next
/// value in a sequence can be retrieved using the <tt>.NEXTVAL</tt> identifier. Sequences
/// are retained in the SYS.SYSSEQUENCE catalog table.
/// </summary>
public override bool SupportsSequences
{
get { return true; }
}
/// <summary>
/// Pooled sequences does not refer to the CACHE parameter of the <tt>CREATE SEQUENCE</tt>
/// statement, but merely if the DBMS supports sequences that can be incremented or decremented
/// by values greater than 1.
/// </summary>
public override bool SupportsPooledSequences
{
get { return true; }
}
/// <summary> Get the <tt>SELECT</tt> command used retrieve the names of all sequences.</summary>
/// <returns> The <tt>SELECT</tt> command; or NULL if sequences are not supported. </returns>
public override string QuerySequencesString
{
get { return "SELECT SEQUENCE_NAME FROM SYS.SYSSEQUENCE"; }
}
public override string GetSequenceNextValString( string sequenceName )
{
return "SELECT " + GetSelectSequenceNextValString(sequenceName) + " FROM SYS.DUMMY";
}
public override string GetSelectSequenceNextValString( string sequenceName )
{
return sequenceName + ".NEXTVAL";
}
public override string GetCreateSequenceString( string sequenceName )
{
return "CREATE SEQUENCE " + sequenceName; // by default, is START WITH 1 MAXVALUE 2**63-1
}
public override string GetDropSequenceString( string sequenceName )
{
return "DROP SEQUENCE " + sequenceName;
}
// Informational metadata ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
public override IDataBaseSchema GetDataBaseSchema( DbConnection connection )
{
return new SybaseSQLAnywhere11DataBaseMetaData( connection );
}
}
} |
One final note about the dialect file per se. The SybaseSQLAnywhere10Dialect.cs included in the NHibernate 3.2.0 distribution contains a typo for the mapping of binary types; it references an erroneous type – LONG VARBINARY – rather than the correct LONG BINARY. My error. I’ll try to ensure (via Julian) that this is fixed in a subsequent NHibernate release.
[Read more →]
Posted in: Database interfaces and persistent objects · NHibernate · SQL Anywhere
By Glenn Paulley on August 7th, 2011
August 29, 2011 marks the date for the 3rd TPC Technology Conference on Performance Evaluation and Benchmarking, co-located with the 37th International Conference on Very Large Databases (VLDB) which will be held in Seattle, Washington from August 29th through September 3 at the Westin Hotel in downtown Seattle.
TPCTC 2011 is organized again this year by Miekel Poess of Oracle and Raghunath Nambiar of Cisco. The workshop’s goal is to bring together researchers and practitioners who together can help set the agenda for the study of database system performance analysis that will drive the research and development of commercial and open source database systems over the next few years. Topics to be addressed at the workshop include cloud computing, business intelligence, complex event processing, database query optimization, hardware innovations, and enhancements to existing TPC standard benchmarks.
The workshop’s proceedings will be published in a future edition of Springer-Verlag’s LNCS series. Keynote speakers featured at the workshop include HP’s Umesh Dayal and IBM’s Karl Huppler, the current chairman of the TPC.
While the notification date for authors has passed, at the time of writing the workshop agenda has yet to be published. However, this year’s workshop will undoubtedly include a thought-provoking list of presentations that will help set the database performance analysis agenda over the next few years.
Posted in: Computer Science education · Performance measurement
By Glenn Paulley on July 21st, 2011
Routinely, application developers trade off serializable transaction semantics in favour of better execution time performance by limiting the potential for lock contention. Few and far between are applications that execute at ISO/ANSI SQL isolation level 3, SERIALIZABLE. Indeed, the SQL Anywhere default isolation level is zero – READ UNCOMMITTED – except for JDBC applications, where the default is READ COMMITTED.
At the READ UNCOMMITTED isolation level with SQL Anywhere, only schema locks and write row locks are acquired by a transaction during its operation; read row locks are never acquired, and so at READ UNCOMMITTED write transaction do not block read transactions. On the flip side, however, SQL Anywhere does not guarantee semantics at the READ UNCOMMITTED isolation level. To use the common parlance, you get what you pay for. With many applications, the risk and/or impact of uncommitted rows is low; sometimes this can lead to complacency about what READ UNCOMMITTED really means. In this post, I want to illustrate an example where the impact is more obvious.
Set-level UPDATE operations
In the SQL Anywhere Version 5.5 release, circa 1997, we introduced full support for set-level UPDATE statements that could modify columns that were part of a table’s PRIMARY KEY, UNIQUE constraint, or part of a unique index, in support of the ISO SQL/1992 standard which was the current SQL standard at that time. To illustrate, suppose we have the following table:
1
| CREATE TABLE updkey ( a INTEGER PRIMARY KEY, b INTEGER UNIQUE, c VARCHAR(500) ) |
populated by the following INSERT statement:
2
3
| INSERT INTO updkey(a,b,c) SELECT row_num, row_num, 'test string'
FROM rowgenerator WHERE row_num <= 10 |
In this example, we desire to renumber the “b” values of all ten rows using a single atomic statement. We can do so as follows:
4
| UPDATE updkey SET b = 11-b, c = 'New value' |
Processing the UPDATE statement row-by-row clearly won’t do, since the update of any single row in the updkey table will immediately violate the uniqueness constraint on column “b”. (Aside: you may be thinking the WAIT_ON_COMMIT connection option might help here, but WAIT_ON_COMMIT only affects referential integrity constraints, not uniqueness constraints). Consequently, Version 5.5 of SQL Anywhere provided a different mechanism to perform the update, and it has implications for lower levels of concurrency control, as we shall see.
The HOLD temporary table
When the SQL Anywhere server processes an UPDATE or MERGE statement and encounters a uniqueness constraint violation on a primary key, unique index, or unique constraint, the server automatically creates an unnamed “hold” temporary table to temporarily store the problematic rows. The temporary table contains both the before and after values of a row, so that AFTER row and AFTER statement triggers can work correctly. Processing the rows is done row-by-row as follows:
- If the row can be modified without a uniqueness constraint violation, the update proceeds normally.
- If the modification causes a uniqueness constraint violation, then
- the row’s contents, along with its new values, are copied to the hold temporary table;
- the row, along with its index entries, is – temporarily – deleted from the base table. No
DELETE triggers are fired for this temporary deletion.
- any appropriate
AFTER row triggers are fired for this row.
Once all of the rows have been processed, any deleted rows that have been copied to the hold temporary table are then re-inserted into the base table, with the modified values from the UPDATE or MERGE statement. The order in which the rows from the hold temporary table are processed is not guaranteed. If the re-insertion of any of the saved rows still causes a uniqueness violation, then the entire UPDATE or MERGE statement is rolled back, and the uniqueness constraint violation is reported back to the application.
Only if all row modifications are successful are any AFTER statement triggers fired for the request.
Implications
The effect of deleting rows during the execution of an INSERT or MERGE statement can impact the results of
- an SQL statement that queries the same table, issued within an
AFTER row trigger that is fired for the UPDATE or MERGE statement that initiated the action; or
- any other connection, including event handlers, that are not executing at the
SERIALIZABLE or SNAPSHOT isolation levels.
The semantics of this processing of set-level update operations is somewhat counter-intuitive, since on the surface you might expect that another connection concurrently querying the table would either “see” the old row values, or the new row values. However, with set-level update operations on tables with uniqueness constraints, there is the possibility that other connections will not see a particular row at all, depending on the isolation level being used. If the other connection is executing at the SERIALIZABLE isolation level, it will block until the transaction doing the update issues a COMMIT or ROLLBACK. If the other connection is executing at SNAPSHOT isolation, that transaction will continue to see the original values of the modified rows for the duration of that transaction.
This detailed, complex behaviour has previously been undocumented. It will appear in the standard documentation in the next major release of SQL Anywhere.
Posted in: SQL Anywhere
By Glenn Paulley on July 20th, 2011
This week my kids are attending Engineering Science Quest, a super-successful, week-long summer camp program at the University of Waterloo. They’ve been to ESQ before, but this year they’re attending the Tesla technology camp, exposing them to Computer Science rather than the physical sciences as in previous years.
The UW undergraduate students who lead the campers are doing a terrific job in making it interesting. What the Tesla camp is not about is programming. Rather, it’s about using computing to do interesting things. Composing and editing synthetic music. Directing, shooting and editing a music video. Manipulating digital photographs with iPhoto and PhotoShop. And, for some hands-on experience, soldering integrated-circuit boards to construct something (I know not what).
With the fun they’re having, it’s no wonder they’re finding computing interesting (Doh!). Moreover, it is encouraging that so many of the participants are girls, particularly given the University of Waterloo’s current female undergraduate enrollment in Computer Science.
Coincidentally, this afternoon my colleague Steve Olson in Concord, NH shared with me two articles on how other institutions are making Computer Science interesting, particularly for girls. Bringing Girls Into the Science-Major Pipeline, an article published in The Chronicle for Higher Education, describes transforming a marketing approach to attract young people to Computer Science by, simply, surveying the intended audience:
As long as teenagers believe that computer science is boring, difficult, and antisocial, they won’t choose it as a career. But existing stereotypes can be challenged by changing the emphasis—by introducing computer professionals as the lively, interesting people that most of them are, by demonstrating that computer science is an exciting field that has a major impact on just about every aspect of human life, and by communicating all of that to young people using language and images that resonate with them.
The second article is Girls, their families and friends, Gather for Computer Science, which describes an initiative by Pacific University (Oregon) to introduce Computer Science to girls in Grade 7 and 8.
Both articles are a great read. My thanks to Steve for sending them my way.
Posted in: Computer Science education
By Glenn Paulley on July 5th, 2011
This past Monday I listened to Tim Brecht of the School of Computer Science at the University of Waterloo as he presented a paper entitled Our Troubles with Linux and Why You Should Care, a paper [1] co-authored by Tim, Peter Buhr and Ashif Harji which will be presented this coming Monday, 11 July 2011, at the Second ACM SIGOPS Asia-Pacific Workshop on Systems in Shanghai.
Tim’s main interest is computer systems performance, and the talk was about some of the issues Tim and his research group at the University of Waterloo have encountered with various versions of Linux. The paper is a result of his team’s frustrations with Linux as a computing platform; unsurprisingly, his team’s experience dovetails our own experiences with Linux systems here at Sybase iAnywhere. My notes from Tim’s talk follows:
- Paper overview: during the empirical testing of Ashif Harji’s PhD thesis, Ashif encountered three Linux kernel bugs with serious performance implications. Their discovery led to the complete re-running of previous experiments with considerable wasted effort, because the web service performance metrics they had produced were rendered meaningless.
- Why you should care:
- Performance bugs exist in all operating systems, but this seems particularly true with Linux, and has even spawned projects to monitor Linux kernel performance regressions.
- In the Linux community, the perception is that the “latest” Linux kernel is the best, but empirical testing shows that that assumption is absolutely wrong. The Linux performance bugs encountered are serious, and the outcomes include both severe performance regressions and non-repeatable experiments. Some of these bugs have existed in multiple versions of the kernel for three years or more. Conjecture: performance regressions are likely true of other open source projects that do limited performance regression testing.
- Linux as a research platform has important advantages: it’s open source, it’s used in production environments, and it offers a rapid kernel development cycle that keeps it up-to-date with new hardware platforms. However, Linux also has disadvantages: Linux is large and complex, it’s difficult to properly configure and tune, and the project’s rapid kernel development cycle can introduce bugs and performance regressions.
- The kernel bugs encountered by Ashif et al.:
- “small file evictions” – present from 2 March 2005 through 4 Aug 2007. The Linux kernel would always evict small files from file system cache, bypassing LRU.
- “prefetching disabled” – present from 17 June 2005 through 26 Feb 2008. When using the sendfile() function, prefetching on the input file would fail to occur due to coding errors.
- “erratic page evictions” – present from 9 Oct 2007 through 9 Dec 2010 (date of last test; problem could still exist today). The sendfile() function fails to mark pages of files as being accessed; consequently, the kernel evicts pages randomly, leading to random, unpredictable performance.
- Consequences of such bugs:
- there are undoubtedly problems in published papers. Researchers must find the underlying cause for performance changes; otherwise they may simply be confirming the existence of a kernel bug.
- there are considerable consequences for upgrading to a newer kernel. There is a strongly-held but mistaken belief that newer is better; but sometimes old bugs aren’t fixed, and new ones appear.
- As a researcher, debugging Linux kernel issues is necessary, but it is difficult, time consuming, and, at the end of the day, not that beneficial.
- Researchers should:
- verify application performance using well-thought-out performance regression tests;
- recheck results with each kernel upgrade;
- check for repeatability of experimental results (as in embrace the scientific method);
- appeal to their intuition, and ask questions such as “do these results exceed my expectations”?
- ensure a sound experimental environment (e.g. turn off address space randomization).
During the question period after the talk, Tim indicated that as a partial consequence of these sorts of problems, the number of Linux forks appears to be proliferating. It remains to be seen what the Linux landscape will look like over the next few years.
[1] Ashif Harji, Peter A. Buhr, and Tim Brecht (June 2011). Our Troubles with Linux and Why You Should Care. In Proceedings, Second ACM SIGOPS Asia-Pacific Workshop on Systems, Shanghai, China, July 11-12 2011.
Posted in: Operating systems
By Glenn Paulley on July 1st, 2011
This past Monday I listened to Shivnath Babu of Duke University present a lecture entitled “MADDER and Self-tuning data analytics on Hadoop with Starfish“. In a nutshell, the Starfish project is about the development of self-tuning and self-managing technology for Hadoop systems, and many of the same challenges that exist in self-managing relational database systems, such as SQL Anywhere, have their parallels in Hadoop systems.
Papers and source code for Starfish can be found here, made available under the Apache license.
My notes from Shivnath’s talk (slides can be found here):
- Big data analytics is often used in the context of companies such as Google, Yahoo!, Facebook, and eBay. However, web-scale data exists in other realms as well: scientists, journalists, economists, biologists, physicists, and systems researchers all have a “big data” problem. What matters is not just raw size, but also workload complexity: counts and aggregates, are one example, text/image processing is another.
- “Madder” comes from “Mad”, which is from Joe Hellerstein and others at UC Berkeley. “Madder” is an acronym for Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, and Robust:
- Magnetic: easy to get data into the system;
- Agile: make changes (data and/or requirements) easy;
- Deep: support the full spectrum of analytics – write MapReduce programs in Java, Python, or R, or use interfaces like Pig or Jaql;
- Data-lifecycle-aware: difficult to quantify – consider all phases of data processing over information; not just loading, but processing, archiving, and so on;
- Elastic: adapt resources and costs to the actual workload;
- Robust: graceful degradation during unanticipated events or increases in workload.
- What are the tuning problems with Hadoop clusters? They include (1) job-level MapReduce configuration, (2) workflow optimization, (3) workload management, (4) data layout tuning, and (5) cluster sizing.
- A MapReduce job can be represented as a 4-tuple: job j = < program p, data d, resources r, configuration c >. But Hadoop has upwards of 190 configuration parameters, and the space of potential choices includes the number of Map tasks, the number of Reduce tasks, the amount of partitioning between Map and Reduce tasks, memory allocation, and so on. With so many configuration parameters, optimal execution can be a problem.
- Starfish contains three components: a Profiler, a “What-if” engine, and optimizers. The Profiler executes MapReduce jobs to collect job profiles (a concise execution summary) that includes the recording of information of task “phases” (read, map, collect, spill, merge). The “What-if” engine, given a profile of j = < p, d, r, c > estimates the profile for j’ = < p, d', r', c' >. The optimizers enumerate execution plans but analyzing the optimization space efficiently.
- Generating execution profiles can be done either via measurement, or via the What-if engine. The goals of generating profiles via measurement are (1) have zero overhead when off, low overhead when on; (2) require no mods to Hadoop itself; and (3) support unmodified MapReduce programs written in Java/Python/Ruby/C++.
- Profiling approach: dynamic on-demand instrumentation. Event-condition-action rules are specified in Java, which monitor task phases of MapReduce job execution. Currently this uses BTrace (Hadoop internals are in Java). Instrumentation introduces the capturning of raw profiling data at each JVM for both map and reduce steps. For jobs that involve large numbers of map tasks, sampling is used to reduce the amount of overhead. Total overhead percentage introduced is between 5% to 30%, depending on how much of the complete workload is profiled.
- The “What-if” engine takes a job profile, and hypothetical properties, resources, and configuration settings, and creates a virtual job profile. In Starfish today, this does not generalize across different programs, and the analysis is done through simulation. Eventually the goal is to move the “What-if” engine into Hadoop itself.
- “What if” questions that Stafish can answer:
- how will job j’s execution time change if the number of reduce tasks is changed from 20 to 40?
- how will compression impact Map performance? what if the amount of data is increased by 40%?
This is all a means to an end; goal of Starfish is to get to the Job Optimizer, which is to find the best configuration settings for a hypothetical job j = < p, d', r' >.
- Experimental results indicate that “What-if” analysis, based on cost-based optimization, is better than routinely applying “rules of thumb” to Hadoop configurations. However, Starfish does not (yet) support dynamic run-time adaptive processing, which could improve overall performance significantly.
Posted in: Analytics · Cloud computing · Self-managing database systems