Sybase iAnywhere SQL AAnywhere Mobile and Embedded Database

I'd rather play golf


Thoughts on data management, autonomic computing, and self-managing database systems.

header image

SQL/RPR available for initial public review

By Glenn Paulley on May 11th, 2012

SQL/RPR (Row Pattern Matching), a proposed new part to the ISO SQL standard under development in the United States under ANSI is now available for initial public review and comment.

Row-pattern matching in the standard is something I have blogged about previously, and has seen some other initiatives in the academic literature as well.

SQL/RPR is all about complex analysis of data streams, though in a SQL context (ie. over persistent data) rather than over a streaming database system. The big deal is supporting predicates whose expressions can refer to different tuples of an ordered intermediate result simultaneously.

Here’s the overview from RPR from the INCITS website:

This standard will specify the syntax and semantics of a new SQL capability to perform complex queries involving the relationships between many rows in a single (virtual or base) table. Detection and use of such relationships are critical aspects of many high-value applications. Sometimes called complex event processing, many business processes are driven from sequences of events. For example, security applications require the ability to detect unusual behavior definable with regular expressions. Financial applications that detect stock patterns are widely demanded. Fraud detection applications must recognize patterns in financial and other transactions. RFID processing requires the ability to recognize valid paths for RFID tags.

Extremely high interest in these capabilities has been shown by financial institutions, by the USA Department of Homeland Security and other government agencies, by large retailers and their suppliers, and transportation companies, among others.

Upon approval of the national SQL/RPR standard, which will be written as an amendment to Part 2 Foundation [FoundFDIS] of [SQL 2008], H2 [sic; should be DM32.2] expects to submit it to ISO/IEC JTC 1 possibly using the Fast-Track process with proposed maintenance in JTC 1/SC32 if approved.

The public review period is from now until June 18, 2012. The draft document is available here currently for a price of US$30. The $30 fee appears to be in error as the intent of the public review was to make the draft of RPR freely available. I hope to have an update on that in the very near future.

→ 3 CommentsPosted in: SQL Standard

Call for participation: DBTEST 2012

By Glenn Paulley on April 16th, 2012

The Fifth International Workshop on Testing Database Systems will be held on 21 May 2012, co-located with the ACM SIGMOD Conference held in Scottsdale, Arizona. The aim of the Workshop is to bring together academics and practitioners to discuss the complexities of testing DBMS systems and applications:

There is significant interest in testing database systems and applications within both the database and the software engineering communities. The goal of DBTest 2012 is to bring together researchers and practitioners from academia and industry to discuss key problems and ideas related to testing database systems and applications. We expect that this collaboration will facilitate the creation of research agendas and new techniques to address testing problems for database systems and database applications. The long-term objective of such work is to reduce the cost and time required to test and tune database products so that users and vendors can spend more time and energy on actual innovations.

The workshop program was finalized over the weekend by workshop co-chairs Eric Lo of Hong Kong Polytechnic and Florian Waas of EMC/Greenplum and the keynote talk by Oege de Moor, along with the 12 accepted papers, chosen from the 26 submissions, are all interesting and should make for a great workshop.

→ No CommentsPosted in: Computer Science education · Self-managing database systems

Limitations of proxy tables

By Glenn Paulley on March 14th, 2012

Proxy tables, sometimes referred to Remote Data Access or OMNI, are a convenient way to query or modify tables in different databases all from the same connection. SQL Anywhere’s proxy tables are an implementation of a loosely-coupled multidatabase system. The underlying databases do not have to be SQL Anywhere databases – any data source that supports ODBC will do, so the underlying base table for the proxy can be an Oracle table, a Microsoft SQL Server table, even an Excel spreadsheet. Once the proxy table’s schema is defined in the database’s catalog, the table can be queried just like any other table as if it was defined as a local table in that database.

That’s the overall idea, anyway; but there are some caveats that get introduced as part of the implementation, and I’d like to speak to one of these in particular. My post is prompted by a question from a longstanding SQL Anywhere customer, Frank Vestjens, who in early February in the NNTP newsgroup sybase.public.sqlanywhere.general queried about the following SQL batch:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
BEGIN
  DECLARE dd DATE;
  DECLARE tt TIME;
  DECLARE resultaat NUMERIC;
  //
  SET dd = '2012-06-07';
  SET tt = '15:45:00.000';
  //
  MESSAGE dd + tt TYPE info TO console;
  //
  SELECT FIRST Id INTO resultaat
  FROM p_mmptankplanning
  WHERE arrivalDate + IsNull(arrivaltime,'00:00:00') <= dd+tt
  ORDER BY arrivaldate+arrivalTime,departuredate+departureTime;
END

The batch works fine with a local table p_mmptankplanning but gives an error if the table is a proxy table; the error is “Cannot convert 2012-06-0715:45:00.000 to a timestamp”.

Operator overloading

In SQL Anywhere, multidatabase requests are decomposed into SQL statements that are shipped over an ODBC connection to the underlying data source. In many cases, the complete SQL statement can be shipped to the underlying server, something we call “full passthrough mode” as no post-processing is required on the originating server – the server ships the query to the underlying DBMS, and that database system returns the result set which is percolated back to the client. Since the originating server is a SQL Anywhere server, the SQL dialect of the original statement must be understood by SQL Anywhere. If the underlying DBMS isn’t SQL Anywhere, then the server’s Remote Data Access support may make some minor syntactic changes to the statement, or try to compensate for missing functionality in the underlying server.

The SQL statement sent to the underlying DBMS, whether or not the statement can be processed in full passthrough mode or in partial passthrough mode, is a string. Moreover, SQL Anywhere can ship SELECT, INSERT, UPDATE, DELETE and MERGE statements to the underlying DBMS – among others – but lacks the ability to ship batches or procedure definitions.

So in the query above, the problem is that the query refers to the date/time variables dd and tt, and uses the operator + to combine them into a TIMESTAMP. Since SQL Anywhere lacks the ability to ship an SQL batch, what gets shipped to the underlying DBMS server is the SQL statement

1
2
3
4
SELECT FIRST Id INTO resultaat
  FROM p_mmptankplanning
  WHERE arrivalDate + IsNull(arrivaltime,'00:00:00') <=  '2012-06-07' + '15:45:00.000'
  ORDER BY arrivaldate+arrivalTime,departuredate+departureTime;

and now the problem is more evident: in SQL Anywhere, the ‘+’ operator is overloaded to support both operations on date/time types, and on strings; with strings, ‘+’ is string concatentation. When the statement above gets sent to the underlying SQL Anywhere server, it concatenates the two date/time strings to form the string ’2012-06-0715:45:00.000′ – note no intervening blank – and this leads directly to the conversion error. Robust support for SQL batches would solve the problem, but we have no plans to introduce such support at this time. A workaround is to compose the desired TIMESTAMP outside the query, so that when converted to a string the underlying query will give the desired semantics. However, even in that case care must be taken to make sure that the DATE_ORDER and DATEFORMAT option settings are compatibile across the servers involved.

My thanks to my colleague Karim Khamis for his explanations of Remote Data Access internals.

Comments OffPosted in: SQL Anywhere

NHibernate 3.3.0CR1 released

By Glenn Paulley on March 13th, 2012

Yesterday the first release candidate of NHibernate 3.3.0 was made available; you can find it here on sourceforge.net.

In addition to the numerous bug fixes for NHibernate’s LINQ support, of interest to SQL Anywhere users will be some bug fixes and changes for SQL Anywhere 12.0.1 that I have described previously in this forum. These changes include:

  • A fix to the type mappings in SybaseSQLAnywhere10Dialect for binary data to LONG BINARY;
  • A new driver module, NHibernate.Driver.SybaseSQLAnywhereDotNet4Driver, linked to the SybaseSQLAnywhere12Dialect, to handle the renaming of the assemblies for the SQL Anywhere ADO.NET provider in SQL Anywhere version 12. Unfortunately, specifying a different assembly name for a .NET provider isn’t a configuration parameter in NHibernate; the driver name is embedded directly in the NHibernate library.

Thanks to Julian Maughan for committing the changes to the NHibernate distribution.

→ 2 CommentsPosted in: NHibernate

Robustness and error handling

By Glenn Paulley on February 29th, 2012

I don’t usually “re-tweet” someone else’s blog post. There are enough bits of information data flying through the Internet that I don’t need to duplicate any more of them. At least, no more than absolutely necessary.

However, today I am going to make an exception and draw your attention to a recent post by Amazon Web Services VP James Hamilton entitled “Observations on Errors, Corrections, & Trust of Dependent Systems“. I have known James for the better part of twenty years and his writing on infrastructure efficiency, reliability, and scaling makes for compelling reading. If you’re not reading James’ blog, Perspectives, on a regular basis, you should be. In part, here’s what James had to say about the need for ECC memory in both client and server systems:

The immediate lesson is you absolutely do need ECC in server application[sic] and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.

James’ post is timely because this week I was asked by SQL Anywhere Product Manager Eric Farrar to respond to a request from an OEM hardware infrastructure manufacturer for feedback regarding future product designs.

Other than the obvious “cheaper and faster” my wish is for one thing: robustness.

As James describes in his post, at scale, “hardware” failures are rife, causing errors, logical and physical data corruptions, system outages, crashes, you name it. I placed “hardware” in quotation marks deliberately because today’s disk and flash memory hardware contains a vast quantity of software as well; the microcode for the filesystem on compact flash (CF) and SD is OEM’d and consists of thousands of lines of code. ECC correction would be nice, not only for flash or traditional magnetic media but also for RAM, as James notes. Yet even with ECC correction, things aren’t that rosy, particularly with “commodity” hardware. Consider the abstract of this IEEE paper [1] from Remzi Arpaci-Dusseau‘s storage group at the University of Wisconsin:

We use type-aware pointer corruption to examines Windows NTFS and Linux ext3. We find that they rely on type and sanity checks to detect corruption, and NTFS recovers using replication in some instances. However, NTFS and ext3 do not recover from most corruptions, including many scenarios for which they possess sufficient redundant information, leading to further corruption, crashes, and unmountable file systems. We use our study to identify important lessons for handling corrupt pointers.

I have written about storage stack corruption at various times in the past – see here and here. That corruption need not be permanent to cause problems: logical (data) corruption caused by transient failures can be just as bad as permanent ones. James’ point is that corruption detection and mitigation needs to take place in all system components, including RAM.

All of this is bad enough. Yet the situation isn’t helped by the lack of standards in this area. We know from the experience of our customers that various systems fail to meet expected behaviour with respect to I/O semantics – these behaviours are, sometimes, deliberately changed in the name of better performance, but at the expense of robustness. SQL Anywhere customers are well advised to read this whitepaper entitled “SQL Anywhere I/O Requirements for Windows and Linux” for background information on what I/O semantics your server systems must support.

[1] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift (June 2008). Analyzing the Effects of Disk Pointer Corruption. In Proceedings of the International Conference on Dependable Systems and Networks, Anchorage, Alaska, pp. 502-511.

Comments OffPosted in: Hardware · Operating systems

Call for Papers: 2012 DBTEST Workshop, Scottsdale, AZ

By Glenn Paulley on February 7th, 2012

This year’s DBTEST Workshop, being organized by Florian Waas of EMC/Greenplum and Eric Lo of Hong Kong Polytechnic University, is co-located with the 2012 ACM SIGMOD Conference and will be held on Monday, 21 May 2012 in Scottsdale, Arizona.

Eric and Florian, along with the workshop’s program committee which includes Guy Lohman (IBM), Jens Dittrich (University of the Saarland), Meikel Poess (Oracle), and myself, are soliciting six-page extended abstracts on a wide variety of topics related to database application and database system testing:

Besides technical contributions around test ideas, test frameworks and methodologies, we also solicit vision papers and war stories regarding quality issues. As somebody who is actively involved with the aspects of actually building data management systems I’m certain you either have specific systems knowledge that could be help and advice for other researchers or open questions that may inspire colleagues to start researching a topic! We count on practitioners to help us enrich the program with war stories that help workshop participants recognize common problems.

This year’s workshop features a keynote address from Oxford’s Oege de Moor, the principal developer behind the Semmle code analysis products, which features the query language .QL.

Papers are due Monday, March 12, 2012 (5PM Pacific Time). Notifications of acceptance will be sent on or before April 5, and camera-ready copies are due April 22.

See the Workshop site for additional information.

Comments OffPosted in: Computer Science education · Self-managing database systems · Semmle .QL

Database Self-Management: Taming the Monster

By Glenn Paulley on December 23rd, 2011

This past July, when Duke University’s Shivnath Babu presented his talk on Starfish at the University of Waterloo, Shivnath invited my team to contribute a paper to a special forthcoming issue of the IEEE Technical Bulletin on Data Engineering, to be published this December.

This special issue, edited by Shivnath and Kai-Uwe Sattler of Germany’s Ilmenau University of Technology, has now been published online, and constitutes a collection of experience papers on database self-management. Our contribution is entitled “Database Self-Management: Taming the Monster” [1] and the paper outlines some of the lessons and experiences of the SQL Anywhere development team over the twenty-year history of the SQL Anywhere product. Here’s the abstract:

We describe our experience in the development of self-managing technologies in Sybase SQL Anywhere, a full-featured relational database system that is commonly embedded with applications developed by Independent Software Vendors (ISVs). We illustrate how SQL Anywhere’s embeddability features work in concert to provide a robust data management solution in zero-administration environments.

Our short paper covers various technologies used within SQL Anywhere, and outlines how we have designed them to offer self-managing characteristics. The list of technologies includes our query optimizer, buffer pool management, self-tuning multiprogramming level, and our spatial data support.

This special issue of the IEEE Data Engineering Bulletin is a great read, as the other papers that make up the special issue are authored by other well-known pioneers in self-managing technology, including Surajit Chaudhuri and Nico Bruno of Microsoft Research.

[1] Mohammed Abouzour, Ivan T. Bowman, Peter Bumbulis, David E. DeHaan, Anil K. Goel, Anisoara Nica, G. N. Paulley, and John Smirnios (December 2011). Database Self-Management: Taming the Monster. IEEE Technical Bulletin on Data Engineering 34(4), pp. 6-11.

Comments OffPosted in: Database Management Systems · Self-managing database systems · SQL Anywhere

SQL:2011 is published

By Glenn Paulley on December 16th, 2011

As I mentioned in September, the latest version of the ISO SQL Standard had been in the final draft (FDIS) stage through the autumn, and yesterday, 15 December, SQL:2011 became the official version of the SQL standard and is now available for purchase on the ISO website.

The SQL:2011 standard is not yet available from ANSI’s webstore but I expect it will be available shortly and I will post a comment when copies from ANSI become available.

→ 3 CommentsPosted in: SQL Standard

Using Cassandra as a cloud file system

By Glenn Paulley on November 4th, 2011

On Wednesday this week (November 2) I had the opportunity to listen to Ken Salem of the University of Waterloo present a lecture entitled “A Scalable, Available Storage Tier for RDBMS” to the Database Systems Group at UW’s Cheriton School of Computer Science. This joint work, with co-authors Ashraf Aboulnaga and post-doctoral fellow Rui Liu, considers the use of the Cassandra distributed database system as a robust, distributed filesystem to support a cloud of MySQL database servers. For me, the most interesting part of the project – which is currently nicknamed DBECS (pronounced D-Backs – as in the Arizona DiamondBacks major league baseball team) – is how closely this research project dovetails with the goals of SQL Anywhere’s On Demand Edition (codenamed “Fuji”), particularly its approach towards the use of independent tenant databases, rather than multi-tenancy solutions such as Force.com. My notes follow:

Overview

The objective of the DBECS project is to develop a multi-tenant database management service with elastic scalability of storage capacity, 24×7 operation, and with full transaction and SQL support. In general, “NoSQL” systems (ie. HBase, Cassandra) tradeoff functionality for scaleability/availability, via the implementation of simple key-value stores and atomic access to a single “record”. In contrast, the idea behind DBECS is to start with an established, relational DBMS that provides relatively complete ACID semantics and full SQL support, and see how the software infrastructure can be changed to support higher levels of scalability.

Aside: this is where DBECS mirrors what we at Sybase iAnywhere are trying to accomplish with SQL Anywhere On Demand Edition: offer full SQL and transaction support – indeed, requiring no application changes – and support higher scalability not by increasing the size of a multi-tenant database instance, but by scaling in the number of databases.

With DBECS, the idea is to:

  • Use a MySQL instance to host each tenant database;
  • Use Cassandra as the cloud storage tier to provide 24×7 availability and geographic distribution.

Benefits of this approach:

  • scalable, elastic storage capacity and bandwidth;
  • scalable, elastic tenancy;
  • highly available storage tier with no single point of failure;
  • full-function SQL support with ACID transactions.

There are, however, things that DBECS does not provide; in particular, there are no inherent improvements to the scaling of an individual, hosted MySQL database. However, an always-available storage tier might simplify high-availability mechanisms, and could be exploited to restore the database quickly in case of corruption.

The DBECS Prototype

The prototype system uses an everyday MySQL server (using INNODB) to host a tenant database instance, hence providing full MySQL facilities including transaction atomicity and durability without the need for application changes. (Aside: this is the same idea with SQL Anywhere Fuji).

Rather than use local disk for the database storage, however, the DBECS prototype uses Cassandra as a glorified, high-availability file system. The advantages of Cassandra are:

  • Cassandra uses a multi-master replication paradigm;
  • supports geographically-distributed, multiple data centers;
  • has partition tolerance;
  • supports fine grained (per operation) control of the consistency/performance tradeoff; that is, the application (MySQL in this case) can specify the desired semantics of each read and write operation (e.g. slow and consistent, fast but not ACID);
  • Cassandra supports client-controlled update serialization; in DBECS, the client is a MySQL DBMS, which understands something about updates and serialization.

An underlying assumption with Cassandra is that there is a single writer (MySQL instance) for each database. This means that the consistency issue is staleness issue, and not a concurrent write operation issue.

Using Cassandra

Cassandra stores “column families”, which are tables of semi-structued records accessed by key; the primitive operations in Cassandra are reading a field from a record, and updating a field in a record. Cassandra records are replicated and distributed by hashing keys.

Cassandra supports per-operation consistency specification. For example, Cassandra supports write(1) versus write(ALL) semantics. With the former, Cassandra distributes the write to all replicas, but the operation will return as soon as one copy has acknowledged the write operation. In contrast, write(ALL) semantics involve waiting until all copies have been written. Similarly, with read(1) versus read(ALL), the former returns a record from any copy; while the latter performs a read from each replica, and returns the record with the most recent timestamp. While read(ALL) provides “quorum consensus”, the issue with read(ALL) is not only the loss of performance due to the read latency of all of the replica copies of the record, but also that it can suffer from an availability problem: if not all replicas are available, the read operation will block. Other variations, such as read(2), are also possible.

Cassandara as a DBMS storage tier

As stated above, DBECS uses MySQL with its INNODB storage layer. INNODB issues block reads/writes to the underlying filesystem. With DBECS, a shim was added to intercept read/write calls to the I/O interface, and replace them with Cassandra read/write API calls to the Cassandra storage layer. In a nutshell, DBECS treats a Cassandra column family as a big (virtual) disk, addressable by block ID; so MySQL INNODB blocks are mapped to Casandra records, keyed by block ID where a key for a block is the triple < dbname , file number, block number >.

What’s interesting with this design are the inherent tradeoffs. For example,
write(ALL) coupled with read(1) can work but write performance is poor, and results in an availability problem since all copies must be available or the I/O operation will block. Ideally, one would want to use read(1) and write(1) exclusively, but the issue there is that read staleness (inconsistency) is interpreted by INNODB as a corrupt database. The problem is that with read(1)/write(1), you don’t get a guarantee that what the MySQL server wrote will be read back from the same node. With write(1) Cassandra only waits for one write operation before returning; eventually all writes will complete, but a read not from the completed write node will return stale data.

To counter this problem, Ken, Ashraf, and Rui consider the introduction of “optimistic I/O” semantics. Cassandra supports version numbers; so blocks are augmented with version numbers and timestamps to check for staleness. If a read operation matches the expected version number, the read is successful, but if a mismatch occurs then the read(1) operation can be retried (aggressive) or read(ALL) can be used (conservative).

Cassandra will detect and recover from node failures. However, with the above model, failures are not always hidden from the client (the MySQL INNODB storage layer). Moreover, there is a question as to the reliability and safety of write(1). To improve the fault tolerance, DBECS has added support for client-controlled I/O synchronization for better tolerance of Cassandra failures. When INNODB issues a file system sync call (fsync) to the underlying file system, Cassandra issues a “Cassandra sync” to try to guarantee durability by waiting long enough for multiple copies to be updated.

DBECS is an interesting project whose premise, like SQL Anywhere On Demand Edition, is the support and scalability of individual tenant databases in a cloud environment. Ken assures me that there is nothing in DBECS specific to MySQL – any database system would do, as long as the low-level I/O API calls could be replaced with the corresponding Cassandra API calls.

I hope to see this work in the literature in short order.

→ 2 CommentsPosted in: Cloud computing · MySQL · SQL Anywhere

SQL Anywhere at Insight Out

By Glenn Paulley on October 31st, 2011

On 20 October I had the privilege to address the 2011 Insight Out Conference in Tokyo, Japan, where the majority of attendees are Japanese business application developers and the speakers are drawn from a cross-section of the database industry, with Oracle, Microsoft, PostgreSQL, MySQL, and Sybase represented. Insight Out is spearheaded by Hiroki Mizokami of Insight Technology.

For the talk, I decided to do a one-hour overview of the self-managing features of SQL Anywhere 12, in particular highlighting two new server features in SQL Anywhere 12: auto-tuning of the server’s multiprogramming level, and the self-healing abilities to find and correct statistical errors with our self-tuning column histograms. I also added a very brief overview of SQL Anywhere On Demand Edition, codenamed “Fuji”, to further illustrate the need for self-managing data management when it is combined with cloud technology. Slides for the talk in English are here, and the same slides in Japanese are here.

As part of the motivation for the talk, I used the example of IvanAnywhere, Ian McHardy’s home-grown telepresence robot that my colleague Ivan Bowman uses to telecommute from his home office in Nova Scotia. My sincere thanks to our iAnywhere colleagues in Japan for translating the audio soundtrack for the Ivan Anywhere video, and the soundtrack for the Fuji “Gone Fishing Software” video, into Japanese – that effort was well worthwhile, making the content so much more accessible for the conference attendees.

In addition, my thanks to my iAnywhere Solutions K.K. colleagues for their warm hospitality during my stay in Japan. It was without question the best business trip of my 25-year career.

Comments OffPosted in: Cloud computing · SQL Anywhere