By Glenn Paulley on June 26th, 2012
The third annual Symposium on Cloud Computing, co-sponsored by ACM SIGMOD and ACM SIGOPS, will be held October 14-17 in San Jose, California. This is the first year that SoCC will not be co-located with another conference, which speaks volumes about the Symposium’s popularity and the growing importance of cloud computing. The program chairs for this year’s installment are Mike Carey from UC Irvine and Steven Hand from Cambridge.
Proposals are solicited for tutorials of either a half day (3 hours plus breaks) or full day (6 hours plus breaks). Submissions should be made via e-mail as a single PDF attachment sent to the SoCC PC Chairs and should include a cover sheet and an extended abstract. The cover sheet should specify:
- the title and the length of the tutorial;
- the intended audience (introductory, intermediate, or advanced);
- complete contact information for the contact person and other presenters;
- a brief biography for each presenter.
The 2-4 page extended abstract should include:
- motivation for having a tutorial on the proposed topic;
- an explanation of why the organizers think their proposed tutorial would benefit the cloud computing community and be of high interest and/or impact;
- an outline of the tutorial, along with descriptions of the course objectives and course materials.
- Tutorials Submission Deadline: July 13, 2012
- Notification of Acceptance: August 10, 2012
- Camera-Ready Summary Deadline: September 28, 2012
- Tutorial Day at SoCC: October 17, 2012
For questions regarding tutorial submissions please email the SoCC PC Chairs at email@example.com.
My thanks to Daniel Freedman, SoCC 2012 Publicity Chair, for sending this my way.
Posted in: Cloud computing · Computer Science education
By Glenn Paulley on June 4th, 2012
The Department of Mathematics and Computer Science at Mount Allison University in Sackville, New Brunswick is hosting the first annual Atlantic Canadian Celebration of Women in Computing during the weekend of October 13-14, 2012 at the Amherst Wandlyn Inn in Amherst, Nova Scotia. The goal of this workshop is to highlight the career opportunities and accomplishments of women in computing; it is inspired by the Grace Hopper events in the US and the similar ONCWIC workshops held in Ontario.
Poster and presentation submissions are due on or before September 21.
For further information, please contact:
Department of Math & Computer Science
Mount Allison University
67 York Street
Sackville, NB E4L 1E6
or by email to ACCWIC2012@MTA.CA.
Thanks to Wendy Powley of Queen’s University in Kingston, ON for sending this my way.
Posted in: Computer Science education
By Glenn Paulley on May 31st, 2012
Back in 2011 I wrote an article entitled “The seven deadly sins of database application performance” and I followed that introductory article in April 2011 with one regarding the first “deadly sin” that illustrated some issues surrounding weak typing within the relational model.
In this article I want to discuss the implications of concurrency control and, in particular, the tradeoffs in deciding to use the weaker SQL standard isolation levels READ UNCOMMITTED and READ COMMITTED.
Contention through blocking
Most commercial database systems that support the SQL Standard isolation levels [3] of READ COMMITTED, REPEATABLE READ, and SERIALIZABLE use 2-phase locking (2PL), commonly at the row level, to guard against update anomalies by concurrent transactions. The different isolation levels affect the behaviour of reads but not of writes: before modifying a row, a transaction must first acquire an exclusive lock on that row, which is retained until the transaction performs a COMMIT or ROLLBACK, thus preventing further modifications to that row by other transactions. Those are the semantics of 2PL.
Consequently, it is easy to design an application that intrinsically enforces serial execution. One that I have written about previously – Example 1 in that whitepaper – is a classic example of serial execution. In that example, the application increments a surrogate key with each new client to be inserted, yielding a set of SQL statements like:
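-- Read the next surrogate key for 'client' and advance it; this takes an exclusive row lock
UPDATE surrogate SET @x = next_key, next_key = next_key + 1
WHERE object_type = 'client';
-- Insert the new client under that key; the row lock is held until COMMIT or ROLLBACK
INSERT INTO client VALUES(@x, ...);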
Since the exclusive row lock on the ‘client’ row in the surrogate table is held until the end of the transaction, this logic in effect forces serialization of all client insertions. Note that testing this logic with one, or merely a few, transactions will likely fail to trigger a performance problem; it is only at scale that this serialization becomes an issue, a characteristic of most, if not all, concurrency control problems except for deadlock.
Hence lock contention, with serialization as one of its most severe forms, is difficult to test because the issues caused by lock contention are largely performance-related. They are also difficult to solve by increasing the application’s degree of parallelism, since that typically yields only additional waiting threads, or by throwing additional compute power at the problem, for, as sometimes stated by my former mentor at Great-West Life, Gord Steindel: all CPUs wait at the same speed.
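To make the serialization concrete, here is a minimal Python sketch of the surrogate-key pattern. It is only a simulation under stated assumptions: a `threading.Lock` stands in for the exclusive row lock on the ‘client’ row, and the names `SurrogateRow`, `insert_client`, and `run_inserts` are hypothetical, not part of any database API. However many threads are started, at most one insert transaction is ever in flight.

```python
import threading


class SurrogateRow:
    """Toy stand-in for the 'client' row of the surrogate table."""

    def __init__(self):
        self.next_key = 1
        # Models the exclusive row lock that 2PL holds until COMMIT or ROLLBACK.
        self.row_lock = threading.Lock()


def insert_client(row, clients, stats):
    """One transaction: take the next key, insert the client, then commit."""
    with row.row_lock:  # the UPDATE on surrogate acquires the row lock
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
        key = row.next_key
        row.next_key += 1
        clients.append(key)  # the INSERT into client, still under the lock
        stats["in_flight"] -= 1
    # Leaving the block models COMMIT: only now can the next transaction proceed.


def run_inserts(num_threads):
    """Start num_threads concurrent insert transactions and wait for all of them."""
    row = SurrogateRow()
    clients = []
    stats = {"in_flight": 0, "max_in_flight": 0}
    threads = [
        threading.Thread(target=insert_client, args=(row, clients, stats))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clients, stats
```

Calling `run_inserts(16)` reports a `max_in_flight` of 1: adding threads adds waiters, not throughput.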
Why not lower isolation levels?
With 2PL, write transactions block read transactions executing at READ COMMITTED or higher. The number and scope of these read locks increase as one moves to the SERIALIZABLE isolation level, which offers serializable semantics at the expense of concurrent execution in a mixed workload of readers and writers. Consequently it is logical to trade off the server’s guarantee of serialized transaction schedules for better performance by reducing the number of read locks to be acquired, and hence the amount of blocking. This strategy makes sense for many applications with a typical 80-20 ratio of read transactions to write transactions.
This tradeoff is not free, however; it is made at the expense of exposing the application to data anomalies that occur as the result of concurrent execution with update transactions. That exposure is, again, very hard to quantify: how would one attempt to measure the risk of acting on stale data in the database, or of overwriting a previously-modified row (often termed the “lost update” problem)? Once again, the problem is exacerbated at scale, which makes this risk difficult to analyze and measure during a typical application development cycle.
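The “lost update” problem is easy to reproduce by hand. The following Python sketch replays one fixed interleaving of two read-modify-write transactions against a single row, as can happen when read locks are not retained; the function names and the stock figures are invented for illustration.

```python
def lost_update_demo(initial_stock=10):
    """Replay a fixed interleaving of two read-modify-write transactions."""
    row = {"stock": initial_stock}  # the single row both transactions touch

    # T1 and T2 both read the row before either one writes it back.
    t1_seen = row["stock"]
    t2_seen = row["stock"]

    # T1 sells 3 units and commits.
    row["stock"] = t1_seen - 3

    # T2 sells 2 units, computing from its stale read, and commits.
    row["stock"] = t2_seen - 2

    return row["stock"]


def serial_demo(initial_stock=10):
    """The same two transactions, run one after the other."""
    row = {"stock": initial_stock}
    row["stock"] = row["stock"] - 3  # T1 reads, writes, commits
    row["stock"] = row["stock"] - 2  # T2 reads the committed value
    return row["stock"]
```

With the interleaved schedule the row ends at 8 and T1’s sale has vanished; the serial schedule ends at 5. Nothing fails and no error is raised, which is exactly why this class of anomaly is hard to notice in testing.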
Some recent work [1] that explores these issues was on display at the 2012 ACM SIGMOD Conference held last week in Phoenix, AZ. At the conference, graduate student Kamal Zellag and his supervisor, Bettina Kemme, of the School of Computer Science at McGill University in Montreal demonstrated ConsAD, a system that measures the number of serialization graph cycles that develop within the application at run time, where a cycle implies a situation involving either stale data, a lost update, or both. A full-length paper [2] presented at last year’s IEEE Data Engineering Conference in Hannover, Germany provides the necessary background; here is the abstract:
While online transaction processing applications heavily rely on the transactional properties provided by the underlying infrastructure, they often choose to not use the highest isolation level, i.e., serializability, because of the potential performance implications of costly strict two-phase locking concurrency control. Instead, modern transaction systems, consisting of an application server tier and a database tier, offer several levels of isolation providing a trade-off between performance and consistency. While it is fairly well known how to identify the anomalies that are possible under a certain level of isolation, it is much more difficult to quantify the amount of anomalies that occur during run-time of a given application. In this paper, we address this issue and present a new approach to detect, in realtime, consistency anomalies for arbitrary multi-tier applications. As the application is running, our tool detect anomalies online indicating exactly the transactions and data items involved. Furthermore, we classify the detected anomalies into patterns showing the business methods involved as well as their occurrence frequency. We use the RUBiS benchmark to show how the introduction of a new transaction type can have a dramatic effect on the number of anomalies for certain isolation levels, and how our tool can quickly detect such problem transactions. Therefore, our system can help designers to either choose an isolation level where the anomalies do not occur or to change the transaction design to avoid the anomalies.
The Java application system described in the paper utilizes Hibernate, the object-relational mapping toolkit from JBoss. ConsAD is in two parts: a “shim”, called ColAgent, that captures application traces and is implemented by modifying the Hibernate library used by the application; and DetAgent, an analysis piece that analyzes the serialization graphs produced by ColAgent to look for anomalies. In their 2011 study, the authors found that the application under test, RUBiS, suffered from anomalies when it used Hibernate’s built-in optimistic concurrency control scheme (termed JOCC in the paper), 2PL using READ COMMITTED, or (even) PostgreSQL’s implementation of snapshot isolation (SI). This graph, from the 2011 ICDE paper, illustrates the frequency of anomalies for the RUBiS “eBay simulation” with all three concurrency-control schemes. Note that in these experiments snapshot isolation consistently offered the fewest anomalies at all benchmark sizes, a characteristic that application architects should study. But SI is not equivalent to serializability, something other authors have written about [4-7], and it still caused low-frequency anomalies during the test.
The graph is instructive in illustrating not only that anomalies occur with all three concurrency control schemes, but also that the frequency of these anomalies increases dramatically with scale. Part of the issue lies with Hibernate’s use of caching; straightforward row references will result in a cache hit, whereas a more complex query involving nested subqueries or joins would execute against the (up-to-date) copies of the row(s) in the database, leading to anomalies with stale data. As such, these results should serve as a warning to application developers using ORM toolkits since it is quite likely that they have little, if any, idea of the update and/or staleness anomalies that their application may encounter when under load.
It would be brilliant if Kamal and Bettina expanded this work to cover application frameworks other than Hibernate, something I discussed with Kamal at length while in Phoenix last week. Hibernate’s mapping model makes this sort of analysis easier than for (say) unrestricted ODBC applications, but such a tool, if it existed, would be very useful in discovering these sorts of anomalies for other types of applications.
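As a rough illustration of what a cycle in a serialization graph means, here is a small Python sketch using the textbook conflict-graph construction. It is not ConsAD’s algorithm, which the papers describe in detail; the schedule format, the function names, and the two example schedules are assumptions made for this example.

```python
from itertools import combinations


def conflict_edges(schedule):
    """Return the edges (Ti, Tj) of the serialization graph for a schedule.

    schedule is a list of (txn, op, item) tuples in execution order, where op
    is "r" or "w". There is an edge Ti -> Tj when an operation of Ti precedes
    a conflicting operation of Tj on the same item (at least one is a write).
    """
    edges = set()
    # combinations() yields pairs in schedule order, so the first op ran first.
    for (t1, op1, item1), (t2, op2, item2) in combinations(schedule, 2):
        if t1 != t2 and item1 == item2 and "w" in (op1, op2):
            edges.add((t1, t2))
    return edges


def has_cycle(edges):
    """Depth-first search for a cycle in the directed graph given by edges."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set()).add(dst)
        graph.setdefault(dst, set())

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: the graph has a cycle
        visiting.add(node)
        found = any(visit(nxt) for nxt in graph[node])
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in graph)


# Both transactions read x before either writes it: the lost update pattern.
LOST_UPDATE = [("T1", "r", "x"), ("T2", "r", "x"), ("T1", "w", "x"), ("T2", "w", "x")]
# The same operations executed serially.
SERIAL = [("T1", "r", "x"), ("T1", "w", "x"), ("T2", "r", "x"), ("T2", "w", "x")]
```

`has_cycle(conflict_edges(LOST_UPDATE))` is true, because T1 and T2 each depend on the other, while the serial schedule yields the single edge T1 to T2 and no cycle.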
[1] K. Zellag and B. Kemme (May 2012). ConsAD: a real-time consistency anomalies detector. In Proceedings of the 2012 ACM SIGMOD Conference, Phoenix, Arizona, pp. 641-644.
[2] K. Zellag and B. Kemme (April 2011). Real-Time Quantification and Classification of Consistency Anomalies in Multi-tier Architectures. In Proceedings of the 27th IEEE Conference on Data Engineering, Hannover, Germany, pp. 613-624.
[3] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O’Neil, and P. O’Neil (May 1995). A critique of ANSI SQL isolation levels. In Proceedings of the ACM SIGMOD Conference, San Jose, California, pp. 1-10.
[4] A. Fekete (January 1999). Serialisability and snapshot isolation. In Proceedings of the Australian Database Conference, Auckland, New Zealand, pp. 201-210.
[5] A. Fekete, D. Liarokapis, E. J. O’Neil, P. E. O’Neil, and D. Shasha (2005). Making snapshot isolation serializable. ACM Transactions on Database Systems 30(2), pp. 492-528.
[6] S. Jorwekar, A. Fekete, K. Ramamritham, and S. Sudarshan (September 2007). Automating the detection of snapshot isolation anomalies. Proceedings of the 33rd International Conference on Very Large Data Bases, Vienna, Austria, pp. 1263-1274.
[7] A. Fekete, E. O’Neil, and P. O’Neil (2004). A read-only transaction anomaly under snapshot isolation. ACM SIGMOD Record 33(3), pp. 12-14.
Posted in: Database interfaces and persistent objects · Hibernate · Performance measurement · SQL Standard
By Glenn Paulley on May 30th, 2012
The Fourth TPC Conference on Performance Evaluation and Benchmarking will be held on August 27, 2012 in Istanbul, collocated with the 2012 VLDB Conference at the Istanbul Hilton Hotel.
Over the last couple of years many new areas of very large database applications have emerged, such as cloud computing and social media, stressing the ability of today’s hardware and software infrastructures. Yet the industry is lacking standard ways to evaluate different infrastructures. As a result, the TPC is conducting a fourth conference, in conjunction with VLDB 2012, to encourage researchers and industry experts to submit novel ideas and methodologies in performance evaluation, measurement, and characterization in the following areas:
* Big Data analytics and infrastructure
* Database Appliances
* Cloud Computing
* In-memory databases
* Social media infrastructure
* Business intelligence
* Complex event processing
* Database optimizations
* Green computing
* Disaster tolerance and recovery
* Energy and space efficiency
* Hardware innovations
* Data Integration
* Hybrid workloads
* Lessons learned in practice using TPC workloads
* Enhancements to TPC workloads
Accepted papers will be published on the TPC website, ACM DL and DBLP, and considered for future TPC benchmark developments.
The keynote speaker at the Conference will be Michael J. Carey of the University of California, Irvine.
Timelines for the Conference are coming up very quickly: paper abstracts are due next Friday, June 8, and full papers are due the following Friday, June 15, 2012.
Posted in: Computer Science education · Performance measurement
By Glenn Paulley on May 29th, 2012
Following two successful conferences, in 2010 at Queen’s University and in 2011 at the University of Toronto, the University of Western Ontario has announced that it will host the third Ontario Celebration of Women in Computing conference on October 12-13, 2012. The conference chair is Hanan Lutfiyya of Western Ontario, and the sponsorship co-ordinator is Wendy Powley from Queen’s.
Female graduate and undergraduate students, faculty, and professionals in industry are all invited to participate in this two-day conference. Proposals for both posters and presentations are now being accepted and are due on or before September 21.
Call for Posters:
The goal of the poster session is to enable informal discussions on each presenter’s research or computing project. Poster presentations foster the sharing of ideas, problems, and results in an informal setting. Posters are an excellent way to convey results and ideas that are not yet complete, or ready to be published in a paper. We encourage posters on computing-related research, or outreach activities to promote ICT to youth. Works at any level of study, including undergraduate projects/summer projects etc. are encouraged.
Call for Presentations:
Presentations will be given as “Lightning Talks”, which are short oral presentations (approx. 5-10 minutes) followed by a question period. Talks on cutting-edge technical issues, original research and on social issues of relevance to undergraduate and graduate women in computing are encouraged. Published or unpublished work is welcome.
Posted in: Computer Science education
By Glenn Paulley on May 11th, 2012
SQL/RPR (Row Pattern Matching), a proposed new part of the ISO SQL standard under development in the United States under ANSI, is now available for initial public review and comment.
Row-pattern matching in the standard is something I have blogged about previously, and has seen some other initiatives in the academic literature as well.
SQL/RPR is all about complex analysis of data streams, though in a SQL context (i.e., over persistent data) rather than over a streaming database system. The big deal is supporting predicates whose expressions can refer to different tuples of an ordered intermediate result simultaneously.
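To give a feel for predicates that span several rows of an ordered result, here is a small Python sketch that finds “V” shapes (a fall followed by a rise) in an ordered list of prices. It is purely illustrative and is not SQL/RPR syntax; the function name, the D/U/S encoding, and the sample prices are all invented for this example.

```python
import re


def find_v_shapes(prices):
    """Return (first_row, last_row) index pairs for each fall-then-rise run.

    Each row is classified against its predecessor as D (down), U (up), or
    S (same); a V shape is then the regular expression D+U+ over that string.
    """
    symbols = "".join(
        "D" if cur < prev else "U" if cur > prev else "S"
        for prev, cur in zip(prices, prices[1:])
    )
    # Symbol i describes the move from row i to row i + 1, so a match over
    # symbols [start, end) covers rows start through end inclusive.
    return [(m.start(), m.end()) for m in re.finditer(r"D+U+", symbols)]


PRICES = [10, 9, 8, 9, 11, 11, 10, 12]
```

For `PRICES` the moves encode as `DDUUSDU`, so `find_v_shapes(PRICES)` returns `[(0, 4), (5, 7)]`: rows 0 through 4 and rows 5 through 7 each form a V. The point is that the condition on each row refers to its neighbour in the ordering, which an ordinary row-at-a-time predicate cannot express.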
Here’s the overview of RPR from the INCITS website:
This standard will specify the syntax and semantics of a new SQL capability to perform complex queries involving the relationships between many rows in a single (virtual or base) table. Detection and use of such relationships are critical aspects of many high-value applications. Sometimes called complex event processing, many business processes are driven from sequences of events. For example, security applications require the ability to detect unusual behavior definable with regular expressions. Financial applications that detect stock patterns are widely demanded. Fraud detection applications must recognize patterns in financial and other transactions. RFID processing requires the ability to recognize valid paths for RFID tags.
Extremely high interest in these capabilities has been shown by financial institutions, by the USA Department of Homeland Security and other government agencies, by large retailers and their suppliers, and transportation companies, among others.
Upon approval of the national SQL/RPR standard, which will be written as an amendment to Part 2 Foundation [FoundFDIS] of [SQL 2008], H2 [sic; should be DM32.2] expects to submit it to ISO/IEC JTC 1 possibly using the Fast-Track process with proposed maintenance in JTC 1/SC32 if approved.
The public review period is from now until June 18, 2012. The draft document is available here currently for a price of US$30. The $30 fee appears to be in error as the intent of the public review was to make the draft of RPR freely available. I hope to have an update on that in the very near future.
Posted in: SQL Standard
By Glenn Paulley on April 16th, 2012
The Fifth International Workshop on Testing Database Systems will be held on 21 May 2012, co-located with the ACM SIGMOD Conference held in Scottsdale, Arizona. The aim of the Workshop is to bring together academics and practitioners to discuss the complexities of testing DBMS systems and applications:
There is significant interest in testing database systems and applications within both the database and the software engineering communities. The goal of DBTest 2012 is to bring together researchers and practitioners from academia and industry to discuss key problems and ideas related to testing database systems and applications. We expect that this collaboration will facilitate the creation of research agendas and new techniques to address testing problems for database systems and database applications. The long-term objective of such work is to reduce the cost and time required to test and tune database products so that users and vendors can spend more time and energy on actual innovations.
The workshop program was finalized over the weekend by workshop co-chairs Eric Lo of Hong Kong Polytechnic and Florian Waas of EMC/Greenplum. The keynote talk by Oege de Moor and the 12 accepted papers, chosen from the 26 submissions, all look interesting and should make for a great workshop.
Posted in: Computer Science education · Self-managing database systems
By Glenn Paulley on March 14th, 2012
Proxy tables, sometimes referred to as Remote Data Access or OMNI, are a convenient way to query or modify tables in different databases all from the same connection. SQL Anywhere’s proxy tables are an implementation of a loosely-coupled multidatabase system. The underlying databases do not have to be SQL Anywhere databases; any data source that supports ODBC will do, so the underlying base table for the proxy can be an Oracle table, a Microsoft SQL Server table, even an Excel spreadsheet. Once the proxy table’s schema is defined in the database’s catalog, the table can be queried just like any other table, as if it were defined as a local table in that database.
That’s the overall idea, anyway; but there are some caveats that get introduced as part of the implementation, and I’d like to speak to one of these in particular. My post is prompted by a question from a longstanding SQL Anywhere customer, Frank Vestjens, who in early February in the NNTP newsgroup sybase.public.sqlanywhere.general queried about the following SQL batch:
DECLARE dd DATE;
DECLARE tt TIME;
DECLARE resultaat NUMERIC;
SET dd = '2012-06-07';
SET tt = '15:45:00.000';
MESSAGE dd + tt TYPE info TO console;
SELECT FIRST Id INTO resultaat
FROM p_mmptankplanning
WHERE arrivalDate + IsNull(arrivaltime,'00:00:00') <= dd+tt
ORDER BY arrivaldate+arrivalTime,departuredate+departureTime;
The batch works fine with a local table p_mmptankplanning but gives an error if the table is a proxy table; the error is “Cannot convert 2012-06-0715:45:00.000 to a timestamp”.
In SQL Anywhere, multidatabase requests are decomposed into SQL statements that are shipped over an ODBC connection to the underlying data source. In many cases, the complete SQL statement can be shipped to the underlying server, something we call “full passthrough mode” as no post-processing is required on the originating server – the server ships the query to the underlying DBMS, and that database system returns the result set which is percolated back to the client. Since the originating server is a SQL Anywhere server, the SQL dialect of the original statement must be understood by SQL Anywhere. If the underlying DBMS isn’t SQL Anywhere, then the server’s Remote Data Access support may make some minor syntactic changes to the statement, or try to compensate for missing functionality in the underlying server.
The SQL statement sent to the underlying DBMS, whether it is processed in full passthrough mode or in partial passthrough mode, is a string. Moreover, SQL Anywhere can ship MERGE statements, among others, to the underlying DBMS, but lacks the ability to ship batches or procedure definitions.
So in the query above, the problem is that the query refers to the date/time variables dd and tt, and uses the operator + to combine them into a TIMESTAMP. Since SQL Anywhere lacks the ability to ship an SQL batch, what gets shipped to the underlying DBMS server is the SQL statement
SELECT FIRST Id INTO resultaat
FROM p_mmptankplanning
WHERE arrivalDate + IsNull(arrivaltime,'00:00:00') <= '2012-06-07' + '15:45:00.000'
ORDER BY arrivaldate+arrivalTime,departuredate+departureTime;
and now the problem is more evident: in SQL Anywhere, the ‘+’ operator is overloaded to support operations both on date/time types and on strings; with strings, ‘+’ is string concatenation. When the statement above gets sent to the underlying SQL Anywhere server, it concatenates the two date/time strings to form the string '2012-06-0715:45:00.000' (note no intervening blank) and this leads directly to the conversion error. Robust support for SQL batches would solve the problem, but we have no plans to introduce such support at this time. A workaround is to compose the desired TIMESTAMP outside the query, so that when converted to a string the underlying query will give the desired semantics. However, even in that case care must be taken to make sure that the DATEFORMAT option settings are compatible across the servers involved.
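The effect of the overloaded ‘+’ can be mimicked outside the database. This Python sketch only models the string handling, not SQL Anywhere itself; the function names and the `TIMESTAMP_FORMAT` constant are assumptions chosen for the example, and the actual format depends on the servers’ option settings.

```python
from datetime import datetime

# Assumed timestamp format, for illustration only.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def shipped_with_plus(dd, tt):
    """What the remote server evaluates: two string literals joined by '+'."""
    return dd + tt  # string concatenation, with no separating blank


def shipped_as_timestamp(dd, tt):
    """Workaround: compose the timestamp first, then ship it as one string."""
    ts = datetime.strptime(f"{dd} {tt}", TIMESTAMP_FORMAT)
    return ts.strftime(TIMESTAMP_FORMAT)


def parses_as_timestamp(text):
    """Return True if text can be converted using TIMESTAMP_FORMAT."""
    try:
        datetime.strptime(text, TIMESTAMP_FORMAT)
        return True
    except ValueError:
        return False
```

`shipped_with_plus("2012-06-07", "15:45:00.000")` produces `2012-06-0715:45:00.000`, which does not parse, whereas the string from `shipped_as_timestamp` does. The second half of the caveat shows up here too: the workaround is only safe when both sides agree on the format.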
My thanks to my colleague Karim Khamis for his explanations of Remote Data Access internals.
Posted in: SQL Anywhere
By Glenn Paulley on March 13th, 2012
Yesterday the first release candidate of NHibernate 3.3.0 was made available; you can find it here on sourceforge.net.
In addition to the numerous bug fixes for NHibernate’s LINQ support, of interest to SQL Anywhere users will be some bug fixes and changes for SQL Anywhere 12.0.1 that I have described previously in this forum. These changes include:
- A fix to the type mappings in SybaseSQLAnywhere10Dialect for binary data to
- A new driver module, NHibernate.Driver.SybaseSQLAnywhereDotNet4Driver, linked to the SybaseSQLAnywhere12Dialect, to handle the renaming of the assemblies for the SQL Anywhere ADO.NET provider in SQL Anywhere version 12. Unfortunately, specifying a different assembly name for a .NET provider isn’t a configuration parameter in NHibernate; the driver name is embedded directly in the NHibernate library.
Thanks to Julian Maughan for committing the changes to the NHibernate distribution.
Posted in: NHibernate
By Glenn Paulley on February 29th, 2012
I don’t usually “re-tweet” someone else’s blog post. There are enough bits of data flying through the Internet that I don’t need to duplicate any more of them. At least, no more than absolutely necessary.
However, today I am going to make an exception and draw your attention to a recent post by Amazon Web Services VP James Hamilton entitled “Observations on Errors, Corrections, & Trust of Dependent Systems”. I have known James for the better part of twenty years and his writing on infrastructure efficiency, reliability, and scaling makes for compelling reading. If you’re not reading James’ blog, Perspectives, on a regular basis, you should be. In part, here’s what James had to say about the need for ECC memory in both client and server systems:
The immediate lesson is you absolutely do need ECC in server application[sic] and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
James’ post is timely because this week I was asked by SQL Anywhere Product Manager Eric Farrar to respond to a request from an OEM hardware infrastructure manufacturer for feedback regarding future product designs.
Other than the obvious “cheaper and faster” my wish is for one thing: robustness.
As James describes in his post, at scale, “hardware” failures are rife, causing errors, logical and physical data corruptions, system outages, crashes, you name it. I placed “hardware” in quotation marks deliberately because today’s disk and flash memory hardware contains a vast quantity of software as well; the microcode for the filesystem on compact flash (CF) and SD is OEM’d and consists of thousands of lines of code. ECC correction would be nice, not only for flash or traditional magnetic media but also for RAM, as James notes. Yet even with ECC correction, things aren’t that rosy, particularly with “commodity” hardware. Consider the abstract of this IEEE paper [1] from Remzi Arpaci-Dusseau’s storage group at the University of Wisconsin:
We use type-aware pointer corruption to examine Windows NTFS and Linux ext3. We find that they rely on type and sanity checks to detect corruption, and NTFS recovers using replication in some instances. However, NTFS and ext3 do not recover from most corruptions, including many scenarios for which they possess sufficient redundant information, leading to further corruption, crashes, and unmountable file systems. We use our study to identify important lessons for handling corrupt pointers.
I have written about storage stack corruption at various times in the past – see here and here. That corruption need not be permanent to cause problems: logical (data) corruption caused by transient failures can be just as bad as corruption caused by permanent ones. James’ point is that corruption detection and mitigation need to take place in all system components, including RAM.
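As a minimal illustration of corruption detection in software, the following Python sketch stores a CRC-32 checksum with each page and verifies it on read. This is a generic technique, not a description of how SQL Anywhere or any particular file system works, and unlike ECC it only detects a flipped bit rather than correcting it. The page layout and function names are hypothetical.

```python
import zlib


def write_page(payload):
    """Store a page together with a CRC-32 checksum of its contents."""
    return {"payload": bytes(payload), "crc": zlib.crc32(payload)}


def page_is_intact(page):
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(page["payload"]) == page["crc"]


def flip_bit(page, byte_index, bit_index):
    """Simulate a transient single-bit corruption of the stored payload."""
    damaged = bytearray(page["payload"])
    damaged[byte_index] ^= 1 << bit_index
    return {"payload": bytes(damaged), "crc": page["crc"]}
```

A page built with `write_page(b"customer row data")` passes `page_is_intact`, while the same page after `flip_bit(page, 3, 5)` does not, so the reader at least learns that the data cannot be trusted.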
All of this is bad enough. Yet the situation isn’t helped by the lack of standards in this area. We know from the experience of our customers that various systems fail to meet expected behaviour with respect to I/O semantics – these behaviours are, sometimes, deliberately changed in the name of better performance, but at the expense of robustness. SQL Anywhere customers are well advised to read this whitepaper entitled “SQL Anywhere I/O Requirements for Windows and Linux” for background information on what I/O semantics your server systems must support.
[1] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift (June 2008). Analyzing the Effects of Disk Pointer Corruption. In Proceedings of the International Conference on Dependable Systems and Networks, Anchorage, Alaska, pp. 502-511.
Posted in: Hardware · Operating systems