I don’t usually “re-tweet” someone else’s blog post. There are enough bits of data flying through the Internet that I don’t need to duplicate any more of them. At least, no more than absolutely necessary.
However, today I am going to make an exception and draw your attention to a recent post by Amazon Web Services VP James Hamilton entitled “Observations on Errors, Corrections, & Trust of Dependent Systems”. I have known James for the better part of twenty years and his writing on infrastructure efficiency, reliability, and scaling makes for compelling reading. If you’re not reading James’ blog, Perspectives, on a regular basis, you should be. In part, here’s what James had to say about the need for ECC memory in both client and server systems:
The immediate lesson is you absolutely do need ECC in server application[sic] and it is just about crazy to even contemplate running valuable applications without it. The extension of that learning is to ask what is really different about clients? Servers mostly have ECC but most clients don’t. On a client, each of these corrections would instead be a corruption. Client DRAM is not better and, in fact, often is worse on some dimensions. These data corruptions are happening out there on client systems every day. Each day client data is silently corrupted. Each day applications crash without obvious explanation. At scale, the additional cost of ECC asymptotically approaches the cost of the additional memory to store the ECC. I’ve argued for years that Microsoft should require ECC for Windows Hardware Certification on all systems including clients. It would be good for the ecosystem and remove a substantial source of customer frustration. In fact, it’s that observation that leads most embedded systems parts to support ECC. Nobody wants their car, camera, or TV crashing. Given the cost at scale is low, ECC memory should be part of all client systems.
James’ post is timely because this week I was asked by SQL Anywhere Product Manager Eric Farrar to respond to a request from an OEM hardware infrastructure manufacturer for feedback regarding future product designs.
Other than the obvious “cheaper and faster” my wish is for one thing: robustness.
As James describes in his post, at scale, “hardware” failures are rife, causing errors, logical and physical data corruptions, system outages, crashes, you name it. I placed “hardware” in quotation marks deliberately because today’s disk and flash memory hardware contains a vast quantity of software as well; the microcode for the filesystem on compact flash (CF) and SD is OEM’d and consists of thousands of lines of code. ECC protection would be nice, not only for flash or traditional magnetic media but also for RAM, as James notes. Yet even with ECC, things aren’t that rosy, particularly with “commodity” hardware. Consider the abstract of this IEEE paper [1] from Remzi Arpaci-Dusseau’s storage group at the University of Wisconsin:
We use type-aware pointer corruption to examine Windows NTFS and Linux ext3. We find that they rely on type and sanity checks to detect corruption, and NTFS recovers using replication in some instances. However, NTFS and ext3 do not recover from most corruptions, including many scenarios for which they possess sufficient redundant information, leading to further corruption, crashes, and unmountable file systems. We use our study to identify important lessons for handling corrupt pointers.
I have written about storage stack corruption at various times in the past – see here and here. A failure need not be permanent to cause problems: logical (data) corruption caused by a transient failure can be just as bad as corruption caused by a permanent one. James’ point is that corruption detection and mitigation need to take place in all system components, including RAM.
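To make “corruption detection” concrete, here is a minimal sketch of one common technique: storing a checksum alongside each page so that a flipped bit is noticed when the page is read back. This is an illustration only, not how SQL Anywhere or any particular file system does it; the page size, the trailing-CRC layout, and the function names are all assumptions of mine.

```python
import struct
import zlib

PAGE_SIZE = 4096   # hypothetical page size, chosen for illustration
CRC_SIZE = 4       # bytes reserved at the end of each page for a CRC32


def seal_page(payload: bytes) -> bytes:
    """Pad the payload to fill the page body and append a CRC32 of that body."""
    body_len = PAGE_SIZE - CRC_SIZE
    if len(payload) > body_len:
        raise ValueError("payload too large for page")
    body = payload.ljust(body_len, b"\x00")
    return body + struct.pack("<I", zlib.crc32(body))


def verify_page(page: bytes) -> bool:
    """Return True only if the page has the right size and its CRC32 matches."""
    if len(page) != PAGE_SIZE:
        return False
    body = page[:-CRC_SIZE]
    (stored,) = struct.unpack("<I", page[-CRC_SIZE:])
    return zlib.crc32(body) == stored


def flip_bit(page: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit error, such as an uncorrected memory or media fault."""
    damaged = bytearray(page)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)
```

Calling `verify_page(flip_bit(seal_page(b"row data"), 100))` returns `False`, because a CRC32 catches any single-bit change. Note that a checksum like this only detects the damage; recovering from it still requires redundant data somewhere else, which is exactly where the Wisconsin study found NTFS and ext3 lacking.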
All of this is bad enough. Yet the situation isn’t helped by the lack of standards in this area. We know from the experience of our customers that various systems fail to meet expected behaviour with respect to I/O semantics – these behaviours are, sometimes, deliberately changed in the name of better performance, but at the expense of robustness. SQL Anywhere customers are well advised to read this whitepaper entitled “SQL Anywhere I/O Requirements for Windows and Linux” for background information on what I/O semantics your server systems must support.
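As one example of the kind of I/O semantic at stake (my choice of example, not a summary of the whitepaper), consider the guarantee that a flushed write has actually reached stable storage. The sketch below uses Python’s standard `os.write` and `os.fsync`; the function name `durable_write` is hypothetical.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path, then ask the OS to flush it to stable storage."""
    # O_BINARY exists only on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Request that buffered data be pushed to the device. Whether the
        # device honours the request is outside the application's control.
        os.fsync(fd)
    finally:
        os.close(fd)
```

The application can only ask. If a driver, controller, or drive acknowledges the flush while the data still sits in a volatile cache, the call returns successfully and the robustness guarantee is silently lost, which is the trade of robustness for performance described above.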
[1] Lakshmi N. Bairavasundaram, Meenali Rungta, Nitin Agrawal, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, and Michael M. Swift (June 2008). Analyzing the Effects of Disk Pointer Corruption. In Proceedings of the International Conference on Dependable Systems and Networks, Anchorage, Alaska, pp. 502-511.

Glenn Paulley is a Director of Engineering at Sybase iAnywhere.
