In November I attended a lecture at the University of Waterloo, part of the UW database seminar series sponsored by Sybase iAnywhere, given by Remzi Arpaci-Dusseau and entitled File systems are broken (and what we’re doing to fix them). Remzi, his colleague (and spouse) Andrea, and their graduate students and fellow faculty at the University of Wisconsin operate the ADSL Laboratory, whose mandate is to study issues with physical data storage. Remzi and his colleagues have authored various papers on the reliability of the storage stack. These cover not simply hard media failures but also disk corruption stemming from causes that range from software bugs to transient media failures, and how such failures can become catastrophic depending on the device drivers, file system, and operating system in use.
In the studies, Remzi’s graduate students would introduce artificial, pseudo-random errors in a software shim at a point in the I/O stack, and then track what happened. The types of errors introduced included (virtual) read and/or write errors to various file system components: ordinary data blocks, inode blocks, and so on. They would then categorize the types of failures and compare the results across media types, file systems, and operating systems.
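To make the idea of an error-injecting shim concrete, here is a minimal Python sketch. It is an illustration only, not the ADSL group's tooling: the class name `FaultInjectingShim`, the in-memory dictionary standing in for a disk, and the single `fail_rate` knob are all assumptions made for the example. The shim sits between the caller and the "disk", fails requests pseudo-randomly from a seeded generator so that a run can be repeated, and tallies what it injected by operation and block type.

```python
import random
from collections import Counter


class InjectedIOError(IOError):
    """Raised by the shim in place of a real device error."""


class FaultInjectingShim:
    """Hypothetical shim between a 'file system' and an in-memory 'disk'."""

    def __init__(self, blocks, block_types, fail_rate, seed=0):
        self.blocks = blocks            # block number -> bytes
        self.block_types = block_types  # block number -> "data", "inode", ...
        self.fail_rate = fail_rate      # probability that any one request fails
        self.rng = random.Random(seed)  # seeded, so a run can be repeated
        self.injected = Counter()       # (operation, block type) -> count

    def _maybe_fail(self, op, block_no):
        # Decide pseudo-randomly whether this request gets an injected error.
        if self.rng.random() < self.fail_rate:
            kind = self.block_types.get(block_no, "data")
            self.injected[(op, kind)] += 1
            raise InjectedIOError(f"injected {op} error on {kind} block {block_no}")

    def read(self, block_no):
        self._maybe_fail("read", block_no)
        return self.blocks[block_no]

    def write(self, block_no, payload):
        # A failed write leaves the old contents in place.
        self._maybe_fail("write", block_no)
        self.blocks[block_no] = payload


if __name__ == "__main__":
    disk = FaultInjectingShim({0: b"inode table", 1: b"file contents"},
                              {0: "inode", 1: "data"}, fail_rate=0.3, seed=42)
    for block_no in (0, 1, 0, 1):
        try:
            disk.read(block_no)
        except InjectedIOError as err:
            print("caller saw:", err)
    print(dict(disk.injected))
```

The interesting part of a study like this is what the code above the shim does with each `InjectedIOError`: whether it notices, retries, propagates the error, or silently carries on.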
Frankly, I was startled at the results. Some general trends uncovered by their analysis: SCSI disks, though more expensive, have a longer mean time between failures (MTBF) than cheaper ATA drives, so the “you get what you pay for” adage appears to hold. More surprising to me was the general lack of robustness in the file systems that were studied. As an example, Arpaci-Dusseau and his team found that the EXT3 file system, in common use on Linux systems, does virtually no detection of write failures and does not retry failed writes. That makes EXT3 more susceptible to transient media failures: the error will be detected only when that block is subsequently read. Other file systems such as JFS fared considerably better, but failures to detect errors are present in JFS as well. The studies make for interesting reading.
I mention this now because of a conversation yesterday with Peter Bumbulis, who has installed OpenSolaris on one of his home machines with the ZFS file system, which was developed at Sun Microsystems by Jeff Bonwick and his team. Jeff’s blog makes interesting reading.
ZFS is a copy-on-write, transactional file system that offers a variety of robustness features along with some interesting self-management features. ZFS supports mirroring efficiently, and contains self-healing algorithms that use the mirrored data to correct corruption automatically, without user intervention. Disks can be dynamically added to a ZFS “pool” and made available immediately for use, even in a striped RAID configuration and even if the disks are heterogeneous: ZFS automatically alters the striping pattern to suit individual spindle performance, and a single ZFS system command includes the new (or replaced) disk in the pool. Slick.
This level of self-management is truly compelling; I applaud Jeff and his team for what they’ve achieved.
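The self-healing read on a mirror is easy to sketch. The Python below is a toy model and not a description of ZFS internals: it assumes that corruption is detected by comparing each copy against a checksum kept apart from the data, and the class name `SelfHealingMirror`, the use of SHA-256, and the dictionaries standing in for disks are all choices made for this example.

```python
import hashlib


def checksum(data):
    """Checksum used to validate a block (SHA-256 chosen for the example)."""
    return hashlib.sha256(data).digest()


class SelfHealingMirror:
    """Toy two-way mirror; the dicts stand in for disks."""

    def __init__(self):
        self.disks = [{}, {}]  # two copies of every block
        self.sums = {}         # block number -> checksum, kept apart from the data
        self.repairs = 0       # how many bad copies have been rewritten

    def write(self, block_no, data):
        for disk in self.disks:
            disk[block_no] = data
        self.sums[block_no] = checksum(data)

    def read(self, block_no):
        expected = self.sums[block_no]
        # Find the first copy whose contents still match the stored checksum.
        good = next((d[block_no] for d in self.disks
                     if checksum(d[block_no]) == expected), None)
        if good is None:
            raise IOError(f"block {block_no}: no copy matches its checksum")
        # Overwrite any copy that fails verification with the good data.
        for disk in self.disks:
            if checksum(disk[block_no]) != expected:
                disk[block_no] = good
                self.repairs += 1
        return good
```

A caller that writes a block, has one copy silently corrupted, and then reads it back gets the correct data, and the damaged copy is rewritten as a side effect of the read. If every copy fails verification the read raises an error rather than returning bad data.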
Given the robustness issues reported with the flash SSD devices now becoming available, ZFS may offer a level of data integrity insurance that is otherwise unavailable.

Glenn Paulley is a Director of Engineering at Sybase iAnywhere.
