Martin K. Petersen has been involved in Linux development since the early nineties. He has worked on PA-RISC and IA-64 Linux ports for HP as well as the XFS filesystem and the Altix kernel for SGI. Martin works in Oracle's Linux Kernel Engineering group where he focuses on enterprise storage technologies.
A recent addition to the SCSI protocol called DIF (Data Integrity Field) allows protection information to be exchanged between controller and disk. Each sector of the I/O can be augmented with a blob of integrity metadata which includes a checksum and some tags that ensure that each portion of the I/O is written in order and to the right place on disk. The SCSI controller generates the integrity metadata, and disk arrays and hard drives can verify that the I/O is intact before they write it and acknowledge the request.
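As a rough illustration of what such per-sector metadata can look like, the C sketch below assumes the 8-byte layout commonly described for T10 DIF: a 16-bit guard tag holding a CRC of the sector data, a 16-bit application tag, and a 32-bit reference tag. The struct and function names are hypothetical, and the 512-byte sector size and the use of the low 32 bits of the target sector number as the reference tag are assumptions made for the example. It is not the kernel's or any controller's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512 /* assumption: 512-byte sectors */

/* Hypothetical in-memory view of the 8 bytes of protection information
 * that accompany each sector. All fields are stored big-endian. */
struct dif_tuple {
    uint8_t guard_tag[2]; /* CRC of the sector data */
    uint8_t app_tag[2];   /* opaque to the storage device; unused here */
    uint8_t ref_tag[4];   /* assumed: low 32 bits of the target sector number */
};

/* Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection. */
uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
    uint16_t crc = 0;

    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)(buf[i] << 8);
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 0x8000)
                crc = (uint16_t)((crc << 1) ^ 0x8BB7);
            else
                crc = (uint16_t)(crc << 1);
        }
    }
    return crc;
}

/* Fill in the tuple for one sector destined for sector number `lba`. */
void dif_generate(struct dif_tuple *t, const uint8_t *sector, uint64_t lba)
{
    uint16_t crc = crc16_t10dif(sector, SECTOR_SIZE);
    uint32_t ref = (uint32_t)lba;

    t->guard_tag[0] = (uint8_t)(crc >> 8);
    t->guard_tag[1] = (uint8_t)(crc & 0xff);
    t->app_tag[0] = 0;
    t->app_tag[1] = 0;
    t->ref_tag[0] = (uint8_t)(ref >> 24);
    t->ref_tag[1] = (uint8_t)(ref >> 16);
    t->ref_tag[2] = (uint8_t)(ref >> 8);
    t->ref_tag[3] = (uint8_t)(ref & 0xff);
}

/* Returns 0 if the tuple matches, -1 if the data was corrupted (guard
 * mismatch), -2 if the sector is headed for the wrong place (ref mismatch). */
int dif_verify(const struct dif_tuple *t, const uint8_t *sector, uint64_t lba)
{
    uint16_t crc = crc16_t10dif(sector, SECTOR_SIZE);
    uint32_t ref = (uint32_t)lba;

    if (t->guard_tag[0] != (uint8_t)(crc >> 8) ||
        t->guard_tag[1] != (uint8_t)(crc & 0xff))
        return -1;
    if (t->ref_tag[0] != (uint8_t)(ref >> 24) ||
        t->ref_tag[1] != (uint8_t)(ref >> 16) ||
        t->ref_tag[2] != (uint8_t)(ref >> 8) ||
        t->ref_tag[3] != (uint8_t)(ref & 0xff))
        return -2;
    return 0;
}
```

In this sketch the guard tag catches corrupted data while the reference tag catches a sector that is intact but misdirected, which is the distinction between "intact" and "written to the right place" drawn above.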
Oracle has worked with several industry partners to extend this capability so that it is now possible to transfer the integrity metadata to and from system memory. This allows the integrity metadata to be prepared by, for example, the filesystem and submitted with the request to the lower layers. The core of the data integrity infrastructure was included in the upstream kernel during the 2.6.27 merge window.
The next step is figuring out how to expose the integrity metadata to user applications. Ideally, the checksums would be generated in close proximity to the original data, within the user application context, and the application (or library) would know what to do if the I/O subsystem returned an error.
The traditional UNIX programming API emerged from the following usage model:
cat foo | frob | mangle > bar
An error would break the pipeline, prompting the operator at the teletype to fix the problem and reissue the commands. Each program did one simple thing and its lifetime was short. Consequently, returning -EIO and deferring error handling to the operating system/operator was reasonable behavior.
These days applications run for days/weeks/months/years and few of them are simple filters working on byte streams. Furthermore, the data input into these programs is no longer a reproducible result of the preceding command in the pipeline. The data is generated within the context of the application using input from the user and the network. This means that if committing the write to permanent storage fails for whatever reason, the data is essentially lost. It has become imperative to ensure that data is safely and correctly written.
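To make the point concrete, here is a minimal sketch of roughly what a careful application can do today with the conventional calls: loop until every byte is accepted, then ask for the data to be flushed to stable storage with fsync(). The helper name write_durably() is hypothetical. Even when done diligently, all the caller gets back on failure is an errno value, with no indication of which part of the data was affected or whether it reached the disk intact.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write the whole buffer to fd and flush it to
 * stable storage. Returns 0 on success or a negative errno value. */
int write_durably(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -errno;      /* e.g. -EIO: the caller must decide what to do */
        }
        p += n;                 /* short write: advance and keep going */
        len -= (size_t)n;
    }

    /* write() succeeding only means the kernel accepted the data;
     * fsync() asks for it to be committed to the device. */
    if (fsync(fd) < 0)
        return -errno;

    return 0;
}
```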
Given the limitations of POSIX read()/write() and friends, how do we implement a more rugged and data integrity-aware I/O submission interface? Preferably one that's close enough to the existing API that it stands a chance of becoming successful? The POSIX async I/O API does provide some of the required features, but hardly anybody uses it, and Linux implements it poorly. So where do we go from here?