The Future of Linux Storage

We want to make Linux's storage subsystems even better.


We have two presenters: Margaret Susairaj of Oracle and Michael Meeks of Novell.

Margaret Susairaj works in the Server Technologies group at Oracle. She has been playing with the Oracle Server I/O stack for the last 10 years. Her main focus is on supporting new I/O technologies and tuning the existing I/O stack for better performance. Prior to this, she worked on Oracle scalability issues when running on Sequent's SMP/NUMA systems with respect to synchronization and caching policies.

Michael is a Christian and enthusiastic believer in Free software. He very much enjoys working for Novell where as a member of the Desktop research team he has worked on desktop infrastructure and applications, particularly OpenOffice.org, CORBA, Bonobo, Nautilus and accessibility, amongst other interesting things. He now works as an Architect, trying to understand and nudge the direction of our Linux Desktop work. Prior to this he worked for Quantel gaining expertise in real time AV editing and playback achieved with high performance focused hardware / software solutions.


Michael Meeks will start us off with a half-hour presentation on a tool he has written called iogrind.

After that, Margaret and Michael will alternate, presenting interesting problems in storage for our consideration.

Here are some sample problems (and sometimes sample solutions) to give you an idea about what might be discussed.

Has anything changed in this directory?


The use case is “Do I need to update this icon cache?” or “Do I need to reindex these files?”. At the moment, userspace has to stat every file. A cheaper way to check the mtimes of the files in a directory would be a good thing. Inotify is not a solution to this, as the machine may have been rebooted and the files modified while the desktop wasn't running.

Solution A:

NFSv3 added a READDIRPLUS operation that returns type, mode, nlink, uid, gid, size, rdev, atime, mtime and ctime. readdir() currently returns inode, type (file/directory/symlink/etc) and name. We could add a sys_getdentsplus() that returns all this information.

  • This may make the filesystem do a lot of work. An ext2-type filesystem will have to pull the inode in from storage to get all this information. The savings may not be as large as we might hope (exchanging thousands of quick syscalls for dozens of slow ones isn't necessarily a win). — Matthew Wilcox 2008/07/15 14:23
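To make the cost concrete, here is a minimal sketch of what userspace does today: one stat per directory entry. The function names newest_mtime() and needs_update() are made up for this illustration, and the cache timestamp is assumed to be stored by the application itself.

```python
import os


def newest_mtime(directory):
    """Return the latest mtime (in ns) among the directory and its direct entries.

    This is the costly pattern described above: one stat call per entry.
    """
    latest = os.stat(directory).st_mtime_ns
    with os.scandir(directory) as entries:
        for entry in entries:
            # readdir gives us only name, inode and type, so the mtime
            # needs a separate stat (lstat semantics here) per entry.
            latest = max(latest, entry.stat(follow_symlinks=False).st_mtime_ns)
    return latest


def needs_update(directory, cache_mtime_ns):
    """True if the directory or any direct entry is newer than the cache."""
    return newest_mtime(directory) > cache_mtime_ns
```

A getdentsplus-style call would let the loop read the mtime straight from the directory listing, but as noted above, the filesystem may still have to fetch every inode to answer it.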

Want notifications on huge numbers of files


The kernel limits on the number of inotify watches that can be set are too restrictive.

Solution A:

The inotify structure is relatively fat and one is required per inode. We could set one per directory and check whether the parent directory has a watch set for its children.

Solution B:

We could avoid allocating per inode by creating notification filters in the kernel that will receive notification of all events and pass only the requested ones to userspace. This might work well when userspace wants to monitor an entire directory tree. (what about namespaces?)
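As a rough userspace model of the filtering idea in Solution B (not kernel code, and not an existing interface), the sketch below treats events as (path, mask) pairs and forwards only those under a requested directory tree. The names make_subtree_filter() and filter_events() are hypothetical, and paths are assumed to be absolute.

```python
import os


def make_subtree_filter(roots):
    """Build a predicate that accepts absolute paths inside any of `roots`."""
    roots = [os.path.normpath(r) for r in roots]

    def wanted(path):
        path = os.path.normpath(path)
        # A path is inside a root if their common prefix is the root itself.
        return any(os.path.commonpath([path, r]) == r for r in roots)

    return wanted


def filter_events(events, wanted):
    """Keep only the (path, mask) events the listener asked for."""
    return [(path, mask) for path, mask in events if wanted(path)]
```

The point of the design is that the listener registers one filter for a whole tree rather than one watch per inode; the namespace question raised above is not addressed by this model.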

It takes a long time to open a lot of files

Problem: It takes a long time to open/close a large number of files without launching multiple threads to do them in parallel.

Solution A: Provide a list interface to issue file open/close operations.

Solution B: Use syslets/threadlets.

Doing discontiguous I/O is inefficient

Problem: There is no batched synchronous I/O interface in Linux. When applications need to do I/O to a set of discontiguous blocks on disk, they have to use io_submit() to submit the blocks and then wait for them to complete using io_getevents(). There is no single interface that will get the entire I/O list done.

Solution: Extend io_submit to do a batched synchronous I/O operation.

AIO requests must be allocated statically

Problem: Applications need to reserve a large number of aio ring buffer entries at startup, since the ring cannot be resized on demand. The number of concurrent aio requests a process issues often changes during its lifetime, so the ability to grow or shrink the ring as needed would make better use of kernel resources. Better still would be for the kernel to do this automatically.

Solution A: Allow aio ring buffers to dynamically grow/shrink without destroying the original aio context. Provide an io_queue_resize() interface.

Solution B: Allocate the ring buffer as needed without requiring a hint from the user process.


The iogrind talk, with questions, took 45 minutes, then Margaret presented her top three problems and a rousing debate was had by all. Then we broke for fifteen minutes, came back and debated a proposal from Jan Kara for keeping track of whether any files have changed in a directory hierarchy. After that, Margaret presented two more problems and our time was up.

Problem 1

Margaret reports it can take minutes to open 4000 files. Often the 'files' being opened are actually block devices and it can take a while to negotiate with each one. The audience were sceptical that there was much parallelism to be gained and asked for data that showed, e.g., the time taken for ten threads to open 400 files each versus one thread opening 4000 files. Various interfaces were proposed including:

  • an open_list() which took an array of names to open and returned an array of fds
  • a new open flag (O_ASYNCOPEN) that would allow the kernel to internally spawn a thread to do name resolution and actually connect the fd to anything. Use of the fd would block until it was ready. Many syscalls would have to be able to return new errors such as ENOENT, EISDIR, and any other error that open() could have returned.
  • a new syscall openasync() that would behave as above
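The measurement the audience asked for could be gathered with something like the sketch below, which times one thread opening every file against several threads each opening a slice. It is only an illustration: it uses freshly created regular files in a temporary directory, which will not show the device negotiation delays Margaret described, and the defaults are kept small so that holding every fd open stays under a typical RLIMIT_NOFILE of 1024. All function names are invented for this example.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def open_all(paths):
    """Open every path read-only and return the file descriptors."""
    return [os.open(p, os.O_RDONLY) for p in paths]


def open_parallel(paths, workers):
    """Split paths into `workers` slices and open each slice in its own thread."""
    slices = [paths[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(open_all, slices))
    return [fd for fds in results for fd in fds]


def timed(fn, *args):
    """Run fn, close the fds it returns, and report (count, seconds)."""
    start = time.perf_counter()
    fds = fn(*args)
    elapsed = time.perf_counter() - start
    for fd in fds:
        os.close(fd)
    return len(fds), elapsed


def compare(n_files=400, workers=10):
    """Return ((count, seconds) serial, (count, seconds) parallel)."""
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(n_files):
            p = os.path.join(d, f"f{i}")
            open(p, "w").close()
            paths.append(p)
        serial = timed(open_all, paths)
        parallel = timed(open_parallel, paths, workers)
    return serial, parallel
```

To reproduce the case discussed in the session, the temporary files would be replaced with the real device paths and the fd limit raised accordingly.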

Problem 2

This was a request to implement an interface that allows for submitting many IOs to potentially different files and waiting for them all to complete. An HPC group has an interface proposal called readx() and writex(). Linus suggested using sys_readahead() to kick the operations off and then sys_read() to be sure they've finished. The use of O_DIRECT means this possibly won't work well today, but Linus seems quite happy to enhance sys_readahead(). This also doesn't solve the write-side.
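The "kick off, then read" pattern Linus described can be sketched from userspace as follows. This is an approximation under stated assumptions: Python does not expose readahead(2), so the sketch substitutes posix_fadvise() with POSIX_FADV_WILLNEED, which likewise asks the kernel to start fetching a range into the page cache. It handles a single file, uses buffered rather than O_DIRECT I/O, and read_extents() is a name made up for this example.

```python
import os


def read_extents(path, extents):
    """Read a list of (offset, length) extents from one file.

    First hint every extent to the kernel, then read them in order.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if hasattr(os, "posix_fadvise"):
            for offset, length in extents:
                # Hint only: asks the kernel to begin reading this range
                # into the page cache without waiting for it.
                os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
        # The reads block until the data is there; ranges already fetched
        # by the hints above are served from the page cache.
        return [os.pread(fd, length, offset) for offset, length in extents]
    finally:
        os.close(fd)
```

As the notes above say, this covers only the read side, and pread() may return fewer bytes than requested near end of file.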

Problem 3

The command ring for aio must be sized at startup and cannot be resized. This topic did not attract much interest from the attendees and we moved onward swiftly.

Problem 4

File metadata operations take a long time on ext3 when an fsync is in progress. It was explained that this is simply a bug in ext3, as it requires the entire journal to be committed when an fsync is in progress, and this can take a very long time. Various proposals were made for customers to use data=writeback instead of data=ordered, and bugfixes were proposed to ext3 where it could detect this problem and write the journal more frequently. It was also suggested that database customers really should be using ext2, not ext3.

Problem 5

After the break, Michael showed a pretty graph showing how much we seek. He suggested that we need an interface which tells userspace which order blocks are in on disc and then it can read them in the right order. FIEMAP was mentioned as a solution to this.

Problem 6

Jan Kara then stepped up to talk about file change notification. He has a solution which involves a new timestamp on directories and a flag to indicate interest from a userspace application. He has a presentation which explains how this works in great detail. The attendees kicked the idea around a bit and found some easily-correctable nits. There seemed to be general consensus that this was an improvement over inotify. A suggestion was made that we should use the netlink interface to notify apps of file changes, but this was deemed too heavyweight by the other attendees.

Problem 7

Margaret wants some unified way to wait for any kind of event. We decided the real problem is that aio doesn't have an fd interface on which you can select()/poll(). We also kicked around the idea of extending epoll to give AIO notifications.
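To illustrate what an fd interface for completions buys, the sketch below fakes an asynchronous request with a worker thread that writes one byte to a pipe when its read finishes. The main thread then waits on that pipe with the ordinary selectors module (epoll on Linux), exactly as it would for a socket. This is a stand-in and not the kernel aio interface; the function names and the "io-complete" tag are invented for the example, and a Unix-like system is assumed.

```python
import os
import selectors
import threading


def background_read(path, notify_fd, result):
    """Stand-in for an async I/O request: read the file, then signal completion."""
    with open(path, "rb") as f:
        result["data"] = f.read()
    os.write(notify_fd, b"\x01")  # one byte wakes up the event loop


def wait_for_events(path, timeout=5.0):
    """Wait on a completion fd with the same call used for any other fd."""
    read_end, write_end = os.pipe()
    result = {}
    worker = threading.Thread(
        target=background_read, args=(path, write_end, result)
    )
    sel = selectors.DefaultSelector()
    sel.register(read_end, selectors.EVENT_READ, data="io-complete")
    seen = []
    try:
        worker.start()
        for key, _mask in sel.select(timeout):
            os.read(key.fd, 1)  # drain the wake-up byte
            seen.append(key.data)
        worker.join()
    finally:
        sel.close()
        os.close(read_end)
        os.close(write_end)
    return seen, result.get("data")
```

Sockets, timers or other fds could be registered with the same selector, which is the unification being asked for.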

At this point, we ran out of time. Margaret has many more problems in her presentation which will be uploaded shortly.