This motivated Suparna Bhattacharya, Badari Pulavarty, and Mingming Cao to implement delayed allocation and multiple block allocation support, in order to improve ext3 performance as far as possible without requiring any on-disk format changes.
Interestingly, the work to remove the use of bufferheads in ext3 implemented most of the changes required for delayed allocation, at least for the case where bufferheads are not required. The nobh_commit_write() function delegates the task of writing data to the writepage() and writepages() functions by simply marking the page as dirty. Since writepage() already has to handle the case of writing a page which is mapped into a sparse memory-mapped file, it already handles block allocation by calling the filesystem-specific get_block() function. Hence, if the nobh_prepare_write() function were to omit the call to get_block(), the physical block would not be allocated until the page is actually written out via the writepage() or writepages() function.
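The division of labor described above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: the toy_* names, the structures, and the bump allocator are all hypothetical stand-ins for nobh_prepare_write(), nobh_commit_write(), writepage(), and get_block(), and the real functions take different arguments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_BLOCK (-1L)   /* page has no physical block assigned yet */

/* Toy page: only the fields this sketch needs (not the kernel's struct page). */
struct toy_page {
    long phys_block;     /* NO_BLOCK until allocation happens */
    bool dirty;
};

/* Toy filesystem: a trivial bump allocator over nr_blocks blocks (assumption). */
struct toy_fs {
    long next_free;
    long nr_blocks;
};

/* Stand-in for the filesystem-specific get_block(): map the page to a block. */
static int toy_get_block(struct toy_fs *fs, struct toy_page *pg)
{
    if (pg->phys_block != NO_BLOCK)
        return 0;                      /* already mapped */
    if (fs->next_free >= fs->nr_blocks)
        return -1;                     /* out of space */
    pg->phys_block = fs->next_free++;
    return 0;
}

/* Delayed-allocation prepare step: deliberately does NOT call toy_get_block(). */
static int toy_prepare_write(struct toy_fs *fs, struct toy_page *pg)
{
    (void)fs;
    (void)pg;
    return 0;
}

/* Commit step: only mark the page dirty; the I/O is left to writepage. */
static void toy_commit_write(struct toy_page *pg)
{
    pg->dirty = true;
}

/* Writeback: allocation happens here, as it already must for sparse mmap pages. */
static int toy_writepage(struct toy_fs *fs, struct toy_page *pg)
{
    int err;

    if (!pg->dirty)
        return 0;
    err = toy_get_block(fs, pg);
    if (err)
        return err;                    /* too late to report to write() */
    pg->dirty = false;
    return 0;
}
```

In this model a page that has been through toy_prepare_write() and toy_commit_write() is dirty but still unmapped; it only receives a block number when toy_writepage() runs. Note that an allocation failure surfaces in toy_writepage(), long after the write call has returned.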
Badari Pulavarty implemented a relatively small patch as a proof-of-concept, which
demonstrates that this approach works well. The work is still in progress,
with a few limitations to address. The first limitation is that in
the current proof-of-concept patch, data could be dropped if
the filesystem was full, without the write() system call
returning -ENOSPC.
In order to address this problem, the
nobh_prepare_write() function must note that the page currently
does not have a physical block assigned, and request that the filesystem
reserve a block for the page. So while the filesystem will not have
assigned a specific physical block as a result of
nobh_prepare_write(), it must guarantee that when
writepage() calls the block allocator, the allocation will succeed.
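A minimal sketch of such a reservation scheme follows, again as a userspace model. The resv_* names and the counters are hypothetical and are not taken from the proof-of-concept patch; the point is only that the space check moves to the prepare step, so that -ENOSPC can be returned to the caller, while the choice of a specific block is still deferred.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy filesystem counters (assumed layout, for illustration only). */
struct resv_fs {
    long free_blocks;      /* blocks neither allocated nor reserved */
    long reserved_blocks;  /* promised to dirty pages, not yet placed */
    long next_block;       /* next physical block the toy allocator hands out */
};

struct resv_page {
    long phys_block;       /* -1 until a physical block is assigned */
    bool reserved;         /* true once a block has been promised */
};

/* Prepare step: reserve space but do not pick a block. Fails early if full. */
static int resv_prepare_write(struct resv_fs *fs, struct resv_page *pg)
{
    if (pg->phys_block >= 0 || pg->reserved)
        return 0;                  /* already mapped or already reserved */
    if (fs->free_blocks == 0)
        return -ENOSPC;            /* reported while write() can still fail */
    fs->free_blocks--;
    fs->reserved_blocks++;
    pg->reserved = true;
    return 0;
}

/* Writeback: convert the reservation into a real block; cannot run out of space. */
static long resv_writepage(struct resv_fs *fs, struct resv_page *pg)
{
    if (pg->phys_block >= 0)
        return pg->phys_block;
    assert(pg->reserved && fs->reserved_blocks > 0);
    fs->reserved_blocks--;
    pg->reserved = false;
    pg->phys_block = fs->next_block++;
    return pg->phys_block;
}
```

With a single free block, the first resv_prepare_write() succeeds and a second one on another page returns -ENOSPC immediately, rather than the data being silently dropped at writeback time.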
The other major limitation is that, at present, the patch only works when bufferheads are not needed. The nobh code path, as currently present in the 2.6.11 kernel tree, only supports ext3 when it is journaling in writeback mode and not in ordered journaling mode, and when the blocksize is the same as the VM pagesize. Extending the nobh code paths to support sub-pagesize blocksizes is likely not very difficult, and is probably the appropriate way of addressing the first part of this shortcoming.
However, supporting delayed allocation for ext3 ordered journaling using this approach is going to be much more challenging. While metadata journaling alone is sufficient in writeback mode, ordered mode needs to track I/O submissions for purposes of waiting for completion of data writeback to disk as well, so that it can ensure that metadata updates hit the disk only after the corresponding data blocks are on disk. This avoids potential exposures and inconsistencies without requiring full data journaling[14].
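The ordering constraint can be modeled with a simple counter, shown below as a hedged sketch. The toy_txn structure and txn_* functions are hypothetical and do not correspond to the actual journaling code; they only show why ordered mode needs to know about every data I/O submitted on behalf of a transaction, whereas writeback mode does not.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy transaction: tracks data writes that must reach disk before commit. */
struct toy_txn {
    int  data_ios_in_flight;   /* data writes submitted but not yet complete */
    bool metadata_committed;
};

/* Called when a data write belonging to this transaction is submitted. */
static void txn_submit_data_io(struct toy_txn *t)
{
    t->data_ios_in_flight++;
}

/* Called from the (simulated) I/O completion path. */
static void txn_data_io_done(struct toy_txn *t)
{
    assert(t->data_ios_in_flight > 0);
    t->data_ios_in_flight--;
}

/* Writeback mode: metadata may commit regardless of outstanding data I/O. */
static bool txn_commit_writeback(struct toy_txn *t)
{
    t->metadata_committed = true;
    return true;
}

/*
 * Ordered mode: metadata may commit only after all data is on disk.
 * Returns false when the caller would have to wait for data I/O first.
 */
static bool txn_try_commit_ordered(struct toy_txn *t)
{
    if (t->data_ios_in_flight > 0)
        return false;
    t->metadata_committed = true;
    return true;
}
```

The model only works if every data write passes through txn_submit_data_io() and txn_data_io_done(); if the I/O is issued by code that the filesystem cannot observe, the counter cannot be maintained.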
However, in the current design of the generic multi-page writeback routines, block I/O submissions are issued directly by the generic routines and are not visible to the filesystem-specific code. Previously, when bufferheads were used for I/O, filesystem-specific wrappers around the generic code could track I/O through the bufferheads associated with a page and link them with the transaction. With the recent changes, where I/O requests are built directly as multi-page bio requests with no link from the page to the bio, this no longer applies.
A couple of approaches to solving this problem are under consideration as of the writing of this paper: