next up previous
Next: Asynchronous file unlink/truncate Up: Improving ext3 without changing Previous: Delayed allocation without extents


Efficiently allocating multiple blocks

As with the Alex Tomas's delayed allocation patch, Alex's multiple block allocator patch relies on an incompatible on-disk format change of the ext3 filesystem to support extent maps. In addition, the extent-based mballoc patch also required a format change in order to store data for the buddy allocator which it utilized. Since oprofile measurements of Alex's patch indicated the multiple block allocator seemed to be responsible for reducing CPU usage, and since it seemed to improve throughput in some workloads, we decided to investigate whether it was possible to obtain most of the benefits of a multiple block allocator using the current ext3 filesystem format. This seemed to be a reasonable approach since many of the advantages of supporting Alex's mballoc patch seemed to derive from collapsing a large number of calls to ext3_get_block() into much fewer calls to ext3_get_blocks(), thus avoiding excess calls into the journaling layer to record changes to the block allocation bitmap.

In order to implement a multiple-block allocator based on the existing block allocation bitmap, Mingming Cao first changed ext3_new_block() to accept a new argument specifying how many contiguous blocks the function should attempt to allocate, on a best efforts basis. The function now allocates the first block in the existing way, and then continues allocating up to the requested number of adjacent physical blocks at the same time if they are available.

The modified ext3_new_block() function was then used to implement ext3's get_blocks() method, the standardized filesystem interface to translate a file offset and a length to a set of on-disk blocks. It does this by starting at the first file offset and translating it into a logical block number, and then taking that logical block number and mapping it to a physical block number. If the logical block has already been mapped, then it will continue mapping the next logical block until the requisite number of physical blocks have been returned, or an unallocated block is found.

If some blocks need to be allocated, first ext3_get_blocks() will look ahead to see how many adjacent blocks are needed, and then passes this allocation request to ext3_new_blocks(), searches for the requested free blocks, marks them as used, and returns them to ext3_get_blocks(). Next, ext3_get_blocks() will update the inode's direct blocks, or a single indirect block to point at the allocated blocks.

Currently, this ext3_get_blocks() implementation does not allocate blocks across an indirect block boundary. There are two reasons for this. First, the JBD journaling requests the filesystem to reserve the maximum of blocks that will require journaling, when a new transaction handle is requested via ext3_journal_start(). If we were to allow a multiple block allocation request to span an indirect block boundary, it would be difficult to predict how many metadata blocks may get dirtied and thus require journaling. Secondly, it would be difficult to place any newly allocated indirect blocks so they are appropriately interleaved with the data blocks.

Currently, only the Direct I/O code path uses the get_blocks() interfaces; the mpage_writepages() function calls mpage_writepage() which in turn calls get_block(). Since only a few workloads (mainly databases) use Direct I/O, Suparna Bhattacharya has written a patch to change mpage_writepages() use get_blocks() instead. This change should be generically helpful for any filesystems which implement an efficient get_blocks() function.

Draft patches have already been posted to the ext2-devel mailing list. As of this writing, we are trying to integrate Mingming's ext3_get_blocks() patch, Suparna Bhattacharya's mpage_writepage() patch and Badari Pulavarty's generic delayed allocation patch (discussed in Section 4.2) in order to evaluate these three patches together using benchmarks.


next up previous
Next: Asynchronous file unlink/truncate Up: Improving ext3 without changing Previous: Delayed allocation without extents
Mingming Cao 2005-07-26