next up previous
Next: Future plans Up: Extents, delayed allocation and Previous: Extents based delayed allocation


Buddy based extent allocation

One of the shortcomings of the current ext3 block allocation algorithm, which allocates one block at a time, is that it is not efficient enough for high speed sequential writes. In one experiment utilizing direct I/O on a dual Opteron workstation with fast enough buses, fiber channel, and a large, fast RAID array, the CPU limited the I/O throughput to 315 MB/s. While this would not be an issue on most machines (since the maximum bandwidth of a PCI bus is 127 MB/s), but for newer or enterprise-class servers, the amount of data per second that can be written continuously to the filesystem is no longer limited by the I/O subsystem, but by the amount of CPU time consumed by ext3's block allocator.

To address this problem, Alex Tomas designed and implemented a multiple block allocation, called mballoc, which uses a classic buddy data structure on disk to store chunks of free or used blocks for each block group. This buddy data is an array of metadata, where each entry describes the status of a cluster of 3#3 blocks, classified as free or in use.

Since block buddy data is not suitable for determining a specific block's status and locating a free block close to the allocation goal, the traditional block bitmap is still required in order to quickly test whether a specific block is available or not.

In order to find a contiguous extent of blocks to allocate, mballoc checks whether the goal block is available in the block bitmap. If it is available, mballoc looks up the buddy data to find the free extent length starting from the goal block. To find the real free extent length, mballoc continues by checking whether the physical block right next to the end block of the previously found free extent is available or not. If that block is available in the block bitmap, mballoc could quickly find the length of the next free extent from buddy data and add it up to the total length of the free extent from the goal block.

For example, if block M is the goal block and is claimed to be available in the bitmap, and block M is marked as free in buddy data of order n, then initially the free chunk size from block M is known to be 3#3. Next, mballoc checks the bitmap to see if block 4#4 is available or not. If so, mballoc checks the buddy data again, and finds that the free extent length from block 4#4 is k. Now, the free chunk length from goal block M is known to be 5#5. This process continues until at some point the boundary block is not available. In this manner, instead of testing dozens, hundreds, or even thousands of blocks' availability status in the bitmap to determine the free blocks chunk size, it can be enough to just test a few bits in buddy data and the block bitmap to learn the real length of the free blocks extent.

If the found free chunk size is greater than the requested size, then the search is considered successful and mballoc allocates the found free blocks. Otherwise, depending on the allocation criteria, mballoc decides whether to accept the result of the last search in order to preserve the goal block locality, or continue searching for the next free chunk in case the length of contiguous blocks is a more important factor than where it is located. In the later case, mballoc scans the bitmap to find out the next available block, then, starts from there, and determines the related free extent size.

If mballoc fails to find a free extent that satisfies the requested size after rejecting a pre-defined number (currently 200) of free chunks, it stops the search and returns the best (largest) free chunk found so far. In order to speed up the scanning process, mballoc maintains the total number of available blocks and the first available block of each block group.



Subsections
next up previous
Next: Future plans Up: Extents, delayed allocation and Previous: Extents based delayed allocation
Mingming Cao 2005-07-26