Delayed Allocation
1. Background
2. Current status
3. Other related work
4. Know issues and proposed solutions
5. patches
6. related discussions
7. todo
1. Background
a. Why we need delayed allocation
1)Enables multiple block allocation on buffered IO, which reduce cpu
time,increase the chance of get contiguous blocks, and reduce the
number of meta data updates
2)Could avoid block allocation at all for temporary files
More see http://ext2.sourceforge.net/2005-ols/paper-html/node23.html
b. How it works
Basically delayed allocation defers allocation of blocks from
prepare_write() and employs extent walking, together with the multiple
block allocation feature (described in the next section), for
clustering block allocations maximally into contiguous blocks.
Instead of allocating the disk block in prepare_write(), the the page
is marked as needing block reservation. The commit_write() function
calculates the required number of blocks, and reserves them to make
sure that there are enough free blocks in the filesystem to satisfy the
write. When the pages get flushed to disk by writepage() or
writepages(), these functions will walk all the dirty pages in the
specified inode, cluster the logically contiguous ones, and submit the
page or pages to the bio layer. After the block allocation is complete,
the reservation is dropped. A single block I/O request (or BIO) is
submitted for write out of pages processed whenever a new allocated
extent (or the next mapped extent if already allocated) on the disk is
not adjacent to the previous one, or when writepages() completes. In
this manner the delayed allocation code is tightly integrated with
other features to provide best performance.
c. History
Andrew Morton has a patch for delayed allocation back to 2.5 time.
Ted Tso "why we need delayed allocation?"
http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
Alex Thomas post his first delayed allocation patch(with extent and
mblocks) at April 2004:
Showed significant improvements on sequential writes:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108188483415612&w=2
his recent post on Jan 2005:
http://marc.theaimsgroup.com/?l=ext2-devel&m=110820138324595&w=2
d. What other linux filesystem support this feature?
XFS and reiser4, according to
http://en.wikipedia.org/wiki/Delayed_allocation
2. Current status
Prototype patch sent out to community mid of 2005. There are a
couple of issues to address for data=wrtieback with nobh. We haven't
come out with a plan for ordered mode. We also haven't decide whether
we will do generic delayed allocation support at VFS layer or we will
do it at ext3 layer.
3 other related work
a. multiple block allocation in mpage_writepages
Uses mpageio structure instead of passing lots of params as in hch's mpage
readpages patch
Status:
Know issues: (Suparna)
- Lock ordering issue between page lock and journalling if used for ext3
- Uses writepage helper argument for dealing with confused case (hch's
solution for mpage_readpages passes get_block() in addition to get_blocks())
- Unresolved bug when used with LILO seen on one system
- Code is a little complex as it tries to address both bh and nobh
- ext3 ordered mode support needs more thought
todo:
- Work with hch to ensure uniform approach for mpage_readpages and
mpage_writepages for better maintainability.
- Resolve issues
- Move prototype declaration to enable ext3 to use __mpage_writepage directly
- Obtain performance results with ext3 and xfs
- radix tree contiguous pages lookup could be optimised
b. ext3_writepages support
Status:
known issues:
- Duplicates generic mpage_writepages code, as it can't use the generic
routine directly because of lock/journal ordering issue
- ext3 ordered mode support yet to be implemented
todo:
- Decide whether to do this at generic level or live with duplication for now
- Decide on approach for ext3 ordered mode
- Test with LILO
c.support ext3 multiple block allocation
Status: pre-worked the patch and send out to Andrew on Jan 2006. Now
they are sitting in mm tree.
known issues: not known yet.
todo: work with hch for map multiple blocks for buffered IO at
mpage_readpages
4. known issues
a. Error Handling
Description: Need to reserve free space at prepare_write time, so when
flush pages to disk we will not fail because of full disk space.
possible solutions:
we need a page flag to indicate that this page need block allocation
later and has already reserved free space on disk for it. (Suparna,
correct me if this is wrong)
b. Support data=ordered mode
Description: (from Badari's email)
In case of ordered mode, commit_write() calls journal_diry_data()
which add buffer heads to transaction. This is needed since we need
to make sure the "page" gets flushed to disk before finishing the
metadata update (as part of commit transaction). So, if we want to
get rid of bufferhead, we need something to attach the page to
transaction. I thought about attaching "page" directly to the
t_sync_datalist - but then, when the time comes to flush it to
disk, I need to find out the disk block# mapping (which means I need
to do getblk()). Bufferhead is nicely caching all this stuff for
me and we can simply do ll_rw_blk() on the bh to writeout the page.
(question: it is possible to do delalloc with bh??)
possible solutions
1). Andrew Morton:
He proposed to toss the whole idea of attaching
journalled-data buffers
to the journal, instead, walk the dirty inodes list
at commit time and
flush all their pages out to disk, to guarantee the
ordering.
http://marc.theaimsgroup.com/?l=ext2-devel&m=111286563813799&w=2
concerns: a whole new journalling mode, code
intrusive
2). Stepehen Tweedie:
get rid of usage of bh in ordered
mode. Not sure what in his mind:-)
3). Alex Thoma's solution
In short: Just count bios and wait for them to
finish.
ext3_writepages() finds contiguous sets of dirty
pages, allocates blocks
for holes and increments special counter in current
transaction it opened
to fill holes. then it submits bio. At bio's
completion, that counter is
decreased. journal_commit_thread() waits for counter
== 0
http://marc.theaimsgroup.com/?l=ext2-devel&m=110994252122593&w=2
concerns: this method doesn't fit generic
mpage_writepages() as it has no
idea about transactions/journals etc.
(question: is there still problem is we choose to do
all page clustering at ext3_writepages()?)
4). Is it possible to have bufferheads and delalloc?
I am thinking probably we could still have
bufferheads, but at the commit time,
flushing all the dirty pages belongs to the inode
first, and then proceed the journalling
commitment.
Is this possible?
c. page size != block size
If power decide to set 64k as default page size, we
will need to support page size!=block for delalloc...
(question: what's the proposed solution here?)
5. Patches
http://marc.theaimsgroup.com/?l=ext2-devel&m=112162215406004&w=2
latest mblocks for ext3(2.6.15 based):
delalloc(2.6.13 based):
http://marc.theaimsgroup.com/?l=ext2-devel&m=112162217125811&w=2
mblocks for mpage_writepages:
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112162295603496&w=2
ext3_writepages() support for writeback mode
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112162261617853&w=2
(todo, we will move all the patches to ext2.sf.net/patches)
6. Other related discussions
Andrew Morton's delalloc patch (2002)
http://lwn.net/2002/0307/a/delayed-allocation.php3
Alex Thomas's delayed allocation:
http://www.ussg.iu.edu/hypermail/linux/kernel/0404.1/0853.html
latest code at:
ftp://ftp.clusterfs.com/pub/people/alex/2.6.11
7. To dos
1) port 2.6.13 based patches to current 2.6.15 level
2) add error handling in data=writeback mode. Need to reserve free
block at prepare write time, so at the time of page flushing, it will
not fail because of ENOSPC
3) evaluate possible solutions for data=ordered mode
3.1) evaluate the bufferheads usage in ordered mode, is it
possible to elimate
bufferheads usage in ordered mode
at all?
3.2) evaluate Andrew Morton's proposal
3.3) evaluate Alex Thomas' solution for data=ordered mode
3.3) seeking other options
4) evaluate xfs/reiser4 implementation of delayed allocation