Delayed Allocation

1. Background
2. Current status
3. Other related work
4. Know issues and proposed solutions
5. patches
6. related discussions
7. todo



1. Background

a. Why we need delayed allocation

1)Enables multiple block allocation on buffered IO, which reduce cpu time,increase the chance of get contiguous blocks, and reduce the number of meta data updates
2)Could avoid block allocation at all for temporary files

More see http://ext2.sourceforge.net/2005-ols/paper-html/node23.html

b. How it works
Basically delayed allocation defers allocation of blocks from prepare_write() and employs extent walking, together with the multiple block allocation feature (described in the next section), for clustering block allocations maximally into contiguous blocks.

Instead of allocating the disk block in prepare_write(), the the page is marked as needing block reservation. The commit_write() function calculates the required number of blocks, and reserves them to make sure that there are enough free blocks in the filesystem to satisfy the write. When the pages get flushed to disk by writepage() or writepages(), these functions will walk all the dirty pages in the specified inode, cluster the logically contiguous ones, and submit the page or pages to the bio layer. After the block allocation is complete, the reservation is dropped. A single block I/O request (or BIO) is submitted for write out of pages processed whenever a new allocated extent (or the next mapped extent if already allocated) on the disk is not adjacent to the previous one, or when writepages() completes. In this manner the delayed allocation code is tightly integrated with other features to provide best performance.

c. History
Andrew Morton has a patch for delayed allocation back to 2.5 time.
Ted Tso "why we need delayed allocation?"
http://marc.theaimsgroup.com/?l=ext2-devel&m=107239591117758&w=2
Alex Thomas post his first delayed allocation patch(with extent and mblocks) at April 2004:
Showed significant improvements on sequential writes:

http://marc.theaimsgroup.com/?l=linux-kernel&m=108188483415612&w=2
his recent post on Jan 2005:
http://marc.theaimsgroup.com/?l=ext2-devel&m=110820138324595&w=2


d. What other linux filesystem support this feature?
   XFS and reiser4, according to
    http://en.wikipedia.org/wiki/Delayed_allocation



2. Current status

Prototype patch sent out to community mid of 2005. There are a couple of issues to address for data=wrtieback with nobh. We haven't come out with a plan for ordered mode. We also haven't decide whether we will do generic delayed allocation support at VFS layer or we will do it at ext3 layer.
 


3 other related work

a. multiple block allocation in mpage_writepages
Uses mpageio structure instead of passing lots of params as in hch's mpage readpages patch
Status: Know issues: (Suparna) - Lock ordering issue between page lock and journalling if used for ext3 - Uses writepage helper argument for dealing with confused case (hch's solution for mpage_readpages passes get_block() in addition to get_blocks()) - Unresolved bug when used with LILO seen on one system - Code is a little complex as it tries to address both bh and nobh - ext3 ordered mode support needs more thought
todo: - Work with hch to ensure uniform approach for mpage_readpages and mpage_writepages for better maintainability.
- Resolve issues - Move prototype declaration to enable ext3 to use __mpage_writepage directly - Obtain performance results with ext3 and xfs - radix tree contiguous pages lookup could be optimised b. ext3_writepages support
Status: known issues: - Duplicates generic mpage_writepages code, as it can't use the generic routine directly because of lock/journal ordering issue - ext3 ordered mode support yet to be implemented
todo: - Decide whether to do this at generic level or live with duplication for now - Decide on approach for ext3 ordered mode - Test with LILO
c.support ext3 multiple block allocation
Status: pre-worked the patch and send out to Andrew on Jan 2006. Now they are sitting in mm tree.
known issues: not known yet.
todo: work with hch for map multiple blocks for buffered IO at mpage_readpages



4. known issues

a. Error Handling
Description: Need to reserve free space at prepare_write time, so when flush pages to disk we will not fail because of full disk space.
possible solutions:
we need a page flag to indicate that this page need block allocation later and has already reserved free space on disk for it. (Suparna, correct me if this is wrong)

b. Support data=ordered mode
Description: (from Badari's email)
In case of ordered mode, commit_write() calls journal_diry_data()
which add buffer heads to transaction. This is needed since we need
to make sure the "page" gets flushed to disk before finishing the
metadata update (as part of commit transaction). So, if we want to
get rid of bufferhead, we need something to attach the page to
transaction. I thought about attaching "page" directly to the
t_sync_datalist - but then, when the time comes to flush it to
disk, I need to find out the disk block# mapping (which means I need
to do getblk()). Bufferhead is nicely caching all this stuff for
me and we can simply do ll_rw_blk() on the bh to writeout the page.

(question: it is possible to do delalloc with bh??)

possible solutions
    1). Andrew Morton:
    He proposed to toss the whole idea of attaching journalled-data buffers
    to the journal, instead, walk the dirty inodes list at commit time and
    flush all their pages out to disk, to guarantee the ordering.
        http://marc.theaimsgroup.com/?l=ext2-devel&m=111286563813799&w=2
    concerns: a whole new journalling mode, code intrusive

    2). Stepehen Tweedie:
        get rid of usage of bh in ordered mode. Not sure what in his mind:-)

    3). Alex Thoma's solution
    In short: Just count bios and wait for them to finish.
    ext3_writepages() finds contiguous sets of dirty pages,  allocates blocks
    for holes and increments special counter in current transaction it opened
    to fill holes. then it submits bio. At bio's completion, that counter is
    decreased. journal_commit_thread() waits for counter == 0

    http://marc.theaimsgroup.com/?l=ext2-devel&m=110994252122593&w=2    
    concerns: this method doesn't fit generic mpage_writepages() as it has no
    idea about transactions/journals etc.
    
    (question: is there still problem is we choose to do all page clustering at ext3_writepages()?)

    4). Is it possible to have bufferheads and delalloc?
    I am thinking probably we could still have bufferheads, but at the commit time,
    flushing all the dirty pages belongs to the inode first, and then proceed the journalling
    commitment.

    Is this possible?

c. page size != block size
    If power decide to set 64k as default page size, we will need to support page size!=block for delalloc...

    (question: what's the proposed solution here?)

        

5. Patches

http://marc.theaimsgroup.com/?l=ext2-devel&m=112162215406004&w=2

latest mblocks for ext3(2.6.15 based):

delalloc(2.6.13 based):
http://marc.theaimsgroup.com/?l=ext2-devel&m=112162217125811&w=2

mblocks for mpage_writepages:
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112162295603496&w=2

ext3_writepages() support for writeback mode
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=112162261617853&w=2

(todo, we will move all the patches to ext2.sf.net/patches)


6. Other related discussions

Andrew Morton's delalloc patch (2002)
http://lwn.net/2002/0307/a/delayed-allocation.php3

Alex Thomas's delayed allocation:
http://www.ussg.iu.edu/hypermail/linux/kernel/0404.1/0853.html
latest code at:
ftp://ftp.clusterfs.com/pub/people/alex/2.6.11


7. To dos

1) port 2.6.13 based patches to current 2.6.15 level
2) add error handling in data=writeback mode. Need to reserve free block at prepare write time, so at the time of page flushing, it will not fail because of ENOSPC
3) evaluate possible solutions for data=ordered mode
  3.1) evaluate the bufferheads usage in ordered mode, is it possible to elimate
       bufferheads usage in ordered mode at all?
  3.2) evaluate Andrew Morton's proposal
  3.3) evaluate Alex Thomas' solution for data=ordered mode
  3.3) seeking other options
4) evaluate xfs/reiser4 implementation of delayed allocation