The scalability improvements in the block layer and other portions of the kernel during 2.5 development uncovered a scaling problem for ext3/JBD under parallel I/O load. To address this issue, Alex Tomas and Andrew Morton worked to remove a per-filesystem superblock lock (lock_super()) from ext3 block allocations [13].
This was done by deferring the filesystem's accounting of the number of free inodes and blocks, updating these counts only when they are needed by the statfs() or umount() system calls. This lazy update strategy was made possible by keeping authoritative counters of the free inodes and blocks at the per-block group level, and it allowed the filesystem-wide lock_super() to be replaced with fine-grained locks. Since a spin lock for every block group would consume too much memory, a hashed spin lock array was used to protect accesses to the block group summary information. In addition, the need to use these spin locks was reduced further by using atomic bit operations to modify the bitmaps, thus allowing concurrent allocations within the same group.
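The combination of a hashed lock array, atomic bitmap updates, and lazily summed counters can be illustrated with a small user-space sketch. This is not the ext3 code: the names (group_desc, claim_block, count_free_blocks), the pool size, and the group size are all hypothetical, and C11 atomics stand in for the kernel's spin locks and bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_GROUP_LOCKS   4u    /* hypothetical pool size; must be a power of two */
#define BLOCKS_PER_GROUP 128u  /* toy group size, not the real ext3 value */
#define BITS_PER_WORD    32u

/* A small fixed pool of spin locks shared by all block groups. */
static atomic_flag group_locks[NR_GROUP_LOCKS] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

struct group_desc {
    unsigned free_blocks;  /* authoritative per-group counter */
    _Atomic unsigned bitmap[BLOCKS_PER_GROUP / BITS_PER_WORD];
};

/* Hash a group number onto one lock of the pool. */
static atomic_flag *group_lock(unsigned group)
{
    return &group_locks[group & (NR_GROUP_LOCKS - 1)];
}

static void spin_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;  /* busy-wait until the holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Atomically set bit nr; return true if it was already set. */
static bool test_and_set_bit(unsigned nr, _Atomic unsigned *map)
{
    unsigned mask = 1u << (nr % BITS_PER_WORD);
    unsigned old = atomic_fetch_or(&map[nr / BITS_PER_WORD], mask);
    return (old & mask) != 0;
}

/* Try to allocate one block. The bitmap update takes no lock, so two
 * allocators can work in the same group at once; the hashed lock is
 * held only for the short counter update. */
static bool claim_block(struct group_desc *groups, unsigned group, unsigned block)
{
    assert(block < BLOCKS_PER_GROUP);
    if (test_and_set_bit(block, groups[group].bitmap))
        return false;  /* another allocator already owns this block */

    atomic_flag *l = group_lock(group);
    spin_lock(l);
    groups[group].free_blocks--;
    spin_unlock(l);
    return true;
}

/* Lazy filesystem-wide total, computed only on demand (as for statfs()). */
static unsigned long count_free_blocks(struct group_desc *groups, unsigned ngroups)
{
    unsigned long total = 0;
    for (unsigned g = 0; g < ngroups; g++) {
        atomic_flag *l = group_lock(g);
        spin_lock(l);
        total += groups[g].free_blocks;
        spin_unlock(l);
    }
    return total;
}
```

Because no global counter is touched on the allocation path, allocators in different groups contend only when their group numbers hash to the same lock; the cost is a linear scan over the groups when the total is requested.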
After addressing the scalability problems in the ext3 code proper, the focus moved to the journal (JBD) routines, which made extensive use of the big kernel lock (BKL). Alex Tomas and Andrew Morton worked together to reorganize the locking of the journaling layer to allow as much concurrency as possible, replacing the BKL and the per-filesystem journal lock with a fine-grained locking scheme. This scheme uses a new per-bufferhead lock (BH_JournalHead), a new per-transaction lock (t_handle_lock), and several new per-journal locks (j_state_lock, j_list_lock, and j_revoke_lock, the last of which protects the list of revoked blocks). The locking hierarchy for these new locks, which must be followed to prevent deadlocks, is documented in the include/linux/jbd.h header file.
The final scalability change needed was to replace sleep_on() (which is only safe when called from code running under the BKL) with the new wait_event() facility.
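The reason sleep_on() depends on the BKL is that the test of the condition and the act of going to sleep are separate steps; without a global lock, a wakeup can arrive between them and be lost. wait_event() avoids this by re-testing the condition as part of the wait. The following is a user-space analogy only, using POSIX condition variables rather than kernel wait queues; struct waitq and all function names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

struct waitq {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;  /* the condition being waited for */
};

static void waitq_init(struct waitq *q)
{
    int rc = pthread_mutex_init(&q->lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&q->cond, NULL);
    assert(rc == 0);
    (void)rc;  /* silence unused warning when NDEBUG is defined */
    q->done = false;
}

/* wait_event()-style wait: the predicate is tested under the lock and
 * re-tested after every wakeup, so a wakeup that happens before we
 * sleep is never lost and spurious wakeups are harmless. */
static void wait_for_done(struct waitq *q)
{
    pthread_mutex_lock(&q->lock);
    while (!q->done)
        pthread_cond_wait(&q->cond, &q->lock);
    pthread_mutex_unlock(&q->lock);
}

/* Set the condition and wake all waiters, under the same lock. */
static void signal_done(struct waitq *q)
{
    pthread_mutex_lock(&q->lock);
    q->done = true;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

static void *waker(void *arg)
{
    signal_done(arg);
    return NULL;
}

/* Start a waker thread, wait for it, and report whether the waiter
 * observed the condition. Correct whichever thread runs first. */
static bool demo_wait(void)
{
    struct waitq q;
    pthread_t t;

    waitq_init(&q);
    if (pthread_create(&t, NULL, waker, &q) != 0)
        return false;
    wait_for_done(&q);
    pthread_join(t, NULL);

    bool ok = q.done;
    pthread_cond_destroy(&q.cond);
    pthread_mutex_destroy(&q.lock);
    return ok;
}
```

If the waker runs before the waiter reaches wait_for_done(), the waiter sees done already set and never sleeps; a sleep_on()-style wait that slept unconditionally would block forever in that ordering.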
These combined efforts improved multiple-writer performance on ext3 noticeably: ext3 throughput improved by a factor of 10 on the SDET benchmark, and the number of context switches dropped significantly [2,13].