Journaling, or write-ahead logging, is a solution to the problem of file system inconsistency after a crash. Inspired by database management systems, this method first writes a description of the pending updates to a “log” before writing them to their final locations on the disk; hence the name “write-ahead logging”. After a crash, the OS can read this log and pick up from where it left off, instead of scanning the entire disk to find and fix inconsistencies, as FSCK does.
Examples of file systems that implement journaling include Linux ext3 and ext4, and Windows NTFS.
The log is stored in a simple data structure called the journal. The figure below shows its structure, which comprises three components.
- TxB (Transaction Begin Block):
This contains the transaction ID, or the TID.
- Inode, Bitmap and Data Blocks (Metadata and Data):
These three blocks contain a copy of the new contents of the blocks to be updated on the disk. The inode and bitmap are metadata; the data block holds the file contents.
- TxE (Transaction End Block):
This simply marks the end of the transaction identified by the TID.
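The layout described above can be sketched in a few lines of Python. This is only an illustrative model, not the on-disk format of any real file system: the class names `TxB`, `BlockCopy` and `TxE`, the `disk_addr` field, and the helper `build_transaction` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class TxB:
    tid: int  # transaction ID stored in the begin block


@dataclass(frozen=True)
class BlockCopy:
    disk_addr: int   # hypothetical: final location of the block on disk
    contents: bytes  # copy of the new block contents


@dataclass(frozen=True)
class TxE:
    tid: int  # marks the end of the transaction with this TID


Record = Union[TxB, BlockCopy, TxE]


def build_transaction(tid: int, updates: List[Tuple[int, bytes]]) -> List[Record]:
    """Return the journal records for one transaction, in log order."""
    records: List[Record] = [TxB(tid)]
    records += [BlockCopy(addr, data) for addr, data in updates]
    records.append(TxE(tid))
    return records


# One transaction updating an inode, a bitmap and a data block.
records = build_transaction(
    1, [(4, b"inode v2"), (2, b"bitmap v2"), (9, b"data")]
)
```

Here `records` holds five entries: the TxB, the three block copies, and the TxE that closes the transaction.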
As soon as an update is requested, it is written to the log, and only thereafter to its final location in the file system. Once all these writes are successful, we say that the transaction has been checkpointed and the update is complete.
What if a crash occurs during journaling?
One could argue that journaling itself is not atomic. How, then, does the system handle a crash that happens while the log is being written? To handle this scenario, the journal is written in two steps: first the TxB and the three blocks that follow it (these writes may be issued together), and then, only after they have completed, the TxE. The whole process can be summarized as follows.
- Journal Write:
Write TxB, inode, bitmap and data block contents to the journal (log).
- Journal Commit:
Write TxE to the journal (log).
- Checkpoint:
Write the contents of the inode, bitmap and data block to their final locations on the disk.
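The ordering of these writes can be sketched as follows. This is a minimal in-memory model under stated assumptions: the journal is a Python list of tuples, the disk is a dict from block address to contents, and `journaled_update` is a hypothetical helper, not an API of any real file system. A real implementation would also have to wait for each step to reach stable storage before starting the next.

```python
from typing import Dict, List


def journaled_update(
    journal: List[tuple],
    disk: Dict[int, bytes],
    tid: int,
    updates: Dict[int, bytes],
) -> None:
    """Apply `updates` to `disk` using write-ahead logging."""
    # Journal write: TxB followed by copies of the blocks to be updated.
    journal.append(("TxB", tid))
    for addr, data in updates.items():
        journal.append(("BLOCK", tid, addr, data))
    # (A real file system would flush the writes above before continuing.)

    # Journal commit: the TxE makes the transaction valid in the log.
    journal.append(("TxE", tid))

    # Final step: write the blocks to their final locations on disk.
    for addr, data in updates.items():
        disk[addr] = data


journal: List[tuple] = []
disk = {4: b"inode v1", 2: b"bitmap v1"}
journaled_update(
    journal, disk, tid=1,
    updates={4: b"inode v2", 2: b"bitmap v2", 9: b"data"},
)
```

The key design point is that the TxE is appended only after every block copy is in the journal, so a transaction with a TxE is always complete in the log.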
A crash may occur at different points during this process. If it occurs during step 1, i.e. before the TxE is written, the transaction is incomplete in the log. Recovery can simply skip it, and the file system stays consistent because nothing has been written to its final location yet.
If a crash occurs after step 2, i.e. after the TxE has been written, the transaction has been logged but may not have been written to the disk completely. We cannot be sure which of the three blocks (inode, bitmap and data block) were actually updated and which were not. In this case, the system scans the log on reboot and replays the committed transactions that have not yet been freed from the journal. This can lead to redundant disk writes, but it ensures consistency. This process is called redo logging.
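The two crash cases can be illustrated with a small recovery routine. This is a sketch only: the tuple-based journal records, the dict-based disk, and the function `recover` are assumptions made for this example rather than the behavior of a specific file system.

```python
from typing import Dict, List


def recover(journal: List[tuple], disk: Dict[int, bytes]) -> List[int]:
    """Replay committed transactions; return the TIDs that were replayed."""
    # A transaction counts as committed only if its TxE is in the log.
    committed = {rec[1] for rec in journal if rec[0] == "TxE"}
    replayed: List[int] = []
    for rec in journal:
        if rec[0] == "BLOCK" and rec[1] in committed:
            _, tid, addr, data = rec
            disk[addr] = data  # redo the write, even if already applied
            if tid not in replayed:
                replayed.append(tid)
    return replayed


journal = [
    ("TxB", 1), ("BLOCK", 1, 4, b"inode v2"),
    ("BLOCK", 1, 2, b"bitmap v2"), ("TxE", 1),
    ("TxB", 2), ("BLOCK", 2, 4, b"inode v3"),  # crash before TxE of tid 2
]
# Crash happened after only the inode of transaction 1 reached the disk.
disk = {4: b"inode v2", 2: b"bitmap v1"}
replayed = recover(journal, disk)
```

Transaction 2 has no TxE, so it is skipped. Transaction 1 is replayed in full, which rewrites the inode redundantly but brings the bitmap up to date.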
Using the Journal as a Circular Buffer:
Since many transactions are made, the journal would eventually fill up. To address this issue, we can use the journal as a circular buffer, wherein newer transactions reuse the space of old, already completed ones in a circular manner. The figure below shows an overall view of the journal, with tr1 as the oldest transaction and tr5 the newest.
The journal super block maintains pointers to the oldest and the newest transactions. As soon as a transaction is complete, its space is marked as “free” and the super block's pointer to the oldest transaction is advanced to the next one.
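A circular journal of this kind can be modeled roughly as below. The class `CircularJournal`, its fields `oldest`, `newest` and `count`, and the choice of one slot per transaction are simplifying assumptions for illustration; real journals track byte or block offsets instead.

```python
from typing import List, Optional


class CircularJournal:
    """Fixed-size journal in which freed slots are reused circularly."""

    def __init__(self, capacity: int) -> None:
        self.slots: List[Optional[int]] = [None] * capacity
        self.oldest = 0  # slot of the oldest live transaction
        self.newest = 0  # slot where the next transaction will be written
        self.count = 0   # number of live transactions

    def append(self, tid: int) -> None:
        if self.count == len(self.slots):
            raise RuntimeError("journal full: free old transactions first")
        self.slots[self.newest] = tid
        self.newest = (self.newest + 1) % len(self.slots)
        self.count += 1

    def free_oldest(self) -> int:
        """Mark the oldest transaction as free and advance the pointer."""
        if self.count == 0:
            raise RuntimeError("journal is empty")
        tid = self.slots[self.oldest]
        self.slots[self.oldest] = None
        self.oldest = (self.oldest + 1) % len(self.slots)
        self.count -= 1
        return tid


j = CircularJournal(3)
for tid in (1, 2, 3):
    j.append(tid)
freed = j.free_oldest()  # transaction 1 is complete; its slot becomes free
j.append(4)              # wraps around and reuses the slot of transaction 1
```

Note that `append` refuses to overwrite a live transaction: a slot is reused only after the transaction in it has been freed.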