reiser4: 1. internal tree

This continues the previous entry: introduction

0. B-trees overview

B-tree is an umbrella term for a variety of similar algorithms used to implement the dictionary abstraction in a form suitable for external storage. The reiser4 version of this algorithm, officially known as a dancing tree, is closest to the B+-tree. Basically, in a B+-tree user-supplied data records are stored in leaf nodes. Each leaf node keeps records from some particular key range (adjusted dynamically). Each leaf node is referenced from its parent: this reference is a pointer to the child (usually a block number) and the smallest key in the key range covered by that leaf. A parent node contains multiple references to its children, again covering some particular key range. The parent node may have its own parent, and so on, until the whole key space is covered.
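
To make this concrete, here is a sketch of the two kinds of nodes (declarations and the fan-out are illustrative; the actual reiser4 node format is the topic of a later entry):

#define NR 170 /* illustrative fan-out */

typedef unsigned long long tree_key; /* a fixed-length sequence of bits */
typedef unsigned long long blocknr;  /* disk block number */

struct internal_node {
        int      nr_items;  /* number of children in use */
        tree_key key[NR];   /* key[i]: smallest key under child i */
        blocknr  child[NR]; /* block number of the i-th child */
};

struct leaf_node {
        int      nr_items;  /* number of records in use */
        tree_key key[NR];   /* key of the i-th record */
        void    *rec[NR];   /* the i-th user-supplied data record */
};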

Compare this with B-trees proper, where data records are also stored at levels other than the leaf one. Also compare this with the radix tree (aka PATRICIA), where the key range covered by a given node is uniquely determined by its level, rather than adjusted dynamically as in B-trees.

The contents of the reiser4 internal tree are initially stored on disk. As file system operations go on, some blocks of the tree are read into memory and cached there. When a new node is added to the tree, it first exists as a block-sized buffer in memory. Under certain conditions portions of the tree are written out from memory back to stable storage. During write-out, nodes that were marked dirty (either as a result of modification, or because they are newly created) are written to the appropriate disk blocks and marked clean. Clean nodes can be discarded from memory.
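
Schematically, the life cycle of a cached node can be sketched as follows (names, including write_block(), are illustrative, not reiser4's actual in-memory machinery):

enum node_state {
        CLEAN, /* matches its disk block; can be discarded from memory */
        DIRTY, /* modified or newly created; must be written out first */
};

struct cached_node {
        enum node_state state;
        blocknr         blk;  /* block this node is (or will be) stored at */
        void           *data; /* block-sized buffer holding node contents */
};

write_out(n) {
        if (n->state == DIRTY) {
                write_block(n->blk, n->data); /* hypothetical disk write */
                n->state = CLEAN;             /* now safe to discard */
        }
}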

1. lookup and modification

The lookup operation locates a record with a given key in the tree, or, if no such record exists yet, locates the place where it would be inserted (this place is uniquely identified by the key). All other tree operations start with lookup, because they have to find the place in the tree identified by a given key.

Roughly speaking, lookup starts from the top of the tree, searches through an index node to find which of its children the given key falls into, loads that child node, and repeats until it gets to the leaf level. One can say that lookup is a top-to-bottom tree operation.

This poses two questions:

What is the top of a tree and how to find it?

How to find out which child node to proceed with?

The first question seems trivial, because every tree has a root node, right? The problem is that, as will be explained below, a concurrent tree modification operation may grow the tree by adding a new root to it (so that the old root becomes a child of the new one), or shrink it by killing the existing root. To cope with this, the reiser4 tree has a special imaginary node, existing only in memory, that logically hangs just above the root node. First of all, lookup locks this node and learns the location of the root node from it. After this, it loads the root node into memory and releases the lock. This imaginary node is called the uber-node. Curiously enough, Sun ZFS uses exactly the same term for a very similar thing. They even have a 0x00bab10c magic number in their disk format, which is supposed to be as close as possible to the pronunciation of "uber block" (funnier still, "bablo" is Russian slang for money, bucks).

To make the search for the child node containing a given key efficient, keys in the parent node are stored in an ordered array, so that binary search can be used. That binary search is (under many workloads) the single most CPU-critical operation.
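
Assuming the internal_node layout sketched above, that search looks like this (the invariant is that key[i] is the smallest key stored under child[i], so we look for the last i with key[i] <= key):

/* return the index of the child whose key range contains the key;
 * assumes key >= node->key[0], which holds for any key that falls
 * within this node's range */
find_child(node, key) {
        int lo = 0, hi = node->nr_items - 1;

        while (lo < hi) {
                int mid = (lo + hi + 1) / 2; /* round up to make progress */

                if (node->key[mid] <= key)
                        lo = mid;     /* key is at or after mid's range */
                else
                        hi = mid - 1; /* key is before mid's range */
        }
        return lo; /* the last index with key[lo] <= key */
}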

Now we know enough to code the reiser4 lookup algorithm (search.c:traverse_tree()):

traverse_tree(key) {
        node     *parent; /* parent node */
        node     *child;  /* child node */
        position  ptr;    /* position of record within node */

        child = ubernode;
        lock(child);           /* lock uber-node before reading root pointer */
        ptr   = ubernode.root; /* put root block number in ptr */

        while (!node_is_leaf(child)) {
                parent = child;
                /* load node, stored at the given block, into memory */
                child = node_at(ptr);
                lock(child);        /* take the child lock before... */
                ptr = binary_search(child, key);
                unlock(parent);     /* ...releasing the parent lock */
        }
        return ptr; /* the leaf containing ptr is returned locked */
}

A few comments:

a position identifies a record within a node. At the leaf level this is an actual record with user data; at non-leaf levels the record contains a pointer to a child node.

note that the lock on the child node is acquired before the lock on the parent is released, that is, two locks are held at some moments. This is a traditional tree traversal technique called lock coupling or lock crabbing.

traverse_tree() returns with ptr positioned at the record with the required key (or at the place where such a record would be inserted) and with the appropriate leaf node locked.

As was already mentioned above, tree lookup is a comparatively expensive (computationally, if nothing else) operation. At the same time, almost everything in reiser4 uses it extensively. As a result, significant effort has been put into making tree lookup more efficient. The algorithm itself doesn't leave much to be desired: its core is binary search, and that can be optimized only so far. Instead, multiple mechanisms were developed that allow bypassing the full tree lookup under some conditions. These mechanisms are:

seals

vroot

look-aside cache (cbk cache)

1.1. seals

A seal is a weak pointer to a record in the tree. Every node in the tree maintains a version counter that is incremented on each node modification. After lookup for a given key has been performed, a seal can be created that remembers the block number of the found leaf node and its version counter at the moment of lookup. Seal verification checks that the node recorded in the seal is still in the tree and that its version counter is still the same as recorded. If both conditions are met, the pointer to the record returned by lookup can still be used, and an additional lookup for the same key is avoided.
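
In the spirit of the pseudocode above, a seal and its verification might look like this (a sketch; field and helper names are illustrative, not the actual seal interface):

seal_make(leaf, ptr) {
        seal s;

        s.blk     = block_number(leaf); /* where the leaf lives on disk */
        s.version = leaf->version;      /* leaf's counter at lookup time */
        s.ptr     = ptr;                /* position of the record found */
        return s;
}

seal_validate(s) {
        node *leaf;

        leaf = node_in_memory(s.blk);   /* still cached and in the tree? */
        if (leaf != NULL && leaf->version == s.version)
                return s.ptr;  /* hit: reuse the result, skip the lookup */
        return INVALID;        /* miss: a full traverse_tree() is needed */
}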

1.2. vroot

A higher-level file system object such as a regular file or a directory is represented as a set of tree records. Keys of these records are usually confined to some key range and, due to the nature of B-trees, are all stored in nodes having a common ancestor that is not necessarily the root. That is, the records constituting a given object are located in some subtree of the reiser4 internal tree. The idea of the vroot (virtual root) optimization is to track the root of that subtree and to start lookups for the object's records from the vroot rather than from the real tree root. The vroot is updated lazily: when lookup finds that the tree was modified so that the object's subtree is no longer rooted at the vroot, tree traversal restarts from the real tree root and the vroot is re-determined during the descent.
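
Sketched in the same pseudocode (names are illustrative, and the real code also has to worry about locking during the fallback):

lookup_in_object(obj, key) {
        position ptr;

        if (obj->vroot != NULL) {
                /* fast path: descend from the cached subtree root */
                ptr = traverse_from(obj->vroot, key);
                if (ptr != NOT_HERE)
                        return ptr;
                /* tree changed: subtree no longer rooted at vroot */
        }
        /* slow path: descend from the real root, remembering the lowest
         * node whose key range covers all of the object's keys; install
         * that node as the new vroot for future lookups */
        return traverse_updating_vroot(obj, key);
}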

An additional important advantage of vroot is that it decreases lock contention on the root node.

1.3. look-aside cache

The look-aside cache is simply a list of the last N leaf nodes returned by tree lookup. This list is consulted before embarking on a full-blown top-to-bottom traversal. This simple mechanism works due to the locality of reference of tree accesses.
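
In pseudocode (the list is kept in most-recently-used order; on a hit the leaf is returned locked, just as traverse_tree() would return it):

cbk_cache_lookup(key) {
        node *scan;

        for_each(scan, cache_list) {    /* most recently used first */
                lock(scan);
                if (key_in_range(scan, key)) {
                        move_to_front(cache_list, scan);
                        return binary_search(scan, key); /* hit */
                }
                unlock(scan);
        }
        return NOT_FOUND; /* miss: fall back to traverse_tree(key) */
}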

To be continued:

2. concurrency control: lock ordering, hi-lo ordering

3. eottl

4. node format

reiser4: 0. introduction

This is an introduction to the series of entries I am going to write about the reiser4 file system. The plan, for the time being, goes like this:

0. introduction

0. trivia

1. reiser4 strong points

2. architecture overview

1. internal tree

0. B-trees overview

1. lookup and modification

2. concurrency control: lock ordering, hi-lo ordering

3. eottl

4. node format

2. tree clients

0. regular file

1. tail to extent conversion

3. transaction engine

0. goals: atomicity, no isolation, durability

1. log structure

2. shadow updates

4. directory code

0. overview of hashed directories

1. key structure and its motivation

2. lookup readdir

3. NFS?

5. delayed allocation and the flush algorithm

0. delayed allocation motivation

1. flush algorithm

2. overview of interaction with VM

3. emergency flush

6. round up

0. trivia

Reiser4 is a relatively new file system for Linux developed by namesys. Its development started early in 2001, and currently reiser4 is included in the -mm patch series. Reiser4 is the next version of the well-known reiserfs v3 file system (also known as simply reiserfs), which was the first, and for a long time the only, journalled file system for Linux. While reiser4 is based on the same basic architectural ideas as reiserfs v3, it was written completely from scratch and its on-disk format is not compatible with reiserfs. A note on naming: namesys insists that the file system be referred to as reiser4 and not reiserfs v4, or reiser4fs, or reiserfs4, or r4fs. Development of reiser4 was sponsored by DARPA; take a look at the namesys front page for a legal statement and a list of other sponsors.

Technically speaking, reiser4 can be briefly described as a file system using

global shared balanced lazily compacted tree to store all data and meta-data

bitmaps for block allocation

redo-only WAL (write-ahead logging) with shadow updates (called "wandered logs" in reiser4)

delayed allocation in parent-first tree ordering

I shall (hopefully) describe these topics in more detail in future entries.

One last trivia point: I was amongst the reiser4 developers from the very beginning of the project. In mid-2004 I left namesys; at that point reiser4 was in a more or less debugged state, and no significant design changes have occurred since.

1. reiser4 strong points

Together with its unique design, reiser4 inherited reiserfs' strengths and weaknesses:

efficient support for very small files: multiple small files can be stored in the same disk block. This saves disk space and, much more importantly, IO traffic between secondary and primary storage;

support for very large directories: due to the balanced tree (to be described later), the file system can handle directories with hundreds of millions of files. The utility of this (and of small-file support) is often questioned, because "nobody uses the file system in that way". Reiser4's design is based on the assumption that cause and effect go in the opposite direction here: applications do not use large directories and small files precisely because existing file systems didn't provide efficient means for them. Instead, application developers had to resort to various hacks, from configuration files (which have an obvious hierarchical structure asking to be mapped onto the file system) to artificially splitting large directories into smaller ones (look at the /usr/share/terminfo/ directory as an example);

no hard limit on the number of objects in the file system. Traditional UNIX file systems (s5fs, ffs, ufs, ext2, ext3) fix the total number of objects that can exist on a file system at file system creation time. Correct estimation of this number is a tricky business. Reiser4, on the other hand, creates on-disk inodes (called "stat data" for historical reasons) dynamically;

the previous item can also be seen as a weakness, because it means that the reiser4 on-disk format is more complex than that of its brethren. Other advanced reiser4 features also come at the cost of increased format complexity. As a result, reiser4 can be corrupted in much more complicated ways than other file systems, requiring a quite sophisticated fsck.

2. architecture overview

As was already hinted above, reiser4 is architecturally quite different from other file systems. The most important difference, perhaps, is that reiser4 introduces another layer ("storage layer") between the file system and the block device. This layer provides an abstraction of a persistent "container" into which pieces of data (called "units") can be placed and later retrieved. Units are identified by "keys": when a unit is placed into the container, a key is supplied, and the unit can later be retrieved by providing its key. A unit key is just a fixed-length sequence of bits.
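
In C terms the container interface boils down to something like the following (an illustrative sketch, not reiser4's actual API):

/* a unit key is just a fixed-length sequence of bits */
struct unit_key {
        unsigned char bits[24]; /* the width is illustrative */
};

struct container; /* opaque: implemented by the storage layer */

/* place a unit into the container under the given key */
int container_insert(struct container *c, const struct unit_key *key,
                     const void *unit, unsigned len);

/* retrieve the unit previously stored under the given key */
int container_lookup(struct container *c, const struct unit_key *key,
                     void *buf, unsigned buflen);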

In reiser4 this container abstraction is implemented in the form of a special flavor of B-tree. There is one instance of such a tree (referred to as the "internal tree") per file system.

The entities of which the file system is composed (regular files, directories, symlinks, on-disk inodes, etc.) are stored in the internal tree as sequences of units.

In addition to presenting the container abstraction, the storage layer is responsible for block allocation. That is, as units are inserted and deleted, the tree grows and shrinks, taking blocks from the block allocator and returning them back. The block allocation policy implemented by the storage layer is very simple:

units with close keys are kept as close on disk as possible, and

units should be placed in the disk blocks so that key ordering corresponds to the block ordering on disk.

This allows higher layers of the system to influence the location of units on disk by selecting unit keys appropriately.

While this doesn't sound especially exciting, key-based block allocation allows reiser4 to implement one extremely important feature: global block allocation policies. For example, by assigning keys with the same prefix to the units of all objects in a given directory, higher-layer code can ensure that all objects located in this directory will be placed close to each other on disk. Moreover, by selecting the next few bits of the key based on, say, the file name, it is possible to arrange all objects in the directory in the same order as readdir returns their names, so that commands like

$ cp -a * /somewhere

will not cause a single seek during read. More on this in the section on "delayed allocation and the flush algorithm".
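
A sketch of how such keys could be assembled (the layout below is made up for illustration; the real reiser4 key structure is the subject of the planned directory-code entry):

/* illustrative 64-bit key: [ directory id | hash of file name | ... ] */
make_key(dir_id, name) {
        key = dir_id << 40;        /* shared prefix: same directory
                                    * => close keys => close disk blocks */
        key |= (hash(name) & 0xffffff) << 16; /* next bits order files
                                               * to match readdir order */
        return key;
}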

So the highest layer in reiser4 implements file system semantics using the storage layer as a container. The storage layer uses the block allocator to manage disk space and the transaction engine to keep the file system consistent in case of a crash. The transaction layer talks to the block device layer to send data to the storage.

To be continued: internal tree.