permdb format ============= The permdb format is one of the formats for key-value stores. Each permdb key-value store has two files, the data file and the index file. The data file has the same name as the base name of the store. It contains the keys and the data for each entry. It is append-only and contains enough information for fast integrity checking and rollback. The index file has the name of the data file with ".idx" added. It contains an index tree referring to the entries in the data file. It only contains partial key information, and cannnot be used on its own to verify the existence or non-existence of a key. It is append-only and does not support rollback, instead the whole file is removed if corruption is detected. All keys in a permdb key-value store are of the same length, and expected to be distributed somewhat evenly across the key space. Parameters ---------- - blocksize [32-bit] - q (bits per tree level) [32-bit], where 2^q = children per index node - keylength [32-bit] Data file --------- The data file is written as a series of commits. Each commit is fsynced before writing the next commit. Therefore, if a commit is not fully fsynced, it must be the last one. This means that the whole data file can be checked by just checking the last commit. The verification information of a commit is placed in a trailer. To make it possible to store data directly on a block device, the data file must never contain more than *blocksize - 1* contiguous zero bytes. Therefore, the data in an entry is chunked. All values are in big-endian byte order. - file - file cookie [64-bit] = 0x99c99d588b1e7983 - parameters - commits - commit - commit - ... - commit - entries - entry - entry - ... - commit start cookie [64-bit] = 0x75c2e4b3d5f643a1 - padding to 4-byte boundary - length [64-bit] - checksum = SHA-256 of whole commit except checksum and end cookie - commit end cookie [64-bit] = 0x2b05eed61b5af550 - entry - entry cookie [64-bit] = 0xe7c1cdc2ba3dc77c - key (keylength bytes) - number of chunks [16-bit] - chunks - chunklength [16-bit] - data Index file ---------- The index file stores a tree where the leaf nodes are offsets into the data file. Since the file is append-only, a write operation has to write all parts of the tree that changes, including the root node. The root node is written last, which means that it is easy to find by searching backwards from the end of file. The index file accumulates garbage as more entries are written to it, but can be removed and re-created to optimize its size, since it contains no original information. All values are in big-endian byte order. - file - file cookie [64-bit] = 0xb7e16b02ba8a6d1b - commits - commit - commit - ... - commit - nodes - node - node - ... - length [64-bit] - checksum = SHA-256 of whole commit except checksum and cookie - commit cookie [64-bit] = 0x2fb1778c74a402e4 - node - node cookie [64-bit] = 0x2e0f555ad73210d1 - children - child 0 - child 1 - ... - child q^2-1 - child [64-bit] Any of these: 1. If value is 0x0000000000000000: An empty node (nothing exists beyond this node). 2. If MSB = 0: Intermediate node. An offset into the index file. 3. If MSB = 1: Leaf node. An offset into the data file. ### Tree The last node in any commit is the root node for that commit. Each level corresponds to q bits of the key, starting with the root level at the MSB of the first byte. Integrity checking ------------------ 1. Check cookie. 2. Read last commit using length field in trailer. 3. Verify checksum. Recovery (only for data file) ----------------------------- 1. Search backwards for cookie. 2. The cookie is always 4-byte aligned. 3. Check that commit and truncate file after last complete commit.