summaryrefslogtreecommitdiff
path: root/doc/permdb.md
blob: bb73903e8732b822d8e8fa8d6540ea94db80cb1c (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
permdb format
=============

The permdb format is one of the formats for key-value stores.

Each permdb key-value store has two files, the data file and the index
file.

The data file has the same name as the base name of the store. It
contains the keys and the data for each entry. It is append-only and
contains enough information for fast integrity detection and rollback.

The index file has the name of the data file with ".idx" added. It
contains an index tree referring to the entries in the data file. It
only contains partial key information, and cannnot be used on its own
to verify the existence or non-existence of a key. It is append-only
and does not support rollback, instead the whole file is removed if
corruption is detected.

All keys in a permdb key-value store are of the same length, and
expected to be distributed somewhat evenly across the key space.

Parameters
----------

- blocksize [32-bit]
- q (bits per tree level) [32-bit], where 2^q = children per index node
- keylength [32-bit]


Data file
---------

The data file is written as a series of commits. Each commit is
fsynced before writing the next commit. Therefore, if a commit is not
fully fsynced, it must be the last one. This means that the whole data
file can be checked by just checking the last commit. The verification
information of a commit is placed in a trailer.

To make it possible to store data directly on a block device, the data
file must never contain more than *blocksize - 1* contiguous zero
bytes. Therefore, the data in an entry is chunked.

All values are in big-endian byte order.

- file
    - file cookie [64-bit] = 0xd53551ba539a4252
    - parameters
    - commits
        - commit
        - commit
        - ...

- commit
    - entries
        - entry
        - entry
        - ...
    - padding to 4-byte boundary
    - length [32-bit]
    - checksum = SHA-256 of whole commit except checksum and cookie
    - commit cookie [64-bit] = 0x2b05eed61b5af550

- entry
    - entry cookie [64-bit] = 0xe7c1cdc2ba3dc77c
    - key = (keylength bytes)
    - number of chunks [16-bit]
    - chunks
        - chunklength [16-bit]
        - data


Index file
----------

The index file stores a tree where the leaf nodes are offsets into the
data file. Since the file is append-only, a write operation has to
write all parts of the tree that changes, including the root node. The
root node is written last, which means that it is easy to find by
searching backwards from the end of file. The index file accumulates
garbage as more entries are written to it, but can be removed and
re-created to optimize its size, since it contains no original
information.

All values are in host-endian byte order.

- file
    - file cookie [64-bit] = 0xb7e16b02ba8a6d1b
    - commits
        - commit
        - commit
        - ...

- commit
    - nodes
        - node
        - node
        - ...
    - length [64-bit]
    - checksum = SHA-256 of whole commit except checksum and cookie
    - commit cookie [64-bit] = 0x2fb1778c74a402e4

- node
    - node cookie [64-bit] = 0x2e0f555ad73210d1
    - children
        - child 0
        - child 1
        - ...
        - child q^2-1

- child [64-bit]

    Any of these:
    1. If value is 0x0000000000000000: An empty node
       (nothing exists beyond this node).
    2. If MSB = 0: Intermediate node. An offset into the index file.
    3. If MSB = 1: Leaf node. An offset into the data file.

### Tree

The last node in any commit is the root node for that commit. Each
level corresponds to q bits of the key, starting with the root level
at the MSB of the first byte.


Integrity detection
-------------------

1. Check cookie.
2. Read last commit using length field in trailer.
3. Verify checksum.


Recovery (only for data file)
-----------------------------

1. Search backwards for cookie.
2. The cookie is always 4-byte aligned.
3. Check that commit and truncate file after last complete commit.