Plain SSTable format¶

PlainTable is a RocksDB’s SST file format optimized for low query latency on pure-memory or really low-latency media.

Top-Level¶

label	type
data rows	row[N]
property
footer	fixed size

0. Row Format¶

Note

The format of data row.

label	type
encoded_key
value_size	varint32
value	char[value_size]

0.0 Key Encoding¶

Plain Encoding
- internal encoding with fixed given key size
- [length of key : varint32] + [user key] + [internal encoding] when without fixed given key size
Prefix Encoding

Note

Share the same prefix of keys to save size.

There are three type packets as below:

- Full Key: with the full key bytes. [full key flag + size] +
[full user key] + [internal encoding]
- Second Key: with the prefix size. [prefix key flag + size] +
[suffix key flag + size] + [suffix key] + [internal encoding]
- Others: with the suffix key bytes. [suffix key flag + size] +
[suffix key] + [internal encoding]

The [flag + size] is the byte with format as below:

[type(2b)|size(6b)]. But all bits of size will be set to 1 when size beyond the limit, and there will be varint32 writen after this and the size are the sum of 0x3F and value of variable size.The type are full key , second key and suffix key.

Internal Encoding

In both of Plain and Prefix encoding data, internal encoding of the internal are encoded in the same way. The internal encoding seems as below:

label	type	note
type	char	row type(value, delete, merge, etc.)
sequence ID	char[7]

This can be compressed as below when no previous value for this key in the system.

0x80

1. Property¶

1. data_size : the end of data part of the file.

2. fixed_key_len : length of the keys if all keys has the same length, 0 otherwise.

In-Memory Index¶

Warning

the In-Memory Index was built by scan the Plain SSTable file. So this is not a part of Plain SSTable file now.

On top level, In-memory Index is the hash table with each bucket to be either offset in the file or a binary search index. The binary search buffer is needed in two cases:

1. Hash Collisions: two or more prefixes are hashed to the same bucket.

2. Too many keys for one prefix: need to speed-up the look-up inside the prefix.

Format¶

The index consists of two piece fo memory: an array as hash buckets, and some binary search buffers.

record
Flag(1b)	Offset to binary search buffer or file(31b)

1. If Flag = 0 and Offset equals to the offset of end of the data of the file, it means NULL - no data for this bucket; if the offset is smaller, it means there is only one prefix for this hash bucket.

2. If Flag = 1, it means the offset is for binary search buffer.

The format of binary search buffer is as below:

label	type
number_of_records	varint32
records	fixed32[number_of_records]