Name: RoaringFormatSpec
Owner: Roaring bitmaps: A better compressed bitset
Description: Specification of the compressed-bitmap Roaring format
Created: 2016-12-14 18:50:16.0
Updated: 2018-05-12 09:36:51.0
Pushed: 2017-09-16 00:23:03.0
Homepage: http://roaringbitmap.org/
Size: 228
Language: null
GitHub Committers
User | Most Recent Commit | # Commits |
---|
Other Committers
User | Most Recent Commit | # Commits |
---|
Roaring bitmaps are used by several important systems:
Many of these systems use the following interoperable format.
This specification assumes that you are familiar with Roaring bitmaps. Please refer to the following paper for details on the design rationale:
Let us recap that Roaring bitmaps are designed to store sets of 32-bit (unsigned) integers. Thus a Roaring bitmap can contain up to 4294967296 integers. They are made of three types of 16-bit containers: array, bitset and run containers. There are between 1 and 65536 (inclusively) containers. Each container has a cardinality (value in [1, 65536]), and it has a 16-bit most significant value (also called “key”) in [0,65536). All containers are non-empty.
There are extensions of Roaring Bitmaps for storing 64-bit integers. However, the current format specification only relates to 32-bit integers.
All words are written using little endian encoding.
The layout is designed so that random access to the data is possible without materializing the bitmap in memory while being storage efficient.
Throughout, we use the following constants :
The below diagram represents a storage file containing both run containers and the offset header.
The cookie header spans either 32 bits or 64 bits.
size
be the number of containers. Then we store (size + 7) / 8
bytes, following the initial 32 bits, as a bitset to indicate whether each of the containers is a run container (bit set to 1) or not (bit set to 0). The first (least significant) bit of the first byte corresponds to the first stored container and so forth.Thus if follows that the least significant 16 bits of the first 32 bits of a serialized bitmaps should either have the value SERIAL_COOKIE_NO_RUNCONTAINER or the value SERIAL_COOKIE. In other cases, we should abort the decoding.
After scanning the cookie header, we know how many containers are present in the bitmap.
The cookie header is followed by a descriptive header. For each container, we store the key (16 most significant bits) along with the cardinality minus 1, using 16 bits for each value (for a total of 32 bits per container).
Thus, if there are x containers, the descriptive header will contain 32 x bits or 4 x bytes.
After scanning the descriptive header, we know the type of each container. Indeed, if the cookie took value SERIAL_COOKIE, then we had a bitset telling us which containers are run containers; otherwise, we know that there are no run containers. For the containers that are not run containers, then we use the cardinality to determine the type: a cardinality of up to 4096 indicates an array container whereas a cardinality above 4096 indicates a bitset container.
If and only if one of these is true
then we store (using a 32-bit value) the location (in bytes) of the container from the beginning of the stream (starting with the cookie) for each container.
The containers are then stored one after the other.
At a minimum, all Roaring implementations should be able to parse the two files in the testdata
directory, see the associated README.md file found in the directory.
int startOffset = 0;
boolean hasrun = hasRunContainer();
if (hasrun) {
out.writeInt(Integer.reverseBytes(SERIAL_COOKIE | ((size - 1) << 16)));
byte[] bitmapOfRunContainers = new byte[(size + 7) / 8];
for (int i = 0; i < size; ++i) {
if (this.values[i] instanceof RunContainer) {
bitmapOfRunContainers[i / 8] |= (1 << (i % 8));
}
}
out.write(bitmapOfRunContainers);
if (this.size < NO_OFFSET_THRESHOLD) {
startOffset = 4 + 4 * this.size + bitmapOfRunContainers.length;
} else {
startOffset = 4 + 8 * this.size + bitmapOfRunContainers.length;
}
} else { // backwards compatibility
out.writeInt(Integer.reverseBytes(SERIAL_COOKIE_NO_RUNCONTAINER));
out.writeInt(Integer.reverseBytes(size));
startOffset = 4 + 4 + 4 * this.size + 4 * this.size;
}
for (int k = 0; k < size; ++k) {
out.writeShort(Short.reverseBytes(this.keys[k]));
out.writeShort(Short.reverseBytes((short) (this.values[k].getCardinality() - 1)));
}
if ((!hasrun) || (this.size >= NO_OFFSET_THRESHOLD)) {
// writing the containers offsets
for (int k = 0; k < this.size; k++) {
out.writeInt(Integer.reverseBytes(startOffset));
startOffset = startOffset + this.values[k].getArraySizeInBytes();
}
}
for (int k = 0; k < size; ++k) {
values[k].writeArray(out);
}
char *initbuf = buf;
uint32_t startOffset = 0;
bool hasrun = ra_has_run_container(ra);
if (hasrun) {
uint32_t cookie = SERIAL_COOKIE | ((ra->size - 1) << 16);
memcpy(buf, &cookie, sizeof(cookie));
buf += sizeof(cookie);
uint32_t s = (ra->size + 7) / 8;
uint8_t *bitmapOfRunContainers = (uint8_t *)calloc(s, 1);
assert(bitmapOfRunContainers != NULL); // todo: handle
for (int32_t i = 0; i < ra->size; ++i) {
if (get_container_type(ra->containers[i], ra->typecodes[i]) ==
RUN_CONTAINER_TYPE_CODE) {
bitmapOfRunContainers[i / 8] |= (1 << (i % 8));
}
}
memcpy(buf, bitmapOfRunContainers, s);
buf += s;
free(bitmapOfRunContainers);
if (ra->size < NO_OFFSET_THRESHOLD) {
startOffset = 4 + 4 * ra->size + s;
} else {
startOffset = 4 + 8 * ra->size + s;
}
} else { // backwards compatibility
uint32_t cookie = SERIAL_COOKIE_NO_RUNCONTAINER;
memcpy(buf, &cookie, sizeof(cookie));
buf += sizeof(cookie);
memcpy(buf, &ra->size, sizeof(ra->size));
buf += sizeof(ra->size);
startOffset = 4 + 4 + 4 * ra->size + 4 * ra->size;
}
for (int32_t k = 0; k < ra->size; ++k) {
memcpy(buf, &ra->keys[k], sizeof(ra->keys[k]));
buf += sizeof(ra->keys[k]);
uint16_t card =
container_get_cardinality(ra->containers[k], ra->typecodes[k]) - 1;
memcpy(buf, &card, sizeof(card));
buf += sizeof(card);
}
if ((!hasrun) || (ra->size >= NO_OFFSET_THRESHOLD)) {
// writing the containers offsets
for (int32_t k = 0; k < ra->size; k++) {
memcpy(buf, &startOffset, sizeof(startOffset));
buf += sizeof(startOffset);
startOffset =
startOffset +
container_size_in_bytes(ra->containers[k], ra->typecodes[k]);
}
}
for (int32_t k = 0; k < ra->size; ++k) {
buf += container_write(ra->containers[k], ra->typecodes[k], buf);
}
Java lacks native unsigned integers, but integers are still considered to be unsigned within Roaring, and ordered according to Integer.compareUnsigned
.
For integers in [0,2147483647], the unsigned and signed integers are undistinguisable since Java relies a 32-bit two's complement format for its int
type.