Introducing Bitables

01 January 2015

An Introduction to Bitables and the Bitable Library

Bitables are a storage-oriented data structure designed to exploit the benefits of OS optimization to provide a low-overhead mechanism for fast range queries and ordered scans on an immutable key-value dataset. The immutability means they can be queried from multiple threads without locking. They’re also designed to be written out fast, using sequential writes, even with large numbers of keys and large amounts of data. You can think of them as a hybrid between an SSTable and a b+ tree (although, currently, they do not allow duplicate keys).

You can call this a thought experiment, because while I can imagine lots of different use cases for bitables, I have not yet had time to adequately explore them. However, preliminary benchmarks indicate that writing out a bitable is fast enough that they can be used for relatively dynamic structures, not just static ones. Some potential uses would be:

  • A Lucene-like inverted index (with segments and merging), when combined with something like roaring bitmaps.
  • A de-amortized, log-structured COLA-like structure (with bitables at the deeper levels and the higher levels stored in memory).
  • A static index where range or prefix searches are required.

Merging bitables is quite a quick operation, so they are suitable in many places where large SSTables would be used.

The basic premise is that bitables are built by adding key/value pairs in key-sorted order (like an SSTable), but they progressively build up a hierarchical index during this process (using very little in the way of extra resources to do so). Note that the library assumes you are appending keys in an already sorted order. Writes of the hierarchical index are amortized across many keys and the operating system is left to buffer them up (the index being substantially smaller than the main data) and do them as a batch.

There is a bitable library implementation, written in C (although the included example is C++). It serves as a relatively simple reference implementation, but is designed to be usable for real-world use. The library has both POSIX and Win32 implementations, although it has currently only been tested and built on Windows and Linux (Ubuntu). I don’t yet claim that the library is production ready, as it hasn’t been used fully in anger.

Here is an example of using the library to search for the lower bound of a range in a bitable, then scan forwards, reading 100 keys and values:

/* table is an already opened bitable and searchKey is a BitableValue populated
   with the key to seek to (both are assumed to be set up before this snippet). */
BitableCursor cursor;
BitableResult searchResult = bitable_find( &cursor, table, &searchKey, BFO_LOWER );

for ( int where = 0;
      searchResult == BR_SUCCESS && where < 100;
      searchResult = bitable_next( &cursor, table ), ++where )
{
	BitableValue key;
	BitableValue value;

	bitable_key_value_pair( &cursor, &key, &value );

	/* Do something here with the key and value */
}

How does it work?

Bitables consist of multiple levels:

  • The leaf level, which is the main ordered store for keys and small values.
  • The branch level(s), which are a hierarchical sorted index into the leaf levels.
  • The large value store, which stores values too large to fit in the leaf level, in sorted order.

Currently each level is stored in a separate file. When building the bitable, each file is only ever appended to (except when we write a final header page on completion), so writes to the individual files are sequential. The operating system is left to actually write these files to disk as it chooses.

The leaf and branch levels are organised into pages. Within each page, there is a sorted set of keys (and values for leaf pages), which can be searched using a binary search.

Building the bitable uses an algorithm similar to a bulk insertion into a b+ tree. Keys and small values are added to a leaf page (buffered in memory) as they are appended; when the page is full, the buffer is appended to the file and the next page is begun. When a new page is begun, its first key is added to the level above it. New levels are created when the highest current level grows to more than one page.
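
As an illustration of that loop (and not the library’s actual internals), here is a minimal sketch assuming fixed-size key/value entries and a 4KB page; the Entry type, the plain FILE handle and the branchKeys vector standing in for the level above are all simplifications:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Sketch of the bulk-build loop for the leaf level: fill a page-sized buffer, append it
// to the leaf file when full, and push the first key of each new page up to the level
// above (which is built the same way).
struct Entry { int64_t key; int64_t value; };

constexpr std::size_t PAGE_ENTRIES = 4096 / sizeof( Entry );

void buildLeafLevel( const std::vector< Entry >& sortedEntries, std::FILE* leafFile,
                     std::vector< int64_t >& branchKeys )
{
    std::vector< Entry > page;
    page.reserve( PAGE_ENTRIES );

    for ( const Entry& entry : sortedEntries )
    {
        if ( page.size() == PAGE_ENTRIES )
        {
            std::fwrite( page.data(), sizeof( Entry ), page.size(), leafFile ); // sequential append
            page.clear();
            branchKeys.push_back( entry.key ); // first key of the new page goes to the level above
        }

        page.push_back( entry );
    }

    if ( !page.empty() )
    {
        std::fwrite( page.data(), sizeof( Entry ), page.size(), leafFile );
    }
}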

Similar to a b+ tree, the first key in each branch level page is implicit and not stored. Unlike a b+ tree, branch level pages do not have a child pointer for every key, only one to the first child page. Because the pages in the level below are in order, we can work out the index of the child page from the result of the binary search.
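
As a sketch of that lookup (again simplified to fixed-size integer keys, with the branch page treated as a plain sorted array), the binary search result itself is the child page index:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: a branch page as a sorted array of keys, with the first key of the first
// child left implicit. Child pages are contiguous in the level below, so the result of
// the binary search is itself the child page index - no per-key child pointers needed.
std::size_t childPageIndex( const std::vector< int64_t >& branchKeys, int64_t searchKey )
{
    // The number of stored keys <= searchKey: 0 selects the implicit first child,
    // i selects the i-th child page.
    return static_cast< std::size_t >(
        std::upper_bound( branchKeys.begin(), branchKeys.end(), searchKey ) - branchKeys.begin() );
}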

Because we don’t have to worry about future inserts or page splitting, every page is as full as possible given the ordering, which should lead to the minimum possible number of levels for the structure (within its constraints). This keeps the number of pages needing to be touched for any one query at a minimum, reducing I/O, page faults and TLB misses, and increasing the efficiency of the page cache.

The implementation of bitables uses a trick that many b+ tree implementations use: offsets to keys (and possibly values) are stored at the front of each page and the key/value data at the back. This allows each page to be built up quickly, growing inwards from both the back and the front until it is full. However, it has some disadvantages for processor cache access patterns when searching (sorted-order binary searches are not very good for cache access patterns). In future, for search-heavy workloads, it might be reasonable to make branch levels use an internal van Emde Boas layout.
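
A rough sketch of that page layout (sizes and field names are illustrative, not the on-disk format) might look like this, with the offset array growing forwards and the data growing backwards until they meet:

#include <cstdint>
#include <cstring>

// Sketch of the "offsets at the front, data at the back" layout. The offset array grows
// forwards, the key data grows backwards, and the page is full when the two would meet.
struct PageBuilder
{
    uint8_t  page[ 4096 ];
    uint16_t itemCount = 0;
    uint16_t dataEnd   = 4096;

    bool append( const void* keyData, uint16_t keySize )
    {
        uint16_t offsetSlot = static_cast< uint16_t >( sizeof( uint16_t ) * itemCount );

        if ( offsetSlot + sizeof( uint16_t ) + keySize > dataEnd )
        {
            return false; // page is full, flush it and start the next one
        }

        dataEnd = static_cast< uint16_t >( dataEnd - keySize );
        std::memcpy( page + dataEnd, keyData, keySize );                // data at the back
        std::memcpy( page + offsetSlot, &dataEnd, sizeof( uint16_t ) ); // offset at the front
        ++itemCount;
        return true;
    }
};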

The Implementation

The implementation is available on GitHub and the documentation is available here. A premake4 file to create a build is included (and the Windows premake4 binary I used). On Ubuntu, I used the premake4 package (from apt-get). Let me know if there are any issues, or any other platforms you have tried it on.

The reference implementation relies on the operating system where possible (memory mapped files, the page cache) to do the heavy lifting for things like caching and ordering writes in a smart way. It is a bit naive in that the buffers it uses for writing out bitables are only one page in size (per level), so the number of system calls is probably higher than it needs to be.

Reading the bitable uses a zero-copy memory mapping system, where keys and values come from pointers directly into the file (the mapping uses read-only page protection, so it shouldn’t be possible to hose the data). Power-of-2 alignments are supported for keys and values, specified when initializing the table for creation, so you can cast and access data structures in place. Currently, omitting key or value sizes for fixed-size keys/values is not supported, but this seems like a good future optimisation for many use cases.
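
For example, with 8 byte key alignment specified at creation time, a 24 byte key like the one used in the benchmarks below (two doubles and a 64-bit timestamp) can be read in place straight out of the mapped file; the struct and helper here are my own illustration, not part of the library:

#include <cstdint>

// Illustrative key layout matching the benchmark dataset below: latitude and longitude
// as doubles plus a 64-bit millisecond timestamp, 24 bytes in total.
struct GeoKey
{
    double  latitude;
    double  longitude;
    int64_t timestampMs;
};

static_assert( sizeof( GeoKey ) == 24, "packed end to end, 8 byte aligned" );

// Valid only because the table was created with (at least) 8 byte key alignment,
// so the pointer handed back for a key is suitably aligned for a direct cast.
const GeoKey* AsGeoKey( const void* keyData )
{
    return static_cast< const GeoKey* >( keyData );
}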

Apart from initializing and opening the bitable, for read operations a few rules are followed:

  • No heap allocation (although, memory may be allocated into the page cache for demand caching by the operating system).
  • Cursors and statistics structures may be allocated on the stack.
  • No locking is required (including internally).
  • No system calls made (memory mapped IO).
  • There is no mutation of the bitable data structure/no reliance on internal shared mutable state.
  • All the code is re-entrant and there is no reliance on statics or global mutable state.

In the reference library, the bitable write completion process is atomic (the table won’t be readable unless it has been written completely); this is achieved by writing a checksummed header as the last step in the process. The library also offers durable and non-durable modes for finishing the bitable. The durable mode completes writes and flushes (including through the on-disk cache) in a particular order to make sure the bitable is completely on storage; the non-durable mode writes in the same order but does not perform the flushing (for use cases where a bitable is only used within the current power cycle).
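
The general pattern is the usual commit-via-final-header trick; a simplified POSIX-flavoured sketch (not the library’s code, and glossing over checksum calculation and flushing the drive’s own write cache) looks something like this:

#include <unistd.h>
#include <cstdint>

// The data pages have already been appended after the (as yet unwritten) header page.
// Readers treat the table as invalid until a header with a correct checksum is present.
struct Header { uint64_t itemCount; uint64_t checksum; /* ... */ };

bool finishTable( int fd, const Header& header, bool durable )
{
    if ( durable && fsync( fd ) != 0 ) // force the appended pages onto storage first
    {
        return false;
    }

    // The checksummed header is written last, at a fixed known offset (0 in this sketch),
    // making completion atomic from a reader's point of view.
    if ( pwrite( fd, &header, sizeof( header ), 0 ) != static_cast< ssize_t >( sizeof( header ) ) )
    {
        return false;
    }

    return !durable || fsync( fd ) == 0; // then force the header itself onto storage
}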

The implementation has been designed so that the reading and writing parts of the bitable code are not dependent on each other (although they do depend on some small common parts), even though they are currently built into the library together. The library is small enough (it can fit entirely in the instruction cache) that you don’t really need to worry about this in general.

Platform and portability-wise, the implementation also uses UTF-8 for all file paths (including on Windows). The table files produced by the bitable system use platform endianness and the like, so they may not be portable between platforms.

Note that this is a side project for me and has been done in very limited time, so there is plenty of room for improvement, extra work and further rigor to be applied.

Does it actually perform decently?

All these benchmarks were performed on my Windows 8.1 64-bit machine with 32GB of RAM, a Core i7 4790 and a Samsung 850 Pro 512GB SSD. All tests use 4KB pages and 8 byte key and value alignment. Note that these benchmarks are definitely not the final say on performance and I haven’t had time to perform adequate benchmarking for multi-threaded reads.

Our test dataset consists of 129,672,136 key-value pairs, with 24 byte keys (latitude and longitude as 8 byte doubles, as well as a 64-bit Unix millisecond timestamp), each matched to an 8 byte value. It’s 3.86 GiB, packed end to end.

Reading the data set from a warm memory mapped file and creating the bitable takes (averaged over 4 runs) 11.01 seconds in the non-durable mode and 13.08 seconds in the durable mode. The leaf level is 4.84 GiB and the 3 branch levels are 34 MiB, 240 KiB and 4 KiB respectively.

From this dataset, we then selected 2,026,127 keys evenly distributed across the dataset, then randomly shuffled them. We then used these for point queries (on a single thread), which on a cold read of the bitable averaged 7.51 seconds (again averaged over 4 runs) and on warm runs 1.81 seconds (keys and values were read and verified). That’s well over a million randomly ordered point queries per second on a single thread.




A Fun Little Geometric Visibility Thing

15 October 2014

This is just an interesting little tidbit that I stumbled across a fair few years ago (2006 or so, I think) and have been meaning to blog about ever since (I might already have and forgotten). There is a neat trick of geometry that allows you to solve yes-no visibility queries from a point for polygon soups with nothing more than ray-casting and a few geometric tests, with no polygon clipping and no sorting.

I’m not claiming this will actually be useful to anyone; in fact, floating point precision problems make it difficult to do in practice. But it may come in handy for someone, and it may well be practical for drastically simplified versions of a scene.

Firstly, we’ll talk about visibility queries without any view restrictions (omni-directional from a point). A simple assertion: for non-intersecting geometry (intersecting geometry can be handled, with the caveat that the line segments where surfaces intersect must be treated as edges), a polygon or polygonal object is visible if a ray from the query point passing through 2 edges on the silhouette of any object in the scene hits the object without being occluded.

If you are familiar with the rule for visibility queries between polygons (that there must exist an extremal stabbing line going through 4 edges between 2 polygons for them to be visible to each other), this is a natural extension of that, because the constraint of the point in the query (the eye, the camera, or whatever you want to call it) is equivalent to the constraint of 2 edges. So we are left with the point and 2 edges to define an extremal stabbing line, of which one must exist for the visibility query to return a positive result.

Now, there is the case of determining which edges, in particular, will provide us with a ray to cast and what the ray will be. Luckily, there is a straightforward answer to this:

  1. They must be on the silhouette of an object and if the object is a polygon connected to other polygons, this includes the edges of the polygon in the query.
  2. For such a ray to exist for 2 edges and point, each edge must straddle the plane through the other edge and the query point.
  3. The ray going through these 2 edges is the intersection of these planes, so it must be perpendicular to both planes. This means the direction of the ray is the cross product of the normals of the planes going through each edge and the query point (see the sketch after this list).
  4. The origin of the ray is obviously the query point.
  5. This ray must obviously intersect with the query object too (if one or more of the edges is on the query object, this is a given).
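
Here is a minimal sketch of building such a candidate ray (my own illustration of the construction above, with no attempt to address the floating point precision issues mentioned earlier):

struct Vec3 { double x, y, z; };

Vec3 operator-( const Vec3& a, const Vec3& b ) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

Vec3 cross( const Vec3& a, const Vec3& b )
{
    return { a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x };
}

double dot( const Vec3& a, const Vec3& b ) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Normal of the plane through the query point p and the edge (a, b).
Vec3 planeNormal( const Vec3& p, const Vec3& a, const Vec3& b ) { return cross( a - p, b - p ); }

// Try to build the candidate ray through p and the edges (a0,b0) and (a1,b1).
// Returns false if the straddle condition fails (no extremal line through both edges).
bool extremalRay( const Vec3& p, const Vec3& a0, const Vec3& b0,
                  const Vec3& a1, const Vec3& b1, Vec3& rayDirectionOut )
{
    Vec3 n0 = planeNormal( p, a0, b0 );
    Vec3 n1 = planeNormal( p, a1, b1 );

    // Each edge must straddle the plane through the other edge and the query point.
    if ( dot( n0, a1 - p ) * dot( n0, b1 - p ) > 0.0 ) return false;
    if ( dot( n1, a0 - p ) * dot( n1, b0 - p ) > 0.0 ) return false;

    rayDirectionOut = cross( n0, n1 ); // perpendicular to both planes; the ray origin is p
    return true;
}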

There is of course a special case to this: 2 edges often intersect at a vertex, in which case there will exist a ray going through the query point and the vertex.

Anyway, to solve the query, you enumerate all the query point silhouette edge-edge cases and see if you find one that hits the query object without being occluded by another object. If such a ray exists, then the object is visible. You could do a whole scene at once by enumerating all the query point edge-edge cases and knowing that only the objects that got hit were visible.

You could, potentially, do something similar to a beam-tree to accelerate finding which edges straddled the planes of other edges. If I were to make a terrible pun, I would say this is a corner case of a beam-tree (in that, the rays in question are essentially the corners of significant plane intersections in the beam-tree).

This also works in reverse for the view frustum and portal cases; in this case the ray through the query point and edges must pass through the portals and the view frustum. This allows you to perform portal queries without clipping the portals.

There is also, I suspect, a fair bit of temporal coherence for the edges involved in a visibility query; if the query point moves only slightly, it is likely the same edges will produce an extremal stabbing line that proves visibility.


Tags: Geometry  Visibility 



Better Vertex Cache Optimised Index Buffer Compression

30 September 2014

I’ve had a second pass at the compression/decompression algorithm now, after some discussion with davean on #flipCode. The new iteration of the algorithm (which is available alongside the original) is a fair bit faster for decompression, slightly faster for compression and has a slightly better compression ratio. The updated algorithm doesn’t, however, support degenerate triangles (triangles that have a duplicate vertex index), whereas the original does.

The original is discussed here.

The main change to the algorithm is the switch from per-vertex codes to a single composite triangle code. Currently, for edge FIFO hits there are two 2 bit codes and, for the other cases, three 2 bit codes. There are a lot of redundant codes here, because not all combinations are valid and some code combinations are rotations of each other. The new algorithm rotates the vertex order of triangles to reduce the number of codes, which leads to one 4 bit code per triangle. This nets us a performance improvement and a mild compression gain.
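
The rotation itself is simple; as a sketch of the idea (not the exact scheme in the code), the triangle is rotated so that the vertex chosen as the canonical starting point (for example, the start of a matched edge) comes first, preserving the winding while collapsing rotated code combinations into one:

#include <cstdint>

// Rotate the triangle so that indices[ firstVertex ] becomes the first vertex.
// Rotation preserves the winding direction, so the rendered triangle is unchanged.
void rotateTriangle( uint32_t ( &indices )[ 3 ], int firstVertex )
{
    uint32_t rotated[ 3 ] =
    {
        indices[ firstVertex ],
        indices[ ( firstVertex + 1 ) % 3 ],
        indices[ ( firstVertex + 2 ) % 3 ]
    };

    indices[ 0 ] = rotated[ 0 ];
    indices[ 1 ] = rotated[ 1 ];
    indices[ 2 ] = rotated[ 2 ];
}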

There were also some left-over codes, so I added a special encoding that took the edge + new vertex triangle case and special-cased it for the 0th and 1st edges in the FIFOs. This could potentially net very big wins if you restrict your ordering to always use a triangle adjacent to the last one, trading vertex cache optimisation for compression.

Here’s the performance of the old algorithm:

MODEL        TRIANGLES    UNCOMPRESSED (BYTES)    COMPRESSED (BYTES)    COMPRESSION TIME    DECOMPRESSION TIME
Bunny           69,630                 835,560               111,786               3.7ms                1.45ms
Armadillo      345,944               4,151,328               563,122                17ms                 6.5ms
Dragon         871,306              10,455,672             1,410,502              44.8ms                16.6ms
Buddha       1,087,474              13,049,688             1,764,509                54ms                20.8ms

And here is the performance of the new one:

MODEL        TRIANGLES    UNCOMPRESSED (BYTES)    COMPRESSED (BYTES)    COMPRESSION TIME    DECOMPRESSION TIME
Bunny           69,630                 835,560               108,908              2.85ms                0.92ms
Armadillo      345,944               4,151,328               547,781              15.9ms                 4.6ms
Dragon         871,306              10,455,672             1,363,265              36.3ms                11.8ms
Buddha       1,087,474              13,049,688             1,703,928                46ms                14.7ms

Overall, there seems to be a 30%-40% speedup on decompression and a modest improvement of around 3% in compressed size.

Also, I’d like to thank Branimir Karadžić, whose changes I back-ported, which should hopefully add support for 16bit indices and MSVC 2008 (unless I’ve broken them with my changes - I don’t have MSVC 2008 installed to test).

If you haven’t checked out the source, it’s available on GitHub.


Tags: Graphics  Compression 



Vertex Cache Optimised Index Buffer Compression

28 September 2014

Update - There is a new post about an updated version of the algorithm listed here.

A few weeks ago, I saw Fabian Giesen’s Simple lossless(*) index buffer compression post (and the accompanying code on GitHub) and it made me think about a topic I haven’t thought about in a while, index buffer compression. A long time ago, at university, I built a mesh compressor as a project. It had compression rates for triangle lists of less than 2 bits a triangle (and did pretty well with vertices too), but it had a few limitations: it completely re-ordered the triangles and it required that at most 2 triangles share an edge. The performance was semi-decent, but it was also relatively heavy and complicated, as well as requiring a fair bit of dynamic allocation for maintaining things like connectivity structures and an edge stack.

Re-ordering the triangles has the disadvantage that it destroys post-transform vertex cache optimisation (although it did optimise order for the pre-transform vertex cache). Now, the vertex cache utilisation of these algorithms isn’t generally catastrophic, as they exploit locality, but it is definitely worse than that produced by Tom Forsyth’s algorithm.

Fabian’s algorithm has the advantage of keeping triangle order, while performing better when vertex cache optimisation has been performed. It’s also incredibly simple, easy to implement and uses very little in the way of extra resources. The key insight is that triangles that have been vertex cache optimised are likely to share an edge with the previous triangle.

But what if we extend this assumption a little: a triangle is also likely to share edges and vertices with other recent triangles. This is fairly obvious, as it is exactly what the vertex cache optimisation aims for when it re-orders the triangles.

The Algorithm

Post-transform vertex caches in hardware are typically implemented as fixed-size FIFOs. Given that is a case we’ve already optimised for (and fixed-size ring-buffer FIFOs are very cheap to implement), this is the concept we’ll use as the basis for our algorithm. As well as a vertex index FIFO cache, we’ll also introduce an edge FIFO cache that holds the most recent edges.
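
A minimal sketch of such a fixed-size ring-buffer FIFO (the sizes, names and layout are illustrative, not the implementation’s):

#include <cstdint>

// Fixed-size ring-buffer FIFO: push overwrites the oldest entry, find returns a
// relative index back from the most recent entry (0 = most recent), or -1 on a miss.
template < typename T, uint32_t SIZE >
struct Fifo
{
    T        entries[ SIZE ] = {};
    uint32_t head            = 0;

    void push( const T& value )
    {
        entries[ head % SIZE ] = value;
        ++head;
    }

    int find( const T& value ) const
    {
        uint32_t count = head < SIZE ? head : SIZE;

        for ( uint32_t relative = 0; relative < count; ++relative )
        {
            if ( entries[ ( head - 1 - relative ) % SIZE ] == value )
            {
                return static_cast< int >( relative );
            }
        }

        return -1;
    }
};

struct Edge
{
    uint32_t first;
    uint32_t second;

    bool operator==( const Edge& other ) const { return first == other.first && second == other.second; }
};

// Usage: Fifo< uint32_t, 32 > vertexFifo; Fifo< Edge, 32 > edgeFifo;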

What about cache misses? Well, these fall into two categories; new vertices that haven’t yet been seen in the triangle list and cache misses for repeated vertices. If we re-order vertices so that they are sorted in the order they appear in the triangle list, then we can encode them with a static code and use an internal counter to keep track of the index of these vertices.

Cache misses are obviously the worst case for the algorithm and we encode these relative to the last seen new vertex, with a v-int encoding.
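
The v-int itself is the usual 7-bits-per-byte variable length integer (the same scheme protobuf uses); a sketch of the write side, assuming a simple byte vector rather than the implementation’s bitstream:

#include <cstdint>
#include <vector>

// Variable length integer: 7 bits of payload per byte, high bit set while more bytes follow.
void writeVInt( std::vector< uint8_t >& out, uint32_t value )
{
    while ( value >= 0x80 )
    {
        out.push_back( static_cast< uint8_t >( value ) | 0x80 );
        value >>= 7;
    }

    out.push_back( static_cast< uint8_t >( value ) );
}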

So, we end up with four codes (which, for a fixed code size, means 2 bits; an illustrative sketch of them follows the list):

  • New vertex, a vertex we haven’t seen before.
  • Cached edge, edges that have been seen recently and are in the FIFO. This is followed by a relative index back into the edge FIFO (using a small fixed bit width; 5 bits for a 32 entry FIFO).
  • Cached vertex, vertices that have been seen recently and are in the FIFO. This is followed by a relative index back into the vertex FIFO (same fixed bit width and FIFO size as the cached edge case).
  • Free vertex, for vertices that have been seen but not recently. This code is followed by a variable length integer encoding (like that seen in protobuf) of the index relative to the most recent new vertex.
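
As a sketch, the codes could be represented like this (the names and numeric values are illustrative; the implementation’s constants may differ):

#include <cstdint>

// Illustrative names for the four 2 bit vertex codes.
enum VertexCode : uint32_t
{
    VC_NEW_VERTEX    = 0, // next index in first-appearance order, tracked by a counter
    VC_CACHED_EDGE   = 1, // followed by a 5 bit relative index into the 32 entry edge FIFO
    VC_CACHED_VERTEX = 2, // followed by a 5 bit relative index into the 32 entry vertex FIFO
    VC_FREE_VERTEX   = 3  // followed by a v-int delta back from the most recent new vertex
};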

Triangles can either consist of two codes, a cached edge followed by one of the vertex codes, or of 3 of the vertex codes. The most common codes in an optimised mesh are generally the cached edge and new vertex codes.

Cached edges are always the first code in any triangle they appear in and may correspond to any edge in the original triangle (we check all the edges against the FIFO). This means that an individual triangle may have its vertices specified in a different order (but in the same winding direction) than the original uncompressed one. It also simplifies the implementation in the decompression.

While we use a quite simple encoding, with lots of fixed widths, the algorithm has also been designed such that an entropy encoder could be used to improve the compression.

With the edge FIFO, for decompression performance, we don’t check whether an edge is already in the FIFO before inserting it, but we do skip inserting an edge when we encounter a cached edge code. The same goes for cached vertices. This means the decompression algorithm never has to probe the FIFOs.

Results

I’ve provided a simple proof of concept implementation on GitHub. These results were obtained with a small test harness and this implementation, using Visual Studio 2013 and on my system (with a bog standard i7 4790 processor).

To make things consistent, I decided to make my results relative to those provided in Fabian’s blog, using the same Armadillo mesh. His results use a file that encodes both the positions of the vertices and the indices in a simple binary.

The Armadillo mesh has 345,944 triangles, which is 4,151,328 bytes of index information. After running a vertex cache optimisation (using Tom Forsyth’s algorithm) and then compressing with my algorithm, the index information takes up 563,122 bytes, or about 13 bits a triangle. By comparison, running 7zip on just the optimised indices comes in at 906,575 bytes (so the algorithm quite handily beats much slower standard dictionary compression). Compression takes 17 milliseconds and decompression takes 6.5 milliseconds, averaged over 8 runs. Even without the vertex cache optimisation, the algorithm still compresses the mesh (but not well), coming in at 1,948,947 bytes for the index information.

For comparison to Fabian’s results and using the same method as him, storing the uncompressed vertices (ZIP and 7z results done with 7-zip):

STAGE                              SIZE    .ZIP SIZE    .7Z SIZE
Compression (Vertex Optimised)    2577k        1518k       1276k

Performance-wise, the algorithm offers linear scaling, decompressing about 50 million triangles a second and compressing about 20 million a second across a variety of meshes, with my implementation on a single core. For compression, I used a large pre-allocated buffer for the bitstream to avoid the overhead of re-allocations. Compression also comes in very consistently at between 12 and 13 bits a triangle.

The compressed and uncompressed columns in the below table refer to sizes in bytes. The models are the typical Stanford scanned test models.

MODEL        TRIANGLES    UNCOMPRESSED    COMPRESSED    COMPRESSION TIME    DECOMPRESSION TIME
Bunny           69,630         835,560       111,786               3.7ms                1.45ms
Armadillo      345,944       4,151,328       563,122                17ms                 6.5ms
Dragon         871,306      10,455,672     1,410,502              44.8ms                16.6ms
Buddha       1,087,474      13,049,688     1,764,509                54ms                20.8ms

Notes About the Implementation

Firstly, the implementation is in C++ (but it wouldn’t be too hard to move it to straight C, if that is your thing). However, I’ve eschewed dependencies wherever possible, including the STL, and tried to avoid heap allocation as much as possible (only the bitstream allocates, and only if it needs to grow its buffer). I’ve currently only compiled the implementation on MSVC, although I have tried to keep other compilers in mind. If you compile somewhere and it doesn’t work, let me know and I’ll patch it.

The implementation provides a function that compresses the indices (CompressIndexBuffer) and one that decompresses the indices (DecompressIndexBuffer), located in IndexBufferCompression.cpp/IndexBufferCompression.h and IndexBufferDecompression.cpp/IndexBufferDecompression.h. There are also some shared constants (nothing else is shared between them) and some very simple read and write bitstream implementations. While this is a proof of concept implementation, it’s designed such that you can just drop the relevant parts (compression or decompression) into your project without bringing in anything else. Compression doesn’t force the flushing of the write stream, so make sure you call WriteBitstream::Finish to flush the stream after compressing, or the last few triangles might be incorrect.

There has been little optimisation work apart from adding some force-inlines due to compiler mischief, so there is probably some room for more speed here.

I’ve allocated the edge and vertex FIFOs on the stack and made the compression and decompression functions work on a whole mesh at a time. It’s possible to implement the decompression lazily, decompressing a single triangle (or a batch) at a time. Doing this, you could decompress in place as you walked across the mesh. I’m not sure whether the performance is there for you to do this in real-time; that probably depends on your use case.


Tags: Graphics  Compression 



C++ has no home

23 August 2014

On reviewing the new C++ standard, C++14, I’ve come to the conclusion that modern C++ has no home. C and a very limited subset of C++ (colloquially known as C+, which hasn’t seen many improvements since C++98) are at home as systems languages, because very few languages offer the capability to work with constructs that are known to be free of unpredictable performance overhead. They also offer the ability to play with memory in a very raw but controlled way, which can be very useful for systems languages. There are a plethora of higher level languages, run-times and frameworks which are far better at being application languages (both server and client side) than C++ and offer good enough performance, better semantics, far superior standard libraries and far greater productivity than modern C++.

On top of this, C++ has continued to become more complicated (C++98 was already a complicated and semantically obtuse beast) and, even as someone who has been working with the language for over 15 years, I can find it non-trivial to understand what someone using some of the newer features is trying to achieve. Beyond that, the effort spent making the existing features of C++ work together precludes features being added that would actually add something to the language (type-safe discriminated unions and de-structuring!). Compile times somehow manage to get worse with each iteration of the language, and compiler error messages when something goes wrong with modern C++ mechanisms are notoriously hard to understand. To top this off, the non-trivial semantics make it hard to reason about the performance implications of code, when that is the only reason left to use C++ over other languages that offer higher level features.

A large part of the reason for this is the persistence with using textual replacement as the main means to achieve anything in C++. This is the hangover from C, where header files, clumsy separable compilation and the preprocessor were king. The entire problem with templates and C++ compile times in general comes from this mechanism, where instead of relying on sensible and well-defined compiler semantics that transform code into well-defined types that the compiler can hold easily in memory between compilation units (where the compiler only needs to know the symbols and types match, outside of optimisation), templates are essentially an expanded blob of text with some symbols replaced and they have to be expanded (with checks) for every separate compilation unit (sometimes several times), where the semantics might be different each time. In fact, every type and function known to the compiler has to be fully expanded and parsed (with possibly changing semantics) for every compilation unit. Is there a greater sign that this mechanism is broken than the pimpl concept?

Now, in some circles, what I’m going to say is blasphemy, but, really it is time for language mechanisms based on textual replacement to have a pillow put over their collective faces and for the rest of us to throw a drinking fountain through the window and escape. It’s the reason we can’t have nice things. Other languages manage to provide generic typing and meta-programming mechanisms without mile long compiler messages or one hour compile times (yes, even with types that have different sizes and don’t require boxing). I often see C++ programmers complaining about C++ language features as if the concepts (like generic typing, lambdas/closures and meta-programming) are themselves bad, when it is not the concept, but the C++ implementation that is so horrible.

As an aside, exceptions are also implemented poorly in C++, but for other reasons. In the exception holy war, people tend to pick a side based on whether performance or robustness is the primary goal and C++ can get away with a poor exception handling mechanism because many of the people that use it are on the performance side of the fence and avoid exception handling anyway.

Yes, we have been promised a module system, but I don’t really think it will ever be enough, because the language already brings too much baggage. I think it has actually reached the stage where we need a new language in this space, that offers what C+ and C offer (apart from textual replacement and the clumsy separable compilation mechanism), while bringing in far more sensible versions of the features C++ is trying to provide.


Tags: Programming 


