Horizontal SSE Stable Sort Index Generation

19 September 2015

Well, the topic of this blog post is a deceptively simple one: given 4 float values in an SSE register (although you could implement this for integers), we’re going to perform a stable sort and output the destination indices (from 0 to 3) for the sorted elements in another SSE register, without branches. This came about because I thought it might be useful for some QBVH nearest neighbour distance queries I’m potentially going to be implementing.

I haven’t yet thought about if or how this would be extended to a wider instruction set like AVX, but I don’t think it would be impossible (just hard).

The Basic Premise

Firstly, in this case, just sorting the keys themselves isn’t enough; we actually want to be able to treat them as keys to re-order another set of values. To achieve this, we’re going to produce destination indices (i.e. each original key will have a new “destination” in the sorted output), which can be used either to re-order the original keys, or other values stored in the same order as the original keys.

Instead of using swap operations, as you would in a regular sorting network implementation, we’re going to build the indices using a system where all keys start at index 0, then we increase a key’s index by 1 every time it has a “losing” comparison against another key. So, if a key loses a comparison against all 3 other keys, its index will be 3.

Obviously, a naive implementation of this would have problems with keys that are equal. To get around this, we’re going to implement a simple tie-breaking rule: when two keys are equal, the key that appears later in the original order will lose. This also gives the sort the property of being stable, so values that are equal will maintain their original relative ordering.

One of the interesting things about this is that because the operation used to change the relative order (adding 1 to the index) is associative, it does not matter if we use the same key in multiple comparisons at once; we can combine the offsets to the indices at the end.

So, assuming an ascending sort, we’ll do two “rounds” of comparisons (considering we can do 4 at once). Each comparison will be a greater-than, with the key that has the lower “original” index as the left operand and the key with the higher index as the right operand. If the left value is greater than the right, its output index will be incremented by one; otherwise (the right value is greater than or equal to the left), the right value’s output index will be incremented.

The first round looks like this:

key[ 0 ] > key[ 1 ]
key[ 1 ] > key[ 2 ]
key[ 2 ] > key[ 3 ]
key[ 0 ] > key[ 3 ]  

The second round looks like this:

key[ 0 ] > key[ 2 ]
key[ 1 ] > key[ 3 ]
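
For reference, here’s a scalar sketch of the same counting scheme (purely illustrative, not part of the SSE implementation). Note that across the two rounds every pair of keys is compared exactly once:

// Scalar sketch of the index generation scheme described above (illustrative only).
void StableSortIndices4( const float keys[ 4 ], int indices[ 4 ] )
{
    for ( int i = 0; i < 4; ++i )
    {
        indices[ i ] = 0;
    }

    // Every pair is compared once; the earlier key is always the left operand.
    for ( int left = 0; left < 4; ++left )
    {
        for ( int right = left + 1; right < 4; ++right )
        {
            if ( keys[ left ] > keys[ right ] )
            {
                ++indices[ left ];   // the left key loses
            }
            else
            {
                ++indices[ right ];  // the right key loses, so ties go against the later key
            }
        }
    }
}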

The Implementation

The total implementation (for the actual sorting algorithm, not counting loading the input, storing the output indices or loading constants, because they are assumed to be done out of loop) comes in at 14 operations and only requires SSE2. The operation counts are:

  • 2 adds
  • 2 ands
  • 1 and-not
  • 5 shuffles
  • 2 greater-than compares
  • 2 xors

We’ll explain the implementation step by step…

First, we load the constants. My assumption for the algorithm is that these will be done out of loop and be kept in registers (which will be the case for my implementation, but may not always be true):

__m128i flip1 = _mm_set_epi32( 0xFFFFFFFF, 0, 0, 0 );
__m128i flip2 = _mm_set_epi32( 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 );
__m128i mask1 = _mm_set1_epi32( 1 );

Now for the actual sorting algorithm itself. First we shuffle the input to setup the first round of comparisons:

__m128 left1  = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 0, 2, 1, 0 ) );
__m128 right1 = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 3, 3, 2, 1 ) );

After that, we compare the first values and increment the indices of the losers. Note, there are 8 possible losers in the set (and 4 actual losers), so we will have two 4 element registers as the output for this step (with each element either being 1 or 0). Because there are instances where we reference the same key twice (on both sides) in this set of comparisons, and we want to do the increments for the “losing” keys in parallel (where we can’t add 2 source elements from the register into one destination), we use an xor to flip the bits of the last element, effectively swapping the potential loser at element 3 with the potential loser at element 0. This also avoids a shuffle for the first set of loser increments, because they are already arranged in the right order. Note that the losers on one side of the comparison match up with the winners on the other (we just need to shuffle them to the appropriate places).

__m128i tournament1 = _mm_castps_si128( _mm_cmpgt_ps( left1, right1 ) ); 
__m128i losers1     = _mm_xor_si128( tournament1, flip1 );
__m128i winners2    = _mm_shuffle_epi32( losers1, _MM_SHUFFLE( 2, 1, 0, 3 ) );

Because our comparisons result in 0xFFFFFFFF instead of 1, we mask out everything but the bottom bit (because we only want to increment by 1). We convert the winners to losers (with a complement) and mask out the bits in a single step using the and-not intrinsic/instruction.

__m128i maskedLosers1 = _mm_and_si128( losers1, mask1 );
__m128i maskedLosers2 = _mm_andnot_si128( winners2, mask1 );

Next, we need to do the second round of comparisons. This time we only have 2 comparisons and 4 possible losers (of which 2 will actually be losers). Luckily, the four possible losers are each one of the four keys. What we’re going to do is shuffle the operands into place, double up on the comparisons (doing each one twice, but with all the results in one register), then flip the results for the top two, before masking the values to 1.

__m128  left2         = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 1, 0, 1, 0 ) );
__m128  right2        = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 3, 2, 3, 2 ) );
__m128i tournament2   = _mm_castps_si128( _mm_cmpgt_ps( left2, right2 ) );
__m128i losers3       = _mm_xor_si128( tournament2, flip2 );
__m128i maskedLosers3 = _mm_and_si128( losers3, mask1 );

Finally, we will add all the masked losers together, like a failed superhero convention:

__m128i indices = _mm_add_epi32( _mm_add_epi32( maskedLosers1, maskedLosers2 ), maskedLosers3 );

Now if you want to, you can take 4 unsorted values (in the same order as the keys) and scatter them using these indices. Unfortunately SSE doesn’t have any gather/scatter functionality, so you’ll have to do that bit the old fashioned way.
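
For illustration, a minimal scalar scatter might look something like this (valuesIn standing in for whatever data is stored in the same order as the original keys):

int   destinations[ 4 ];
float sorted[ 4 ];

_mm_storeu_si128( reinterpret_cast< __m128i* >( destinations ), indices );

for ( int i = 0; i < 4; ++i )
{
    // valuesIn holds the values in the same order as the original keys.
    sorted[ destinations[ i ] ] = valuesIn[ i ];
}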

Have You Tested It?

Why yes! I have enumerated all possible ordering permutations with repetition (using the values 0.0, 1.0, 2.0 and 3.0). Early performance testing implies low single digit nanoseconds per sort.
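
The test itself is nothing fancy; it’s along the lines of the following sketch, where SortIndices4 is a hypothetical wrapper around the SSE code above that stores the resulting indices out to an array:

#include <algorithm>
#include <cassert>

// Hypothetical wrapper around the SSE routine above: loads the keys, runs the
// sort and stores the destination indices.
void SortIndices4( const float* keys, int* indices );

void TestAllCombinations()
{
    // 4 values, each one of 0.0 to 3.0, gives 256 combinations.
    for ( int combination = 0; combination < 256; ++combination )
    {
        float keys[ 4 ];

        for ( int i = 0; i < 4; ++i )
        {
            keys[ i ] = float( ( combination >> ( i * 2 ) ) & 3 );
        }

        int order[ 4 ] = { 0, 1, 2, 3 };

        // A scalar stable sort of the original positions gives the expected ranks.
        std::stable_sort( order, order + 4,
                          [&]( int a, int b ) { return keys[ a ] < keys[ b ]; } );

        int indices[ 4 ];

        SortIndices4( keys, indices );

        for ( int rank = 0; rank < 4; ++rank )
        {
            assert( indices[ order[ rank ] ] == rank );
        }
    }
}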

What about NaNs?

NaNs aren’t possible in my current use case, so I haven’t walked through all the possibilities yet.

Update, Shaving Off Two Instructions!

So, a small brainwave: because the compares come out as 0xFFFFFFFF, which is -1 if we treat them as integers, we can subtract them instead of adding them and lose the masking ands. We keep the and-not, because we need the complement (and it saves doing a negate/subtraction from zero). The final composite becomes:

__m128i indices = _mm_sub_epi32( _mm_sub_epi32( maskedLosers2, losers1 ), losers3 );

Removing the lines calculating maskedLosers1 and maskedLosers3, here is the actual implementation all together, at 12 instructions:

__m128  left1         = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 0, 2, 1, 0 ) );
__m128  right1        = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 3, 3, 2, 1 ) );
__m128i tournament1   = _mm_castps_si128( _mm_cmpgt_ps( left1, right1 ) ); 
__m128i losers1       = _mm_xor_si128( tournament1, flip1 );
__m128i winners2      = _mm_shuffle_epi32( losers1, _MM_SHUFFLE( 2, 1, 0, 3 ) );
__m128i maskedLosers2 = _mm_andnot_si128( winners2, mask1 );
__m128  left2         = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 1, 0, 1, 0 ) );
__m128  right2        = _mm_shuffle_ps( input, input, _MM_SHUFFLE( 3, 2, 3, 2 ) );
__m128i tournament2   = _mm_castps_si128( _mm_cmpgt_ps( left2, right2 ) );
__m128i losers3       = _mm_xor_si128( tournament2, flip2 );
__m128i indices       = _mm_sub_epi32( _mm_sub_epi32( maskedLosers2, losers1 ), losers3 );

Tags: SSE  Optimization 



SDFs and Hilbert Curves

07 August 2015

So, this is more of a “thought bubble” blog post than anything concrete yet, but I thought it might be worth some discussion, or at least putting the ideas out there because someone else might find them useful.

I’ve been thinking a lot lately about the potentials for combining traversals in a Hilbert curve ordering with regularly sampled SDFs (signed distance field, which is a sampling of the nearest distance to the surface of an object, with a sign derived from whether the sample is inside or outside the object). Using a Hilbert curve is already a well known technique for improving spatial locality for regularly sampled grids, but there are some properties that apply to SDFs with fairly accurate Euclidean distances that sit well with Hilbert curves. Note that what I’m talking about is mainly with 3D SDFs/Hilbert curves in mind, but it applies to 2D as well (as a side note for the pedant in me, if we’re talking higher dimensions, SDFs aren’t really a good idea).

The main property of the Hilbert ordering we’re interested in is that unlike other orderings, for example the Morton/Z-Curve, Hilbert provides an ordering where each consecutive grid point is rectilinearly adjacent to the previous one (only a single step in a single axis). This also happens to be the minimum distance possible to move between two grid points (for a grid that is uniformly spaced in each axis). A Hilbert curve isn’t the only ordering that provides this property, but it does provide great spatial locality as well.

There is a simple constraint on Euclidean distance functions: the magnitude of the difference in the distance function between two sample points can’t be greater than the distance between those sample points (in other words, the function is 1-Lipschitz). This property also holds for signed distance functions and it holds regardless of the magnitude of the distances.

Add these two little pieces together and there are some “tricks” you can pull.

One of the first is that if you traverse an SDF’s grid points in Hilbert order, you can delta encode the values and know the magnitude of the delta will never be greater than the distance between the two adjacent grid points. This means you can achieve a constant compression ratio that is lossless relative to the precision you need to represent the distance field globally (without getting into predictors or more advanced transforms). This also works for tiles, where you could represent each tile with one “real” distance value for the first point, along with deltas for the rest of the points, traversing the tile in Hilbert order. This would make the compressed SDF randomly accessible at a tile level (similar to video compression, where you have regular frames that don’t depend on any others, so you can reconstruct without decompressing the whole stream), although you would still potentially have to sample multiple tiles if you wanted to interpolate between points. If you wanted to go further, you could use more general segments of the curve instead of tiles.
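
As a rough sketch of the delta encoding idea (the Hilbert traversal order is assumed to be supplied and the names here are placeholders, not from any particular implementation): because consecutive points in the traversal are exactly one grid step apart, every quantized delta is bounded by the grid spacing divided by the quantization step.

#include <cmath>
#include <cstdint>
#include <vector>

// Delta encode SDF samples visited in Hilbert order. Because |d(p) - d(q)| can never
// exceed |p - q|, and consecutive Hilbert points are one grid step apart, every delta
// fits in a small fixed range; the first entry effectively stores the full value.
std::vector< int32_t > DeltaEncode( const std::vector< float >& sdf,
                                    const std::vector< uint32_t >& hilbertOrder, // grid indices in traversal order
                                    float quantizationStep )
{
    std::vector< int32_t > encoded;
    encoded.reserve( hilbertOrder.size() );

    int32_t previous = 0;

    for ( uint32_t gridIndex : hilbertOrder )
    {
        int32_t quantized = int32_t( std::lround( sdf[ gridIndex ] / quantizationStep ) );

        encoded.push_back( quantized - previous );

        previous = quantized;
    }

    return encoded;
}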

Another neat trick: if you’re building the SDF from base geometry using a search for the nearest piece of geometry at each point and you build traversing in Hilbert order, you can put a low upper bound on the distance you have to search. This is very useful for doing a search in spatial partitions and bounding volume hierarchies, or even probing spatial hashes. You also get the advantage of spatial locality, as you will be touching the same pieces of geometry/search structure from one point to the next, so they will likely be in your cache (and if you are transforming your geometry to another representation, such as a point based one, you can do it lazily). Also, you can split the work into tiles/segments and process them in parallel.

This is actually a topic close to my heart, as my first/primary task during my research traineeship at MERL around 15 years ago was investigating robustly turning not-so-well-behaved triangle meshes into a variant of SDFs (adaptively sampled distance fields, or ADFs), which is an area that seems to be drawing a lot of attention again now. This algorithm in particular is a neat new approach to this problem.




Adding Vertex Compression to Index Buffer Compression

28 April 2015

Index buffer compression is nice (and justifiable), but your average mesh tends to actually have some vertex data that you might want to compress as well. Today we’re going to focus on some simple and generic ways to compress vertex attributes, in the context of the previously discussed index buffer compression, to turn our fairly small and fast index buffer compression/decompression into more complete mesh compression.

In this case, we’re going to focus on a simple and fairly general implementation that favours simplicity, determinism and running performance over the best compression performance.

As with previous posts, the source code for this one is available on GitHub, but in this case I’ve forked the repository (to keep the original index buffer implementation separate for now) and it isn’t yet as polished as the index buffer compression repository.

Compressing Vertices

When it comes to compression of this kind of data, the general approach is to use a combination of quantization (trading precision for fewer bits) and taking advantage of coherence, using transforms/predictions that skew the probability distribution of the resulting quantities/residuals for entropy encoding. We’ll be using both here to achieve our goals.

Quantization

With this implementation we’re actually going to be a bit sneaky and shift the responsibility for quantization outside the main encoder/decoder. Part of the reason for this is that it allows downstream users to choose their own quantization scheme and implement custom pre-transforms (for example, for my test cases, I transformed normals from a 3 component unit vector to a more effective octahedron format). Another part of the reason is that some of the work can be shifted to the vertex shader.

The encoder is designed to take either 16 bit or 32 bit integers, where the values have already undergone quantization and had their range reduced (the maximum supported range is -2^29 to 2^29 - 1, which is limited by a mix of the prediction mechanism, which can have a residual range larger than the initial values, as well as the coding mechanism used). The compression after quantization is bit exact, so the exact same integers will come out the other side. This has some advantages if some of your attributes are indices that need to be exact, but are still compressible because connected vertices have similar values.

The first iteration of the mesh compression library doesn’t actually contain the quantization code yet, only the main encoder and decoder. For positions (and this would apply to texture coordinates as well), quantization was done by taking the floating point components and their bounding box, subtracting the minimum in each axis, then dividing by the largest range of any axis (to maintain aspect ratio) and then scaling and biasing into the correct range and rounding to the nearest integer.
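
As an illustration of that quantization pass (this lives outside the encoder/decoder, so the details are up to the integration; the function and parameter names here are just placeholders and it assumes a non-degenerate bounding box):

#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Quantize positions to [0, (1 << bits) - 1] per component, dividing every axis by
// the largest bounding box extent so the aspect ratio is maintained.
void QuantizePositions( const float* positions, size_t vertexCount, uint32_t bits, int32_t* quantized )
{
    float minimum[ 3 ] = {  FLT_MAX,  FLT_MAX,  FLT_MAX };
    float maximum[ 3 ] = { -FLT_MAX, -FLT_MAX, -FLT_MAX };

    for ( size_t vertex = 0; vertex < vertexCount; ++vertex )
    {
        for ( int axis = 0; axis < 3; ++axis )
        {
            minimum[ axis ] = std::min( minimum[ axis ], positions[ vertex * 3 + axis ] );
            maximum[ axis ] = std::max( maximum[ axis ], positions[ vertex * 3 + axis ] );
        }
    }

    float largestRange = std::max( { maximum[ 0 ] - minimum[ 0 ],
                                     maximum[ 1 ] - minimum[ 1 ],
                                     maximum[ 2 ] - minimum[ 2 ] } );

    float scale = float( ( uint32_t( 1 ) << bits ) - 1 ) / largestRange;

    for ( size_t vertex = 0; vertex < vertexCount; ++vertex )
    {
        for ( int axis = 0; axis < 3; ++axis )
        {
            float shifted = positions[ vertex * 3 + axis ] - minimum[ axis ];

            quantized[ vertex * 3 + axis ] = int32_t( std::lround( shifted * scale ) );
        }
    }
}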

Prediction

One of the unfortunate things about vertices is that in a general triangle mesh, they have very little in the way of implicit structuring that makes them compressible by themselves. When you are compressing audio/time series data, images, voxels or video, the samples are generally coherent and regular (in a hyper-grid), so you can apply transforms that take advantage of that combination of regular sampling and coherence.

There are ways we can impose structure on the vertex data (for example, re-ordering or clustering), but this would change the order of the vertices (the ordering of which we already rely on when compressing connectivity). As an aside, for an idea of what regularly structured geometry would entail, check out geometry images.

We do however have some explicit structuring, in terms of the connectivity information (the triangle indices), that we can use to imply relationships between the vertices. For example, we know that vertices on the same edge are likely close to each other (although not always), which could enable a simple delta prediction, as long as we always had a vertex that shared an edge decompressed beforehand.

Taking this further, if we have a triangle on an opposing edge to the vertex in question, we can use the parallelogram predictor (pictured below) introduced by Touma and Gotsman. If you have followed any of the discussions between Won Chun and myself on twitter, or have looked into mesh compression yourself, this probably won’t be new to you. The predictor takes the triangle adjacent to the edge opposing the vertex and extends it like a parallelogram.

The Parallelogram Predictor
1. The Parallelogram Predictor
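
In integer terms the predictor itself is tiny. A sketch per quantized component, where v1 and v2 are the vertices of the shared edge and v0 is the third vertex of the adjacent, already decoded triangle:

#include <cstdint>

// Parallelogram prediction per component: extend the adjacent triangle (v0, v1, v2)
// across its (v1, v2) edge and encode only the error against the prediction.
int32_t ParallelogramPredict( int32_t v0, int32_t v1, int32_t v2 )
{
    return v1 + v2 - v0;
}

int32_t Residual( int32_t actual, int32_t v0, int32_t v1, int32_t v2 )
{
    return actual - ParallelogramPredict( v0, v1, v2 );
}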

It turns out we can implement these predictors with only a small local amount of connectivity information and if we do them in the same pass as our index buffer encoding/decoding, this information is largely already there. If you recall the last post, most meshes had the vast majority of their triangles coming from an edge in the edge cache, so we can include the index to the third vertex of the triangle that edge came from in the FIFO and use that to perform parallelogram prediction.

This is fairly standard for mesh compression, as there is nearly always information used in the connectivity encoding that gives you a connected triangle. Some systems go a fair bit further, using full connectivity information and multiple passes to produce more accurate vertex predictions.

For the rarer cases where there is not an edge FIFO hit, we can use another vertex in the triangle for delta prediction. It turns out only one vertex case (the first vertex in a new-new-new triangle) does not have a vertex readily available to predict from. One of the advantages of this is that it keeps our compression working in the case where we don’t have many shared edges, albeit less efficiently.

One of the limitations of the parallelogram predictor is that the point it predicts is on the same plane as the triangle the prediction was based on. This means if your mesh has a large curve, there will be a reasonable amount of error. There are multiple ways to get around this problem; the original Touma and Gotsman paper used the 2 previously decoded edges closest to the direction of the edge adjacent to the vertex to be decoded, averaged the “crease” direction of those triangles, then applied this to move the prediction. Other papers have used estimates from other vertices in the neighborhood or more complex transform schemes. One paper transformed all the vertices to a basis space based on the predicting triangle, ran k-means to produce a limited set of prediction points, then encoded the index of the closest point along with the error. There are many options for improving the general prediction.

For my purposes though, I’m going to make an engineering decision to trade off a bit of compression for simplicity in the implementation and stick with the bare bones parallelogram predictor. There are a couple of reasons for this; firstly, it’s very easy to implement the bare bones predictor with simple integer arithmetic (deterministic, fast and exact). You can apply it to each component, without referencing any of the others, and it works passably well for things like texture coordinates and some normal encodings.

One of the advantages about working with a vertex cache optimized ordering is that the majority of connected vertices have been recently processed, meaning they are likely in the cache when we reference them for decompression.

Entropy Encoding

Now, entropy encoding for the vertex data is quite a different proposition to what we used for connectivity data. Firstly, we aren’t using a set quantization pass, so straight up the probability distribution of our input is going to vary quite a lot. Secondly, as the distance between vertices varies quite a lot between meshes, prediction errors will vary as well. Thirdly, even with good predictors on vertex data you can have quite large errors for some vertices. All of this means that using a static probability distribution for error residuals is not going to fly.

Let’s look at the number of occurrences for residuals just for the parallelogram predictor, for the bunny model vertex positions, at 14 bits a component (graph below). We’ve applied a ZigZag encoding, such as used in protocol buffers, to allow us to encode negative residuals as positive numbers. The highest residual is 1712 (although there are many residuals below that value that don’t occur at all) and the number of unique residuals that occur is 547.

The Error Residuals for Vertex Positions of the Bunny Mesh
2. The Error Residuals for Vertex Positions of the Bunny Mesh
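
The ZigZag mapping itself is only a couple of operations; a sketch for 32 bit values, matching the scheme protocol buffers use:

#include <cstdint>

// Interleave the values so 0, -1, 1, -2, 2... become 0, 1, 2, 3, 4...
inline uint32_t ZigZagEncode( int32_t value )
{
    return ( uint32_t( value ) << 1 ) ^ uint32_t( value >> 31 );
}

inline int32_t ZigZagDecode( uint32_t value )
{
    return int32_t( value >> 1 ) ^ -int32_t( value & 1 );
}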

A fairly standard approach here would be to calculate the error residuals in one pass, while also calculating a histogram, which we could then use to build a probability distribution for a traditional entropy encoder, encoding the vertices in a second pass. This introduces another pass in encoding (we’ve already potentially introduced one for quantization), as well as the requirement to buffer residuals. It also means storing a code table (or probabilities/frequencies) in the compression stream. For our connectivity compressor we used Huffman and a table driven decoder, so let’s look at how that approach would work here.

If we feed the above data into a Huffman code generator, we end up with the longest code being 17 bits, meaning we would need a table of 131072 entries (each of which would be larger than for our connectivity compression, because the residuals themselves are larger than a byte). That’s a bit steeper than our connectivity compression, but we could get the table size down using the approach outlined in Moffat and Turpin’s On the Implementation of Minimum Redundancy Prefix Codes paper, at the cost of an increase in complexity of the decoder. However, other models/quantization levels might require even larger tables.

But how would compression be? For just the position residuals for the parallelogram predicted vertices (which is 34634 of 34817 vertices) we would use 611,544 bits (not counting the code/frequency table, which would also be needed). That’s reasonable at 17.66 bits a vertex, or 5.89 bits a component (down from 14).

The compression is reasonable for this method (which we would expect, given its pedigree). But it brings with it some disadvantages: an extra encoding pass and buffering, storing a code table, extra complexity or a large decoding table, as well as some overhead in decoding to build the decoding table.

Let’s take a different approach.

Universal codes allow any integer to be encoded, but each code ends up with an implied probability correlated to how many bits it uses. You can use them for fairly general entropy encoding where you know the rank of a set of codes, but they usually don’t perform as well as Huffman coding, because the probability distribution of real data rarely matches that implied by the universal codes exactly.

Exponential-Golomb codes are a kind of universal code with some interesting properties. They combine an Elias gamma code representing the most significant bits of a value with a fixed number of the least significant bits (k) encoded directly. Also, I dare you to try and type “Golomb” without inserting a “u”. The first part of the code is a unary code that will have quite a limited number of bits for our range of values, which we can decode with a bit scan instruction instead of a table lookup, so we can build a fast decoder without running into memory problems (or cache miss problems).

Now, it turns out this is a good model for the case where you have a certain number of low bits with an approximately uniform probability distribution (so they are quite noisy), with the bits above the k’th bit being increasingly unlikely to be switched on. The shortest codes for a particular value are produced when the most significant set bit also happens to be the top bit of a k-bit integer.

So, if we can correctly estimate what k should be for a particular component residual at a particular time, then we can get the best bit length encoding for that value. If we make the (not always correct) assumption that the traversal order of vertices means that similar vertices will be close together, and that the residuals for each component are probably going to be similar in magnitude, then a moving average based on the optimal k for the previous values might be a reasonable way to estimate an optimal k (a kind of adaptive scheme, instead of a fixed code book).

I decided to go with an exponential moving average, because it requires very little state, no lookbacks and we can calculate certain exponential moving averages easily/quickly in fixed point (which we really want, for determinism, as we want to be able to reproduce the exact same k on decompression). After some experimentation and keeping in mind what we can calculate with shifts, I decided to go with an alpha of 0.125. In 16/16 fixed point, the moving average looks like this (where kEstimate is the k that would give the best encoding for the current value):

k = ( k * 7 + ( kEstimate << 16 ) ) >> 3;

Also, after a bit of experimentation, I went with a slightly simplified version of exponential-Golomb coding that has a slightly better probability distribution (for our purposes) and is slightly faster to decode. Our code has a unary code starting from the least significant bits, where the number of zeros encodes the number of bits above k that our residual will have. If that number is 0, what follows will be k bits (representing the residual, padded out with zeros in the most significant bits to k bits). If it is more than zero, what follows will be the residual without its most significant bit (which is implied by the position we can calculate from the unary code).

Below are simplified versions of the encoding and decoding functions for this universal code. Note that we return the number of bits from writing the value, so we can feed it into the moving average as our estimate for k. The DecodeUniversal in this case uses the MSVC intrinsic for the bit scan, but in our actual implementation we also support the __builtin_ctz intrinsic from gcc/clang and a portable fallback (although, I need to test these platforms properly).

inline uint32_t WriteBitstream::WriteUniversal( uint32_t value, uint32_t k )
{
    uint32_t bits = Log2( ( value << 1 ) | 1 );

    if ( bits <= k )
    {
        Write( 1, 1 );
        Write( value, k );
    }
    else
    {
        uint32_t bitsMinusK = bits - k;

        Write( uint32_t( 1 ) << bitsMinusK, bitsMinusK + 1 );
        Write( value & ~( uint32_t( 1 ) << ( bits - 1 ) ), bits - 1 );
    }

    return bits;
}

inline uint32_t ReadBitstream::DecodeUniversal( uint32_t k )
{
    if (m_bitsLeft < 32)
    {
        uint64_t intermediateBitBuffer = 
            *( reinterpret_cast< const uint32_t* >( m_cursor ) );
        
        m_bitBuffer |= intermediateBitBuffer << m_bitsLeft;

        m_bitsLeft  += 32;
        m_cursor    += 4;
    }

    unsigned long leadingBitCount;

    _BitScanForward( &leadingBitCount, 
                     static_cast< unsigned long >( m_bitBuffer ) );

    uint32_t topBitPlus1Count = leadingBitCount + 1;

    m_bitBuffer >>= topBitPlus1Count;
    m_bitsLeft   -= topBitPlus1Count;

    uint32_t leadingBitCountNotZero = leadingBitCount != 0;
    uint32_t bitLength              = k + leadingBitCount;
    uint32_t bitsToRead             = bitLength - leadingBitCountNotZero;

    return Read( bitsToRead ) | ( leadingBitCountNotZero << bitsToRead );
}

There are a few more computations here than in a table driven Huffman decoder, but we get away without the table look-up memory access and the cache miss that could bring for larger tables. I am a fan of the adage “take me down to cache miss city where computation is cheap but performance is shitty”, so this sits a bit better with my performance intuitions for potentially unknown future usage.

But what sort of compression do we get? It turns out for this particular case, although definitely not all cases, our adaptive universal scheme actually outperforms Huffman, with a total of 604,367 bits (17.45 bits a vertex or 5.82 bits a component). You might ask how, given Huffman produces optimal codes (for whole bit codes), we managed to beat it; the answer is that less optimal adaptive coding can beat better codes built on a fixed probability distribution. In some cases it performs a few percent worse, but overall it’s competitive enough that I believe the trade offs for encoder/decoder simplicity and performance are worth it.

Encoding Normals

For encoding vertex normals I use the precise Octahedral encoding described in A Survey of Efficient Representations for Unit Vectors. This encoding works relatively well with parallelogram prediction, but has the problem of strange wrapping on the edges of the mapping, which makes prediction fail at the borders, meaning vertices on triangles that straddle those edges can end up with high residual values. However, the encoding itself is good enough it can stand up to a reasonable amount of quantization.

Because this encoding is done in the quantization pass (outside of the main encoder/decoder), it’s pretty easy to drop in other schemes of your own making.

Results

For now, I’m going to stick to just encoding positions and vertex normals (apart from the Armadillo mesh, which doesn’t have normals included), because the meshes we’ve been using for results up until this point don’t have texture coordinates. At some stage I’ll see about producing some figures for meshes with texture coordinates (and maybe some other vertex attributes).

First let’s look at the compression results for vertex positions at 10 to 16 bits per component. Results are given in average bits per component; to get the amount of storage for the final mesh, multiply the value by the number of vertices times 3. Note that the number for 14 bit components is slightly higher than that given earlier, because this includes all cases, not just those using parallelogram prediction.

MODEL | VERTICES | 10 bit | 11 bit | 12 bit | 13 bit | 14 bit | 15 bit | 16 bit
Bunny | 34,817 | 2.698 | 3.280 | 4.049 | 4.911 | 5.842 | 6.829 | 7.828
Armadillo | 172,974 | 2.231 | 2.927 | 3.765 | 4.553 | 5.466 | 6.290 | 7.209
Dragon | 438,976 | 2.640 | 3.438 | 4.297 | 5.292 | 6.306 | 7.327 | 8.351
Buddha | 549,409 | 2.386 | 3.123 | 4.037 | 5.014 | 6.017 | 7.032 | 8.055

Now let’s look at normals for the bunny, dragon and buddha, from 8 to 12 bits a component, using the precise octahedral representation (2 components). Note that the large prediction errors for straddling edges cause these to be quite a bit less efficient than positions, although normals do not predict as well with the parallelogram predictor in general.

MODEL | VERTICES | 8 bit | 9 bit | 10 bit | 11 bit | 12 bit
Bunny | 34,817 | 4.840 | 5.817 | 6.813 | 7.814 | 8.816
Dragon | 438,976 | 4.291 | 5.255 | 6.250 | 7.251 | 8.252
Buddha | 549,409 | 4.664 | 5.632 | 6.627 | 7.627 | 8.628

Now, let’s look at decompression speeds. Here are the speeds for the Huffman table driven index buffer compression alone. Note, there have been one or two optimisations since last time that got the speed closer to the fixed width code version.

MODEL | TRIANGLES | UNCOMPRESSED (BYTES) | COMPRESSED (BYTES) | COMPRESSION TIME | DECOMPRESSION TIME (MIN) | DECOMPRESSION TIME (AVG) | DECOMPRESSION TIME (MAX) | DECOMPRESSION TIME (STD-DEV)
Bunny | 69,630 | 835,560 | 70,422 | 2.71ms | 0.551ms | 0.578ms | 0.623ms | 0.030ms
Dragon | 871,306 | 10,455,672 | 875,771 | 31.7ms | 7.18ms | 7.22ms | 7.46ms | 0.060ms
Buddha | 1,087,474 | 13,049,688 | 1,095,804 | 38.9ms | 8.98ms | 9.02ms | 9.25ms | 0.058ms

Here are the figures where we use quantization to reduce positions down to 14 bits a component, with normals encoded as described above at 10 bits a component, then run them through our compressor. Timings and compressed size include connectivity information, but timings don’t include the quantization pass. Uncompressed figures are for 32 bit floats for all components, with 3 component vector normals.

MODEL | TRIANGLES | UNCOMPRESSED (BYTES) | COMPRESSED (BYTES) | COMPRESSION TIME | DECOMPRESSION TIME (MIN) | DECOMPRESSION TIME (AVG) | DECOMPRESSION TIME (MAX) | DECOMPRESSION TIME (STD-DEV)
Bunny | 69,630 | 1,671,168 | 206,014 | 5.20ms | 1.693ms | 1.697ms | 1.795ms | 0.015ms
Dragon | 871,306 | 20,991,096 | 2,598,507 | 62.87ms | 21.25ms | 21.29ms | 21.36ms | 0.015ms
Buddha | 1,087,474 | 26,235,504 | 3,247,973 | 79.75ms | 26.83ms | 27.00ms | 28.79ms | 0.44ms

Also, here’s an interactive demonstration that shows the relative quality of a mesh that has been round-tripped through compression with 14 bit component positions/10 bit component normals as outlined above (warning, these meshes are downloaded uncompressed, so they are several megabytes).

At the moment, I think there is still some room for improvement, both compression wise and performance wise. However, I do think it’s close enough in both ballparks to spark some discussion of its viability. One of the things to note is that the compression functions still provide the vertex re-mapping, so it is possible to store vertex attributes outside of the compressed stream.

It should be noted, the connectivity compression doesn’t support degenerate triangles (a triangle with two or more of the same vertex indices), which is still a limitation. As these triangles don’t add anything visible to a mesh, it is usually alright to remove them, but there are some techniques that might take advantage of them.

A Quick Word on an Interesting Variant

I haven’t investigated this thoroughly, but there is an interesting variant of the compression that gets rid of the free vertex (vertex cache miss) cases and treats them the same as a new vertex case. This would mean encoding some vertices twice, but as the decoding is bit exact the end result would be the same. Because these vertex reads are less frequent than their new vertex cousins (less than a third as many in the bunny), the overhead would be relatively small and the result still a fair bit smaller than the uncompressed one (and you would gain something back on the improved connectivity compression). However, it would mean that you could decode vertices only looking at those referenced by the FIFOs (which would be whole triangles in the case of the edge FIFOs) and that essentially you could reduce processing/decompression to a small fixed size decompression state, always reading forward from the compressed stream.

The requirement for large amounts of state (of potentially unknown size), as well as large tables for Huffman decoding, is part of what has made hardware support for mesh decompression a non-starter in the past.

Wrapping Up

There are a lot of details that didn’t make it into this post, due to time poverty, so if you are wondering about anything in particular, leave a comment or contact me on twitter and I will try to fill in the blanks. I was hoping to get more into the statistics behind some of the decisions made (including some interactive 3D graphs made with R/rgl that I have been itching to use), but unfortunately I ran out of time.


Tags: Graphics  Compression 



Cramming Entropy Encoding into Index Buffer Compression

12 March 2015

I was never really happy with the compression ratio achieved in the Vertex Cache Optimised Index Buffer Compression algorithm, including the improved version, even though the speed was good (if you haven’t read these posts, this one might not make sense in places). One of the obvious strategies to improve the compression was to add some form of entropy encoding. The challenge was to get a decent boost in compression ratio while keeping decoding and encoding speed/memory usage reasonable.

Index buffers may not seem at first glance to be that big of a deal in terms of size, but they can actually be quite a large proportion of a mesh, especially for 32 bit indices. This is because on average there are approximately 2 triangles for each vertex in a triangle mesh and each triangle takes up 12 bytes.

Note, for the remainder of this post I’ll be basing the work on the second, triangle code based algorithm, which currently does not handle degenerate triangles (triangles where 2 vertices are the same). It is not impossible to modify this algorithm to support degenerates, but that will be for when I have some more spare time.

Probability Distributions

From previous work, I had a hunch about what the probability distribution for the triangle codes was, with the edge/new vertex code being the most likely and the edge/cache code being almost as frequent. During testing, collected statistics showed this was indeed the case. In fact, for the vertex cache optimised meshes tested, over 98% of the triangle codes were one of the edge codes (including edge/free vertex, but in a much lower proportion).

Entropy encoding the triangle codes alone wouldn’t be enough to make much of a difference (given the triangle code is only 4 bits and the average triangle about 12.5 bits in the current algorithm), without reducing the amount of space used by the edge and vertex cache FIFO index entries. Luckily, Tom Forsyth’s vertex cache optimisation algorithm (explained here) is designed to take advantage of multiple potential cache sizes (32 entries or smaller), meaning it biases towards more recent cache entries (as that will cause a hit in both the smaller and larger caches).

The probability distributions (charted below, averaged over a variety of vertex cache optimised meshes) are quite interesting. Both show a relatively geometric distribution with some interesting properties: Vertex FIFO index 0 is a special case, largely due to it being the most common vertex in edge cache hits. Also, the 2 edges usually going into the edge cache for a triangle (the other edge is usually from the cache already) share very similar probabilities, making the distribution step down after every second entry. Overall these distributions are good news for entropy encoding, because all of our most common cases are considerably more frequent than our less common ones.

Edge FIFO Index Probability Distribution
1. Edge FIFO Index Probability Distribution

Vertex FIFO Index Probability Distribution
2. Vertex FIFO Index Probability Distribution

The probability distributions were similar enough between meshes optimised with Tom’s algorithm that I decided to stick with a static set of probability distributions for all meshes. This has some distinct advantages; there is no code table overhead in the compression stream (important for small meshes) and we can encode the stream in a single pass (we don’t have to build a probability distribution before encoding).

One of the other things that I decided to do was drop the Edge 0/New Vertex and Edge 1/New Vertex codes, because it was fairly clear from the probability distributions that these codes would actually end up decreasing the compression ratio later on.

Choosing a Coding Method

Given all of this, it was time to decide on an entropy encoding method. There are a few criteria I had in picking an entropy encoding scheme:

  1. The performance impact on decoding needed to be small.
  2. The performance impact on encoding reasonable enough that on-the-fly encoding was still plausible.
  3. Low fixed overheads, so that small meshes were not relatively expensive.
  4. Keep the ability to stream out individual triangles (this isn’t implemented, but is not a stretch).
  5. Low state overhead (for the above).
  6. The ability to switch alphabets/probability distributions multiple times for a single triangle (required because we have multiple different codes per triangle).
  7. Single pass for both encoding and decoding (i.e. not having to buffer up intermediates).

Traditionally the tool for a job like this would be a prefix coder, using Huffman codes. After considering the alternatives, this still looked like the best compromise, with ANS requiring the reverse traversal (meaning two passes for encoding) and other arithmetic and range encoding variants being too processor intensive/a pain to mix alphabets. Another coding considered was a word aligned coding that used the top bits of a word (32 or 64 bits) to specify the bit width of codes put into the rest of the word. This coding is simple, but would require some extra buffering on encoding.

After generating the codes for the probability distributions above, I found the maximum code-lengths for the triangle codes were 11 bits. One of the things about the triangle codes was that there were a fair few codes that were rare enough that they were essentially noise, so I decided to smooth the probability distribution down a bit at the tail (a kind of manual length limiting, although it’s perfectly possible to generate length limited Huffman codes), which reduced the maximum code lengths to 7 bits, with the 3 most common cases taking up 1, 2 or 3 bits respectively (edge/new, edge/cached, edge/free).

The edge FIFO indices came in at a maximum code length of 11 bits (with the most common 9 codes coming in between 2 and 5 bits) and the vertex FIFO indices came in with a maximum length of 8 bits (with the 7 most common codes coming in between 1 and 5 bits).

Implementing the Encoder/Decoder

Because the probability distributions are static, for the encoder we can just embed static/constant tables, mapping symbols to codes and code-lengths. These could then be output using the same bitstream code used by the old algorithm.

For the decoder, I decided to go with a purely table driven approach as well (it’s possible to implement hybrid decoders that are partially table driven, with fall-backs, but that would’ve complicated the code more than I would’ve liked). This method uses a table large enough to be indexed by the largest code and enumerates all the bit-pattern variants for each prefix code. This was an option because the maximum code lengths for the various alphabets were quite small, meaning that the largest table would have 2048 entries (and the smallest, 128). These tables are embedded in the code and are immutable.

To cut down on table space as much as possible (to be a good cache using citizen), each table entry contained two 1 byte values (the original symbol and the length of the code). In total, this means 2432 decoding table entries (under 5K memory use).

struct PrefixCodeTableEntry
{
    uint8_t original;
    uint8_t codeLength;
};
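
For illustration, here’s a sketch of how a full table like that can be filled from symbols, codes and code lengths (not the actual generation code from the repository): each code is replicated across every possible combination of the bits above it, so a single masked lookup always lands on the right entry.

#include <cstdint>

// Fill a full-size decoding table. Codes are stored with their first bit in the
// least significant position, so every index whose low bits match a code gets
// that code's entry.
void BuildDecodingTable( const uint32_t* codes,       // prefix code for each symbol, LSB first
                         const uint8_t* codeLengths,  // length of each code in bits
                         uint32_t symbolCount,
                         uint32_t maxCodeSize,
                         PrefixCodeTableEntry* table ) // 1 << maxCodeSize entries
{
    for ( uint32_t symbol = 0; symbol < symbolCount; ++symbol )
    {
        uint32_t length = codeLengths[ symbol ];
        uint32_t stride = uint32_t( 1 ) << length;
        uint32_t count  = uint32_t( 1 ) << maxCodeSize;

        for ( uint32_t index = codes[ symbol ]; index < count; index += stride )
        {
            table[ index ].original   = uint8_t( symbol );
            table[ index ].codeLength = uint8_t( length );
        }
    }
}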

Symbols were encoded with the first bit in the prefix starting in the least significant bit. This made the actual table decoding code very simple and neat. The decoding code was integrated directly with the reading of the bit-stream, which made mixing tables for decoding different alphabets (as well as the var-int encodings for free vertices) far easier than if each decoder maintained its own bit-stream state.

Here is a simplified version of the decoding code:

inline uint32_t ReadBitstream::Decode( const PrefixCodeTableEntry* table, uint32_t maxCodeSize )
{
    if ( m_bitsLeft < maxCodeSize )
    {
        uint64_t intermediateBitBuffer = *(const uint32_t*)m_cursor;
        
        m_bitBuffer |= intermediateBitBuffer << m_bitsLeft;
        m_bitsLeft  += 32;
        m_cursor    += sizeof(uint32_t);
    }
    
    uint64_t                    mask       = ( uint64_t( 1 ) << maxCodeSize ) - 1;
    const PrefixCodeTableEntry& codeEntry  = table[ m_bitBuffer & mask ];
    uint32_t                    codeLength = codeEntry.codeLength;

    m_bitBuffer >>= codeLength;
    m_bitsLeft   -= codeLength;

    return codeEntry.original;
}

Because m_bitBuffer is 64 bits wide and the highest maxCodeSize in our case is 11 bits, we can always add another 32 bits at a time into the buffer without overflowing when we have less than the maximum code size left. This guarantees that we will always have at least the maximum code length of bits in the buffer before we try to read from the table. I’ve used a 32 bit read from the underlying data here (which brings up endian issues etc), but the actual implementation only does this as a special case for x86/x64 currently (manually constructing the intermediate byte by byte otherwise).

Because we are always shifting codeLength bits out of the buffer afterwards, the least significant bits of the bit buffer will always contain the next code when we do the table read. The mask is to make sure we only use the maximum code length of bits to index the decoding table. With inlining (which we get a bit heavy handed with forcing in the implementation), the calculation of the mask should collapse down to a constant.

Results

Since last time there have been some bug fixes and some performance improvements for the triangle code algorithm (and I’ve started compiling in x64, which I should’ve been doing all along), so I’ll present the improved version as a baseline (all results are on my i7 4790K on Windows, compiled with Visual Studio 2013). Here we use 32 runs to get a value:

MODEL | TRIANGLES | UNCOMPRESSED (BYTES) | COMPRESSED (BYTES) | COMPRESSION TIME | DECOMPRESSION TIME (MIN) | DECOMPRESSION TIME (AVG) | DECOMPRESSION TIME (MAX) | DECOMPRESSION TIME (STD-DEV)
Bunny | 69,630 | 835,560 | 108,908 | 2.42ms | 0.552ms | 0.558ms | 0.599ms | 0.0086ms
Armadillo | 345,944 | 4,151,328 | 547,781 | 12.2ms | 2.81ms | 2.82ms | 2.86ms | 0.0103ms
Dragon | 871,306 | 10,455,672 | 1,363,265 | 35.2ms | 7.22ms | 7.27ms | 7.51ms | 0.0724ms
Buddha | 1,087,474 | 13,049,688 | 1,703,928 | 38.8ms | 9.01ms | 9.04ms | 9.176ms | 0.0304ms

So, our baseline is a fair bit faster than before. Throughput is currently sitting over 122 million triangles a second (on a single core), which for 32 bit indices represents a decoding speed of ~1,400 MiB/s and about 12.5 bits a triangle. Here’s the new entropy encoding variant:

MODEL | TRIANGLES | UNCOMPRESSED (BYTES) | COMPRESSED (BYTES) | COMPRESSION TIME | DECOMPRESSION TIME (MIN) | DECOMPRESSION TIME (AVG) | DECOMPRESSION TIME (MAX) | DECOMPRESSION TIME (STD-DEV)
Bunny | 69,630 | 835,560 | 70,422 | 2.52ms | 0.827ms | 0.828ms | 0.844ms | 0.0029ms
Armadillo | 345,944 | 4,151,328 | 354,843 | 12.1ms | 4.15ms | 4.17ms | 4.27ms | 0.0303ms
Dragon | 871,306 | 10,455,672 | 875,771 | 31.8ms | 10.6ms | 10.64ms | 10.82ms | 0.0497ms
Buddha | 1,087,474 | 13,049,688 | 1,095,804 | 38.7ms | 13.25ms | 13.28ms | 13.788ms | 0.0687ms

The new algorithm has throughput of a bit over 82 million triangles a second (so about 67% of the throughput, ~940 MiB/s), but gets very close to 8 bits a triangle (so about 64% of the size). There is virtually no overhead in the encoding time (it’s even slightly faster at times), probably due to fewer bits going into the bitstream. Considering our most common case in the old algorithm (edge/new vertex) was 9 bits, this is a considerable gain compression wise.

Conclusion

Overall, this algorithm is still not near the best index compression algorithms in terms of compression ratio (which get less than a bit a triangle, using adaptive arithmetic encoding), however it is fast, relatively simple (I don’t think an implementation in hardware would be impossible), requires very little in the way of state overhead/supporting data-structures/connectivity information and allows arbitrary topologies/triangle orderings (other algorithms often change the order of the triangles). Although, it is obviously focused on vertex cache optimised meshes, which happens to be the triangle ordering people want meshes in. The next step is probably to add vertex compression and create a full mesh compression algorithm, as well as handling degenerates with this algorithm.

As usual, the code for this implementation can be seen on GitHub.


Tags: Graphics  Compression 



Introducing Bitables

01 January 2015

An Introduction to Bitables and the Bitable Library

Bitables are a storage oriented data structure designed to exploit the benefits of OS optimization to provide a low overhead mechanism for fast range queries and ordered scans on an immutable key-value data-set. The immutability means they can be queried from multiple threads without locking. They’re also designed to be written out fast, using sequential writes, even with large amounts of keys and data. You can think of them as a hybrid between an SSTable and a b+ tree (although, currently they do not allow duplicate keys).

You can call this a thought experiment, because while I can imagine lots of different use cases for bitables, I have not yet had time to adequately explore them. However, preliminary benchmarks indicate that writing out a bitable is fast enough that they can be used for relatively dynamic structures, not just static ones. Some potential uses would be:

  • A Lucene-like inverted index (with segments and merging), when combined with something like roaring bitmaps.
  • A de-amortized, log structured, COLA-like structure (with bitables at deeper levels and the higher levels stored in memory).
  • A static index where range or prefix searches are required.

Merging bitables is quite a quick operation, so they are suitable in many places where large SSTables would be used.

The basic premise is that bitables are built by adding key/value pairs in key sorted order (like an SSTable), but they progressively build up a hierarchical index during this process (using very little in the way of extra resources to do so). Note that the library assumes you are appending keys in already sorted order. Writes of the hierarchical index are amortized across many keys and the operating system is left to buffer them up (the index being substantially smaller than the main data) to write as a batch.

There is a bitable library implementation, written in C (although the included example is C++). This library serves as a relatively simple reference implementation, although it is designed to be usable for real life use. The library has both posix and win32 implementations, although it has currently only been tested and built on Windows and Linux (Ubuntu). I don’t currently claim that the library is production ready, as it hasn’t been used fully in anger yet.

Here is an example of using the library to search for the lower bound of a range in a bitable, then scan forwards, reading 100 keys and values:

BitableCursor cursor;
BitableResult searchResult = bitable_find( &cursor, table, &searchKey, BFO_LOWER );

for ( int where = 0; 
      searchResult == BR_SUCCESS && where < 100; 
      searchResult = bitable_next( &cursor, table ), ++where )  
{
	BitableValue key;
	BitableValue value;

	bitable_key_value_pair( &cursor, &key, &value );

	/* Do something here with the key and value */   
}

How does it work?

Bitables consist of multiple levels:

  • The leaf level, which is the main ordered store for keys and small values.
  • The branch level(s), which are a hierarchical sorted index into the leaf levels.
  • The large value store, which stores values too large to fit in the leaf level, in sorted order.

Currently each level is stored in a separate file. When building the bitable, each file is only ever appended to (except when we write a final header page on completion), so writes to the individual files are sequential. The operating system is left to actually write these files to disk as it chooses.

The leaf and branch levels are organised into pages. Within each page, there is a sorted set of keys (and values for leaf pages), which can be searched using a binary search.

Building the bitable uses an algorithm similar to a bulk insertion into a b+ tree. Keys and small values are added to a leaf page (buffered in memory), as they are appended and when the page is full, the buffer is appended to the file and the next page begun. When a new page is begun, the first key is added to the level above it. New levels are created when the highest current level grows to more than one page.
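
As an illustrative sketch of that process (hypothetical and heavily simplified, not the library’s API or its actual buffering):

#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified shape of the bulk build: levels[ 0 ] is the leaf page currently being
// filled and higher entries are the branch pages above it.
void AppendToLevel( std::vector< std::vector< uint8_t > >& levels, size_t level,
                    const uint8_t* key, size_t keySize,
                    const uint8_t* value, size_t valueSize, size_t pageSize )
{
    if ( level >= levels.size() )
    {
        levels.emplace_back(); // a new, higher branch level is created on demand
    }

    if ( levels[ level ].size() + keySize + valueSize > pageSize )
    {
        // Append the full page to this level's file sequentially (elided) and begin
        // the next page. The first key of the new page is pushed into the branch
        // level above, which is how the hierarchical index builds up.
        levels[ level ].clear();

        AppendToLevel( levels, level + 1, key, keySize, nullptr, 0, pageSize );
    }

    std::vector< uint8_t >& page = levels[ level ]; // re-acquire, the recursion may grow the vector

    page.insert( page.end(), key, key + keySize );

    if ( valueSize > 0 )
    {
        page.insert( page.end(), value, value + valueSize );
    }
}

Appending the whole data set in key order this way means each level’s file only ever grows forward, which is where the sequential write behaviour comes from.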

Similar to a b+ tree, the first key in each branch level page is implicit and not stored. Unlike a b+ tree, branch level pages do not have a child pointer for every key, but only to the first child page. Because the pages in the level below are in order, we can work out the index to the page based on the result of the binary search.
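
To make that index arithmetic concrete, here’s a hypothetical, heavily simplified sketch of descending the branch levels (the real library works on BitableValue keys through memory mapped pages; fixed 64 bit keys are used here purely for illustration):

#include <cstdint>

// Hypothetical, simplified view of a branch page: the sorted keys stored in the
// page (the first key of the page's range is implicit) and the index of its first
// child page in the level below.
struct BranchPage
{
    const int64_t* keys;
    uint32_t       keyCount;
    uint64_t       firstChildPage; // child pages are contiguous and in order
};

// Returns how many stored keys are <= searchKey, which is exactly the offset of the
// child page to descend into (0 means "before the first stored key").
static uint32_t ChildOffset( const BranchPage& page, int64_t searchKey )
{
    uint32_t low  = 0;
    uint32_t high = page.keyCount;

    while ( low < high )
    {
        uint32_t middle = ( low + high ) / 2;

        if ( page.keys[ middle ] <= searchKey )
        {
            low = middle + 1;
        }
        else
        {
            high = middle;
        }
    }

    return low;
}

uint64_t FindLeafPage( const BranchPage* const* branchLevels, uint32_t branchLevelCount, int64_t searchKey )
{
    uint64_t pageIndex = 0; // the highest branch level is always a single page

    // Walk from the highest (smallest) branch level down to the leaf level.
    for ( uint32_t level = branchLevelCount; level > 0; --level )
    {
        const BranchPage& page = branchLevels[ level - 1 ][ pageIndex ];

        pageIndex = page.firstChildPage + ChildOffset( page, searchKey );
    }

    return pageIndex; // the leaf page to binary search for the key itself
}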

Because we don’t have to worry about future inserts or page splitting, every page is as full as possible given the ordering, which should lead to the minimum possible number of levels for the structure (within constraints). This keeps the number of pages needing to be touched for any one query at a minimum, reducing I/O, page faults, TLB misses and increasing the efficiency of the page cache.

The implementation of bitables uses a trick that many b+ tree implementations use: offsets to keys (and possibly values) are stored at the front of each page and the key/value data at the back. This allows each page to be built up quickly, growing inwards from both ends until it is full. However, it has some disadvantages for processor cache access patterns when searching (binary searches over data in sorted order are not very cache friendly). In future, for search heavy workloads, it might be reasonable to give branch levels an internal Van Emde Boas layout.

The Implementation

The implementation is available on GitHub and the documentation is available here. A premake4 file to create a build is included (and the Windows premake4 binary I used). On Ubuntu, I used the premake4 package (from apt-get). Let me know if there are any issues, or any other platforms you have tried it on.

The reference implementation relies on the operating system where possible (memory mapped files, the page cache) to do the heavy lifting for things like caching and ordering writes in a smart way. It is a bit naive in that the buffers it uses for writing out bitables are only one page in size (per level), so the number of system calls is probably higher than it needs to be.

Reading the bitable uses a zero copy memory mapping system, where keys and values come from pointers directly into the file (the mapping uses read only page protection, so it shouldn’t be possible to hose the data). Power of 2 alignments are supported for keys and values, specified when initializing the table for creation, so you can cast and access data structures in place. Currently omitting key or value sizes for fixed size keys/values is not supported, but this seems like a good future optimisation for many use cases.

Apart from initializing and opening the bitable, for read operations a few rules are followed:

  • No heap allocation (although, memory may be allocated into the page cache for demand caching by the operating system).
  • Cursors and statistics structures may be allocated on the stack.
  • No locking is required (including internally).
  • No system calls made (memory mapped IO).
  • There is no mutation of the bitable data structure/no reliance on internal shared mutable state.
  • All the code is re-entrant and there is no reliance on statics or global mutable state.

In the reference library, the bitable write completion process is atomic (the table won’t be readable unless it has been written completely); this is achieved by writing a checksummed header as the last step in the process. It also has durable and non-durable modes for finishing the bitable. The durable mode completes writes and flushes (including through the on-disk cache) in a particular order to make sure the bitable is completely on storage; the non-durable mode writes in the same order but does not perform the flushing (for use cases where a bitable is only used within the current power cycle).

The implementation has been designed so that the reading and writing parts of the bitable code are not dependent on each other (although, they are dependent on some small common parts), even though they are currently built into the library together. The library is small enough (it can fit in the instruction cache entirely) that you don’t really need to worry about this in general.

Platform and portability wise, the implementation also uses UTF-8 for all file paths (including on Windows). The table files produced by the bitable system use platform endianness etc, so they may not be portable between platforms.

Note that this is a side project for me and has been done in very limited time, so there is plenty of room for improvement, extra work and further rigor to be applied.

Does it actually perform decently?

All these benchmarks were performed on my Windows 8.1 64bit machine with 32GB of RAM, a Core i7 4790 and a Samsung 850 Pro 512GB SSD. All tests use 4KB pages and 8 byte key and value alignment. Note that these benchmarks are definitely not the final say on performance and I haven’t had time to perform adequate bench-marking for multi-threaded reads.

Our test dataset uses 129,672,136 key value pairs, with 24 byte keys (latitude and longitude as 8 byte doubles, as well as a 64 bit unix millisecond time stamp), matched up to an 8 byte value. It’s 3.86 GiB, packed end to end.

Reading the data set from a warm memory mapped file and creating the bitable takes (averaged over 4 runs) 11.01 seconds in the non-durable mode and 13.08 seconds in the durable mode. The leaf level is 4.84 GiB and the 3 branch levels are 34 MiB, 240 KiB and 4 KiB respectively.

From this dataset, we then selected 2,026,127 keys evenly distributed across the dataset, then randomly sorted them. We then used these for point queries (on a single thread), which on a cold read of the bitable averaged 7.51 seconds (averaged over 4 runs again) and on warm runs 1.81 seconds (keys and values were read and verified). That’s well over a million randomly ordered key point queries per second on a single thread.



