Home Software Development Trash!
This is just an interesting little tidbit that I stumbled across a fair few years ago (2006 or so I think) and have been meaning to blog about (I might have and forgotten) ever since. There is a neat trick of geometry that allows you to solve yes-no visibility queries from a point for polygon soups with nothing more than ray-casting and a few tricks of geometry, with no polygon clipping and no sorting.
I’m not claiming this will actually be useful to anyone, in fact, floating point precision problems make it difficult to do in practice. But, it may come in handy for someone and it may well be practical for drastically simplified versions of a scene.
Firstly, we’ll talk about visibility queries without any view restrictions (omni-directional from a point). A simple assertion; for non intersecting geometry (intersecting geometry can be handled, with the caveat that the line segments where surfaces intersect must be treated as an edge), a polygon or polygonal object is visible if a ray from the query point passing through 2 edges on the silhouette of any object in the scene hits the object without being occluded.
If you are familiar with the rule for visibility queries between polygons; that there must exist an extremal stabbing line going through 4 edges between 2 polygons for them to be visible to each other, this is a natural extension to that, because the constraint of the point in the query (the eye, the camera, or whatever you want to call it) is the equivalent to the constraint of 2 edges. So we are left with the point and 2 edges to define an extremal stabbing line; of which one must exist for the visibility query to return a positive result.
Now, there is the case of determining what edges, in particular, will provide us with a ray to cast and what the ray will be. Luckily, there is a straight forward answer to this:
- They must be on the silhouette of an object and if the object is a polygon connected to other polygons, this includes the edges of the polygon in the query.
- For such a ray to exist for 2 edges and point, each edge must straddle the plane through the other edge and the query point.
- The ray going through these 2 edges is the intersection of these planes; so it must be perpendicular to both planes. This means the direction of the ray is the cross product of the normals of the planes going through each edge and the query point.
- The origin of the ray is obviously the query point.
- This ray must obviously intersect with the query object too (if one or more of the edges is on the query object, this is a given).
There is of course a special case to this; 2 edges often intersect at a vertex, in which case there will exist a ray going through the query point and the vertex.
Anyway, to solve the query, you enumerate all the query point silhouette edge-edge cases and see if you find one that hits the query object without being occluded by another object. If such a ray exists, then the object is visible. You could do a whole scene at once by enumerating all the query point edge-edge cases and knowing that only the objects that got hit were visible.
You could, potentially, do something similar to a beam-tree to accelerate finding which edges straddled the planes of other edges. If I were to make a terrible pun, I would say this is a corner case of a beam-tree (in that, the rays in question are essentially the corners of significant plane intersections in the beam-tree).
This also works in reverse for the view frustum and portal cases; in this case the ray through the query point and edges must pass through the portals and the view frustum. This allows you to perform portal queries without clipping the portals.
There is also, I suspect, a fair bit of temporal coherence for the edges involved in a visibility query; if the query point moves only slightly, it is likely the same edges will produce an extermal stabbing line that proves visibility.
I’ve had a second pass at the compression/decompression algorithm now, after some discussion with davean on #flipCode. The new iteration of the algorithm (which is available alongside the original) is a fair bit faster for decompression, slightly faster for compression and has a slightly better compression ratio. The updated algorithm doesn’t however support degenerate triangles (triangles that have a duplicate vertex indice), where as the original does.
The original is discussed here.
The main change to the algorithm is to change from per vertex codes to a single composite triangle code. Currently for edge FIFO hits there are two 2 bits codes and for other cases, three 2 bit codes. There are a lot of redundant codes here, because not all combinations are valid and some code combinations are rotations of each other. The new algorithm rotates the vertex order of triangles to reduce the number of codes, which lead to one 4 bit code per triangle. This nets us a performance improvement and a mild compression gain.
There were also some left over codes, so I added a special encoding that took the edge + new vertex triangle case and special cased it for the 0th and 1st edges in the FIFOs. This could potentially net very big wins if you restrict your ordering to always use a triangle adjacent to the last one, trading vertex cache optimisation for compression.
Here’s the performance of the old algorithm:
|MODEL||TRIANGLES||UNCOMPRESSED||COMPRESSED||COMPRESSION TIME||DECOMPRESSION TIME|
And here is the performance of the new one:
|MODEL||TRIANGLES||UNCOMPRESSED||COMPRESSED||COMPRESSION TIME||DECOMPRESSION TIME|
Overall, there seems to be a 30%-40% speedup on decompression and a modest 3% improvement on compression.
Also, I’d like to thank Branimir Karadžić, whose changes I back-ported, which should hopefully add support for 16bit indices and MSVC 2008 (unless I’ve broken them with my changes - I don’t have MSVC 2008 installed to test).
If you haven’t checked out the source, it’s available on GitHub.
Update - There is a new post about an updated version of the algorithm listed here.
A few weeks ago, I saw Fabian Giesen’s Simple lossless(*) index buffer compression post (and the accompanying code on GitHub) and it made me think about a topic I haven’t thought about in a while, index buffer compression. A long time ago, at university, I built a mesh compressor as a project. It had compression rates for triangle lists of less than 2bits a triangle (and did pretty well with vertices too), but it had a few limitations; it completely re-ordered the triangles and it required that at most 2 triangles share an edge. The performance was semi-decent, but it was also relatively heavy and complicated, as well as requiring a fair bit of dynamic allocation for maintaining things like connectivity structures and an edge stack.
Re-ordering the triangles has the disadvantage that it destroys post-transform vertex cache optimisation (although, it did optimise order for the pre-transform vertex cache). Now, the vertex cache utilisation of these algorithms isn’t generally catastrophic, as they use locality, but definitely worse than those produced by Tom Forsyth’s algorithm.
Fabian’s algorithm has the advantage of keeping triangle order, while performing better when vertex cache optimisation has been performed. It’s also incredibly simple, easy to implement and uses very little in the way of extra resources. The key insight is that triangles that have been vertex cache optimised are likely to share an edge with the previous triangle.
But what if we extend this assumption a little, in that a triangle is also likely to share edges and vertices with other recent triangles. This is pretty obvious as this is exactly what the vertex cache optimisation aims for when it re-orders the triangles.
Post-transform vertex caches in hardware are typically implemented as fixed size FIFOs. Given that is a case we’ve already optimized for (and fixed sized ring-buffer FIFOs are very cheap to implement), this is the concept we’ll use as the basis for our algorithm. As well as a vertex index FIFO cache, we’ll also introduce an edge FIFO cache that has the most recent edges.
What about cache misses? Well, these fall into two categories; new vertices that haven’t yet been seen in the triangle list and cache misses for repeated vertices. If we re-order vertices so that they are sorted in the order they appear in the triangle list, then we can encode them with a static code and use an internal counter to keep track of the index of these vertices.
Cache misses are obviously the worst case for the algorithm and we encode these relative to the last seen new vertex, with a v-int encoding.
So, we end up with four codes (which, for a fixed code size, means 2 bits):
- New vertex, a vertex we haven’t seen before.
- Cached edge, edges that have been seen recently and are in the FIFO. This is followed by a relative index back into the edge FIFO (using a small fixed bit with; 5bits for a 32 entry FIFO).
- Cached vertex, vertices that been seen recently and are in the FIFO. This is followed by a relative index back into the vertex FIFO (same fixed bit with and FIFO size as the cached edge FIFO).
- Free vertex, for vertices that have been seen but not recently. This code is followed by a variable length integer encoding (like that seen in protobuf) of the index relative to the most recent new vertex.
Triangles can either consist of two codes, a cached edge followed by one of the vertex codes, or of 3 of the vertex codes. The most common codes in an optimised mesh are generally the cached edge and new vertex codes.
Cached edges are always the first code in any triangle they appear in and may correspond to any edge in the original triangle (we check all the edges against the FIFO). This means that an individual triangle may have its vertices specified in a different order (but in the same winding direction) than the original uncompressed one. It also simplifies the implementation in the decompression.
While we use a quite simple encoding, with lots of fixed widths, the algorithm has also been designed such that an entropy encoder could be used to improve the compression.
With the edge FIFO, for performance on decompression, we don’t check an edge is already in the FIFO before inserting it, but we don’t insert an edge in the case that we encounter a cached edge code. This is the same for cached vertices. This means the decompression algorithm does not have to probe the FIFOs.
I’ve provided a simple proof of concept implementation on GitHub. These results were obtained with a small test harness and this implementation, using Visual Studio 2013 and on my system (with a bog standard i7 4790 processor).
To make things consistent, I decided to make my results relative to those provided in Fabian’s blog, using the same Armadillo mesh. His results use a file that encodes both the positions of the vertices and the indices in a simple binary.
The Armadillo mesh has 345,944 triangles, which is 4,151,328 bytes of index information. After running a vertex cache optimisation (using Tom Forsyth’s algorithm) and then compressing with my algorithm, the index information takes up 563,122 bytes, or about 13 bits a triangle. By comparison, just running 7zip on just the optimised indices comes in at 906,575 bytes (so, it quite handily beats much slower standard dictionary compression). Compression takes 17 milliseconds and decompression takes 6.5 milliseconds, averaged over 8 runs. Even without the vertex cache optimisation, the algorithm still compresses the mesh (but not well), coming in at 1,948,947 bytes for the index information.
For comparison to Fabian’s results and using the same method as him, storing the uncompressed vertices (ZIP and 7z results done with 7-zip):
|STAGE||SIZE||.ZIP SIZE||.7Z SIZE|
|Compression (Vertex Optimised)||2577k||1518k||1276k|
Performance wise, the algorithm offers linear scaling, decompressing about 50 million triangles a second and compressing about 20 million across a variety of meshes with my implementation on a single core. For compression, I used a large pre-allocated buffer for the bitstream to avoid the overhead of re-allocations. Compression is also very consistently between 12 and 13bits a triangle.
The compressed and uncompressed columns in the below table refer to sizes in bytes. The models are the typical Stanford scanned test models.
|MODEL||TRIANGLES||UNCOMPRESSED||COMPRESSED||COMPRESSION TIME||DECOMPRESSION TIME|
Notes About the Implementation
Firstly, the implementation is in C++ (but it wouldn’t be too hard to move it to straight C, if that is your thing). However, I’ve eschewed dependencies where-ever possible, including the STL and tried to avoid heap allocation as much as possible (only the bitstream allocates if it needs to grow the buffer). I’ve currently only compiled the implementation on MSVC, although I have tried to keep other compilers in mind. If you compile somewhere and it doesn’t work, let me know and I’ll patch it.
The implementation provides a function that compresses the indices (CompressIndexBuffer) and one that decompresses the indices (DecompressIndexBuffer), located in IndexBufferCompression.cpp/IndexBufferCompression.h and IndexBufferDecompression.cpp/IndexBufferDecompression.h. There’s also some shared constants (nothing else is shared between them) and some very simple read and write bitstream implementations. While this is a proof of concept implementation, it’s designed such that you can just drop the relevant parts (compression or decompression) into your project without bringing in anything else. Compression doesn’t force the flushing of the write stream, so make sure you call WriteBitstream::Finish to flush the stream after compressing, or the last few triangles might be incorrect.
There has been little optimisation work apart from adding some force-inlines due to compiler mischief, so there is probably some room for more speed here.
I’ve allocated the edge and vertex FIFOs on the stack and made the compression and decompression functions work on whole mesh at a time. It’s possible to implement the decompression to be lazy, decompressing a single (or batch) of triangles at a time. Doing this, you could decompress in-place, as you walked across the mesh. I’m not sure or not if the performance is there for you to do this in real-time and that probably varies on your use case.
On reviewing the new C++ standard, C++14, I’ve come to the conclusion that modern C++ has no home. C and a very limited subset of C++ (colloquially known as C+, which hasn’t seen many improvements since C++98), are at home as systems languages, because very few languages offer the capability to work with constructs that are known to be exculpated from unpredictable performance overhead. They also offer the ability to play with memory in a very raw but controlled way, which can be very useful for systems languages. There are a plethora of higher level languages, run-times and frameworks which are far better at being application languages (both server and client side) than C++ and offer good enough performance, better semantics, far superior standard libraries and far greater productivity than modern C++.
On top of this, C++ has continued to become more complicated (C++98 was already a complicated and semantically obtuse beast) and even as someone who has been working with the language for over 15 years, it can be non-trivial to understand what someone using some of the newer features is trying to achieve. Beyond that, someone trying to make the features of C++ work precludes features being added that actually add something to the language (type safe discriminated unions and de-structuring!). Compile times somehow manage to get worse with each iteration of the language and compiler error messages when something goes wrong with modern C++ mechanisms are notoriously hard to understand. To top this off, the non-trivial semantics make it hard to reason about the performance implication of code when that is the only reason left to use C++ over other languages that offer higher level features.
A large part of the reason for this is the persistence with using textual replacement as the main means to achieve anything in C++. This is the hangover from C, where header files, clumsy separable compilation and the preprocessor were king. The entire problem with templates and C++ compile times in general comes from this mechanism, where instead of relying on sensible and well-defined compiler semantics that transform code into well-defined types that the compiler can hold easily in memory between compilation units (where the compiler only needs to know the symbols and types match, outside of optimisation), templates are essentially an expanded blob of text with some symbols replaced and they have to be expanded (with checks) for every separate compilation unit (sometimes several times), where the semantics might be different each time. In fact, every type and function known to the compiler has to be fully expanded and parsed (with possibly changing semantics) for every compilation unit. Is there a greater sign that this mechanism is broken than the pimpl concept?
Now, in some circles, what I’m going to say is blasphemy, but, really it is time for language mechanisms based on textual replacement to have a pillow put over their collective faces and for the rest of us to throw a drinking fountain through the window and escape. It’s the reason we can’t have nice things. Other languages manage to provide generic typing and meta-programming mechanisms without mile long compiler messages or one hour compile times (yes, even with types that have different sizes and don’t require boxing). I often see C++ programmers complaining about C++ language features as if the concepts (like generic typing, lambdas/closures and meta-programming) are themselves bad, when it is not the concept, but the C++ implementation that is so horrible.
As an aside, exceptions are also implemented poorly in C++, but for other reasons. In the exception holy war, people tend to pick a side based on whether performance or robustness is the primary goal and C++ can get away with a poor exception handling mechanism because many of the people that use it are on the performance side of the fence and avoid exception handling anyway.
Yes, we have been promised a module system, but I don’t really think it will ever be enough, because the language already brings too much baggage. I think it has actually reached the stage where we need a new language in this space, that offers what C+ and C offer (apart from textual replacement and the clumsy separable compilation mechanism), while bringing in far more sensible versions of the features C++ is trying to provide.
I’m attempting a return to regular writing a regular blog, because I need somewhere to get ideas down and doing it in a way where you can’t be ridiculed in public is just boring.
Page: 3 of 3