Warning, this will get very technical, very quickly. Not for the faint of brain. You have been warned.
Reusing vertex data with indexed drawing
Blender uses tessellation to convert quads and ngons to triangles, the only diet a GPU can consume. Currently blender writes all data associated with a triangle in a buffer and draws the buffer with a single command. For simplicity let’s consider an Ngon with 5 vertices with position and normal data. The problem is that this introduces quite a lot of data duplication as you can see in the following picture:
As is evident, Position 1, Normal 1, Position 3, Normal 3, and Position 4, Normal 4 are used duplicated for each triangle that uses those data. Notice that these data are identical. Blender does allow normals of different polygons to be different for each vertex, but vertices and normals that belong to the same polygon are the same. Therefore they can be reused.
OpenGL has an easy way to to reuse data, called “indexed drawing”. Using this, what we do is upload all vertex and normal data once and then use indices to create triangles from these data. This looks like this:
This does not only de-duplicate vertex data but has another benefit. GPUs have a small cache of vertex indices where vertices that are transformed by the vertex shader are stored. Every time a new vertex index is encountered, the GPU checks if the index exists in its cache and if it does, the GPU avoids all shader work on that vertex and reuses the result of the cache. Not only have we eliminated data duplication, but every time a duplicated vertex is encountered, we get it for free (almost).
Vertex indices require a small amount of storage, which is 1 integer per vertex, or 3 per triangle, so they are not free. Also, they only save us memory if we use quads and ngons. In full triangle mesh case, we end up using more memory (remember, those tricks work only if triangles belong to the same ngon). However, given that good topology is based on quad meshes and that data savings get more substantial with more complex data formats, the benefits by far outweigh the issues. Here we only contemplated on a simple position – normal format, but if we include UV layers, tangents and whatnot, cost per vertex is much higher and so is the cost of data duplication and the savings we get when we avoid it.
When benchmarking indexed drawing in the blender institute, I found a pleasant surprise: Even though full triangle meshes need some extra storage for indices and will not use tranformed vertex cache, the NVIDIA driver in my GPU still draws such meshes faster. This is quite weird because the data that are sent to the GPU are actually more but this is still a positive indication to go on with this design.
Finally, we can go even further and blend together vertices from nearby polygons whose data are exactly the same. This gets much more complex with heavier vertex formats and could lead to slower data upload due to CPU overhead to detect identical vertices. Also it breaks individually uploading loop data (see below) because any change in any data layer will potentially invalidate those identical indices.
Testing of this optimization in a local branch gives about 25% reduction in render time compared to master and those optimizations will be part of blender for version 2.76.
Easy hiding, easy showing
Using indexed drawing is not only useful for speed and memory savings. It also allows us to rearrange the order of how triangles are drawn on will. This is especially useful if we want to draw polygons with the same material together: Instead of rearranging their data, we just rearrange the indices (less data to move around). But there is one use case where indexed drawing can really help: Hidden faces.
A lot of blender’s drawing code checks if a face is hidden and then it displays it. This check is done every frame
for every face. However there are few tools that invalidate those hiding flags. In fact we don’t need to do this check every frame, but instead cache the result and reuse it.
By using indexed drawing, we can arrange the indices of hidden triangles to be placed last on the triangle list. This makes it quite easy to draw visible only triangles, by just drawing up to the place where the hidden triangle indices begin. Blender master now employs such an optimization in wireframe drawing which reduces drawing overhead by about 40%
Update data when needed
Another big issue with blender is that we upload data to the GPU too often. A scene frame change will cause every GPU buffer to be freed, causing a full scene upload to the GPU. Apparently we don’t want that. Instead we want a system where certain actions invalidate certain GPU data layers. For example, if UV data are manipulated it makes no sense to reupload position or normal data to the GPU. If modifier stack is comprised of deform only modifiers, it should have a way to only reupload position data to the GPU for final display.
For this we need a system like the dependency graph, where certain operations trigger an update of GPU data. Without such a system, the only way to ensure that we see a valid result is to upload all data again and again to the GPU every frame. Which is pretty much what is happening right now in blender for the GLSL/material mode.
GLSL material mode basically iterates through every face of every mesh in the scene every time the window is refreshed and gathers the same data over and over, independently of whether they have changed or not. If we want to avoid this we need to cache those data but if we do that, then we also need to be able to invalidate them if an operation changes them, so that they are uploaded to the GPU properly and the user sees the result of that operation.
This is not the result of crappy programming, rather it’s due to the history of how blender’s drawing code evolved from an immediate mode drawing pipeline, where the only way to draw meshes was to basically re-upload all data every frame.
Bottom line, fast GLSL view means having such a system in place. Fast PBR materials and workflow shaders from the Mangekyo project imply fast GLSL, which means having such a system. So it is no wonder that we first have to tackle such a target first if we want to have a fancier viewport with decent performance.