Over the past several months sculpt mode underwent a large rewrite. Since the project has wrapped up, this post gives an overview of what changed.
Unlike most other development projects, this one had no effect on the interface. Before and after the project, Blender looked exactly the same. Typically that should raise some eyebrows, because it often means developers are prioritizing work based on its effect on the code rather than its utility to users. In this case, though, problems with the code had made feature development significantly harder over the years, and refactoring came with plenty of potential performance improvements.
Overall, for those who want to skip all the technical details, entering sculpt mode in Blender 4.3 is over 5x faster, brushes themselves are about 8x faster, and memory usage is reduced by about 30%. For actual visible changes to sculpting in 4.3, see brush assets. For a full list of the refactor work, see the task.
Entering Sculpt Mode
Entering sculpt mode was known to be quite slow. Based on profiles, it also looked much slower than it should be, since it was completely single threaded.
It turns out Blender was bottlenecked by two things: building the BVH tree that accelerates spatial searches and raycasting and uploading the mesh data to the GPU for drawing.
Improving the BVH build time was a months-long iterative process of finding bottlenecks with a profiler, addressing them, and cleaning up the code to make further refactoring possible. Adding trivial multi-threading to the calculation of bounds and other temporary data was the most significant improvement, at almost 5x. Beyond that, reducing memory usage improved performance by another 30%, and simplifying the spatial partitioning of face indices with the C++ standard library gave another 30%. Finally, changing the BVH from storing triangles to storing faces (a quad mesh has half as many faces as triangles!) improved performance by another 2.3x.
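For illustration, the standard-library partitioning might look something like this minimal sketch; the names here are placeholders rather than Blender's actual code.

```cpp
#include <algorithm>
#include <array>
#include <iterator>
#include <span>

using float3 = std::array<float, 3>;

/* Split a node's face indices around a spatial plane with std::partition
 * instead of a hand-written partitioning loop. Returns the number of
 * faces on the "less than" side of the split. */
static int partition_faces(std::span<int> faces,
                           std::span<const float3> face_centers,
                           const int axis,
                           const float split)
{
  const auto mid = std::partition(faces.begin(), faces.end(), [&](const int face) {
    return face_centers[face][axis] < split;
  });
  return int(std::distance(faces.begin(), mid));
}
```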
Lessons for Developers
- Any array the size of a mesh is far from free. We should think hard about whether all the data in the array is really necessary.
- Any algorithm should clearly separate serial and parallel parts. Any loop that can be done in parallel should be inside a `parallel_for`.
- We shouldn’t be reimplementing common algorithms like partitioning; that makes code so scary and weird that no one touches it for years.
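As a sketch of that second-to-last point, a trivially parallel loop using Blender's threading API from BLI_task.hh might look like this; the function and names are hypothetical.

```cpp
#include "BLI_math_vector_types.hh"
#include "BLI_span.hh"
#include "BLI_task.hh"

using namespace blender;

/* The loop body is trivially parallel, so the whole loop goes inside a
 * parallel_for, which splits the index range into chunks across threads. */
static void translate_positions(MutableSpan<float3> positions, const float3 &offset)
{
  threading::parallel_for(positions.index_range(), 4096, [&](const IndexRange range) {
    for (const int i : range) {
      positions[i] += offset;
    }
  });
}
```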
Drawing
There is a fundamental cost of uploading geometry data to the GPU and we will always be bottlenecked to some extent by the large amount of data we need to render. However, as a tweaked version of code from 15 years ago, sculpt mode drawing had enough overhead and complexity that significant improvements were possible.
The GPU data for the whole mesh is split into chunks, with one chunk per BVH node. One main problem with the old data upload was its outer loop over nodes. That forced all the book-keeping to be duplicated for every node. Often just focusing on simplifying the code gave performance improvements indirectly.
- Removing two levels of function call indirection for multires data upload roughly doubled the performance, and removing function calls for every mesh edge gave another 30% improvement.
- The main change to the drawing code was a rewrite to avoid duplicating work per BVH node, add multi-threading, and change the way we tag changed data. This improved memory usage by roughly 15% (we now only calculate viewport wireframe data if the overlay is actually turned on), and entering sculpt mode became at least 10% faster.
- GPU memory usage was reduced by almost 2x using indexed drawing to avoid duplicating vertex data for every single triangle. Now vertex data is only duplicated per face corner.
- Previously, sculpting on a BVH node would cause every single attribute to be reuploaded to the GPU. Now we only reupload attributes that actually changed. For example, changing face sets only reuploads face sets. Tracking this state only costs a single bit per node.
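A minimal sketch of how such per-attribute change tracking can work, with hypothetical names and layout:

```cpp
#include <cstdint>

/* One dirty bit per attribute per node: a face set edit tags only the
 * face set bit, so the next redraw re-uploads only that buffer. */
enum : uint8_t {
  DIRTY_POSITION = 1 << 0,
  DIRTY_NORMAL = 1 << 1,
  DIRTY_MASK = 1 << 2,
  DIRTY_FACE_SET = 1 << 3,
};

struct NodeGPUState {
  uint8_t dirty_attributes = 0;
};

static void tag_face_sets_changed(NodeGPUState &node)
{
  node.dirty_attributes |= DIRTY_FACE_SET;
}
```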
BVH Tree Design
Previously, the sculpt BVH tree, often referred to as the “PBVH” (Paint Bounding Volume Hierarchy), was a catch-all store for any data needed anywhere in sculpt mode. To reduce the code’s spaghetti factor and clarify the design, we wanted to focus the BVH on its goal of accelerating spatial lookups and raycasting. To do that we removed references to mesh visibility, topology, positions, colors, masks, the viewport clipping planes, back pointers to the geometry, etc. from the BVH tree. All of this data was stored redundantly in the BVH tree, so whenever it changed, the BVH tree needed to change too. Now the design is more focused and it’s much easier to understand the purpose of the BVH.
Another fundamental change to the BVH was replacing each node’s references to triangles with references to faces. In a typical quad mesh there are twice as many triangles as faces, so this allowed us to halve a good portion of the BVH tree’s memory overhead.
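A simplified sketch of the difference, with illustrative types rather than the actual node layout:

```cpp
#include <vector>

/* Before: nodes referenced triangles, roughly two entries per quad. */
struct NodeOld {
  std::vector<int> triangle_indices;
};

/* After: nodes reference faces, one entry per quad; triangles can be
 * derived from the faces when actually needed. */
struct NodeNew {
  std::vector<int> face_indices;
};
```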
Brush Evaluation
To evaluate a brush, regions (BVH nodes) of the mesh are first tested roughly for inclusion within its radius. For every vertex in each of these regions, we calculate a position translation and the brush’s strength. The vertex strength includes more granular filtering based on the brush radius, mask values, automasking, and other brush settings.
Prior to this project, all these calculations were performed vertex by vertex. For each vertex, we retrieved the necessary information, calculated the deformation and the relative strength and then finally applied the brush’s change. Because mesh data is stored in large contiguous arrays, it is inefficient from a memory perspective to process all attributes for a particular vertex at once, as this likely results in many cache misses and evictions.
While the previous code was somewhat concise, handling all three sculpt mesh types (“regular meshes”, dynamic topology, multires) at once, this “generic processing” had some significant negative side effects:
- The old brush code was hard to reason about because of C macros and the combination of multiple data structures in one loop.
- The structure had little opportunity for improved performance because of runtime switching between data structures and the lowest-common-denominator effect of handling different formats.
- A “do everything for each vertex” structure has memory access patterns that don’t align with the way data is actually stored.
Brush code now processes a single step for all the vertices in a node at once, splitting the work into very simple hot loops that can use SIMD, have much more predictable memory access patterns, and branch far less per vertex.
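A minimal sketch of that structure, with illustrative names rather than a real brush implementation:

```cpp
#include "BLI_math_vector_types.hh"
#include "BLI_span.hh"

using namespace blender;

/* Each stage runs over every vertex in the node as its own simple hot
 * loop, keeping memory access contiguous and easy to auto-vectorize. */
static void apply_brush_to_node(MutableSpan<float3> positions,
                                const Span<float> distances,
                                const Span<float> masks,
                                MutableSpan<float> factors,
                                const float3 &translation)
{
  /* 1. Falloff based on distance from the brush center. */
  for (const int i : factors.index_range()) {
    factors[i] = 1.0f - distances[i];
  }
  /* 2. Reduce the strength where vertices are masked. */
  for (const int i : factors.index_range()) {
    factors[i] *= 1.0f - masks[i];
  }
  /* 3. Apply the translation, scaled per vertex by the combined factor. */
  for (const int i : positions.index_range()) {
    positions[i] += translation * factors[i];
  }
}
```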
For further reference, here is a change that refactored the clay thumb brush. Though the new code has more lines, it’s more independent, flexible, and easier to change.
Proxy System
Previously, brush deformations were accumulated into temporary “proxy” storage on each BVH node. This accumulation occurred for each symmetry iteration until the end of a given brush step, at which point the data was written to the evaluated mesh positions, shape key data, and the base mesh itself.
We completely removed the proxy system as part of refactoring each brush. Instead, brushes now write their deformation immediately during each symmetry step calculation. This avoids storing temporary data and improves cache access patterns by writing to memory that is already cached. Removing the proxy storage also reduced the size of BVH nodes by around 40%, which aligns with our ongoing goal of improving performance by splitting the mesh into more nodes.
Thread Contention
Profiling revealed a significant bottleneck during brush evaluation: just storing the mesh’s initial state for the undo system was taking 60% of the time. When something so simple is taking so much time, there is clearly a problem.
The issue turned out to be that most threads involved in brush evaluation were waiting for a lock while a single thread did a linear search through the undo data, trying to find the values for its BVH node:
```cpp
/* Every thread held the lock while linearly scanning all undo nodes
 * for the one matching its BVH node and data type. */
for (std::unique_ptr<undo::Node> &unode : step_data->nodes) {
  if (unode->bvh_node == bvh_node && unode->data_type == type) {
    return unode.get();
  }
}
```
Simply changing the vector to a `Map` hash table gave us back that time and significantly improved the responsiveness of brushes.

```cpp
return step_data->undo_nodes_by_pbvh_node.lookup({node, type});
```
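For context, here is a sketch of what the replacement container might look like, assuming Blender's `Map` from BLI_map.hh; the key and value types are stand-ins, not the real sculpt undo types.

```cpp
#include <memory>
#include <utility>

#include "BLI_map.hh"

/* Stand-in types for this sketch; the real sculpt undo types differ. */
struct BVHNode;
enum class DataType { Position, Mask, FaceSet };
struct UndoNode {};

/* Undo nodes keyed by (BVH node, data type), so finding a node's undo
 * data is a constant-time hash lookup instead of a locked linear search. */
struct StepData {
  blender::Map<std::pair<const BVHNode *, DataType>, std::unique_ptr<UndoNode>>
      undo_nodes_by_pbvh_node;
};
```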
Though there was plenty of refactoring required to make this possible, the nice part is how little time with a profiler is often needed to identify significant improvements.
Undo Data Memory Usage
Undo steps also became slightly more memory efficient in 4.3. The overhead of each BVH node’s undo storage for a brush stroke was reduced 10x, from about 4 KB to about 400 bytes.
In the future we would like to look into compressing stored undo step data, which could reduce memory usage significantly.
For another example of thread contention, consider the counting of undo step memory usage. Undo data is created from multiple threads, and each thread incremented the same memory usage counter variable. Simply counting memory usage later with a proper reduction gave a 4% improvement to brush evaluation performance.
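A minimal sketch of that kind of fix, with a hypothetical undo node type:

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <span>

/* Hypothetical undo node with a known memory footprint. */
struct UndoNode {
  size_t data_size;
};

/* Instead of every thread bumping one shared counter while undo data is
 * created, sum the per-node sizes in a single pass afterwards. */
static size_t total_undo_memory(std::span<const UndoNode> nodes)
{
  return std::transform_reduce(nodes.begin(), nodes.end(), size_t(0), std::plus<>(),
                               [](const UndoNode &node) { return node.data_size; });
}
```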
In yet another thread contention problem, writing “true” to a single boolean from multiple threads turned out to be a significant issue for the calculation of the average mesh normal under the cursor. The boolean was logically redundant, so just removing it improved brush evaluation performance by 2x.
Multi-Resolution Modifier
Most of these performance improvements were targeted at base mesh sculpting, where there was more low-hanging fruit. However, multires changes followed the same design, and there were a few more specific optimizations for it too. Most significantly, moving to a struct-of-arrays format for positions, normals, and masks gave a 32% improvement to brush performance and simplified the code.
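A simplified sketch of the layout change, with stand-in types rather than the actual multires data structures:

```cpp
#include <array>
#include <vector>

using float3 = std::array<float, 3>;

/* Before: array-of-structs, one interleaved element per grid vertex, so a
 * loop that only needs positions drags normals and masks into the cache. */
struct GridElement {
  float3 position;
  float3 normal;
  float mask;
};

/* After: struct-of-arrays, one contiguous array per attribute. */
struct GridData {
  std::vector<float3> positions;
  std::vector<float3> normals;
  std::vector<float> masks;
};
```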
Some multires workflows still have bottlenecks, though, such as subdivision evaluation and poor performance at very high subdivision levels.
The End!
Thanks for reading! It was a pleasure to be able to iterate on the internals of sculpt mode. Hopefully the changes can be a solid foundation for many future improvements.
Support the Future of Blender
Donate to Blender by joining the Development Fund to support the Blender Foundation’s work on core development, maintenance, and new releases.