<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jred1.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jred1.github.io/" rel="alternate" type="text/html" /><updated>2025-11-05T22:57:58+00:00</updated><id>https://jred1.github.io/feed.xml</id><title type="html">Jared Cascino</title><subtitle>Software Engineer</subtitle><entry><title type="html">Animatable Nerf Preparation</title><link href="https://jred1.github.io/Ani-Nerf/" rel="alternate" type="text/html" title="Animatable Nerf Preparation" /><published>2025-10-25T00:00:00+00:00</published><updated>2025-10-25T00:00:00+00:00</updated><id>https://jred1.github.io/Ani-Nerf</id><content type="html" xml:base="https://jred1.github.io/Ani-Nerf/"><![CDATA[<p><img style="margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D.gif" /></p>
<div class="clear">
A framework for developing and comparing volume deformation models, visualized with a raymarching camera. These models can then be translated for use in real-time animatable NeRFs.
</div>
<h2>The Problem</h2>
<p>I want to decouple NeRF rendering from deformation to create a framework for animatable NeRFs. This would allow training a model on arbitrary volume deformations, rather than relying on input videos or physics simulations, to view the NeRF as if it were deformed. Implementation requires addressing several factors within the render pipeline.</p>
<h3>Rendering a NeRF</h3>
<p>The main step in rendering a NeRF is raymarching: cast rays from the camera into a field, sample the NeRF model along those rays, and composite the samples into a final image. To deform the NeRF, I represent the scene with two separate volumes: a "deformed" volume, where the camera resides, and a "canonical" volume, where the static NeRF resides. The camera casts straight rays through the deformed volume, since the deformed NeRF is what I want to view at a given instant of the animation. Then, to get the density and color of the NeRF at a particular ray sample in the deformed volume, I need a model or method that correlates that sample back to a point in the canonical volume. I will refer to this correlation as the "inverse deformation". In essence, I map straight rays in the deformed volume to learned, curved rays within the canonical volume. These curved rays sample the static NeRF, relaying its data back to the corresponding camera samples in order to render a deformed NeRF.</p>
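<p>As a heavily simplified sketch of that pipeline, compositing one ray might look like the following. Here <code>inverse_deform</code> stands in for whatever learned model maps deformed-volume samples back to the canonical volume, and <code>nerf</code> for the canonical radiance field; both are hypothetical placeholders, not the project's actual interfaces.</p>

```python
import math

def render_ray(origin, direction, nerf, inverse_deform, n_samples=64, t_far=4.0):
    """Composite one straight camera ray cast through the deformed volume."""
    dt = t_far / n_samples
    color, transmittance = 0.0, 1.0
    for k in range(n_samples):
        t = (k + 0.5) * dt
        # Straight sample in the deformed volume, where the camera resides...
        p_deformed = tuple(o + t * d for o, d in zip(origin, direction))
        # ...mapped back to the canonical volume, where the static NeRF resides.
        p_canonical = inverse_deform(p_deformed)
        density, radiance = nerf(p_canonical)
        alpha = 1.0 - math.exp(-density * dt)
        color += transmittance * alpha * radiance
        transmittance *= 1.0 - alpha
    return color
```

<p>The deformation is hidden entirely inside <code>inverse_deform</code>; everything else is the standard volume rendering quadrature.</p>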
<p>The inverse deformation problem is fundamentally simple to understand; however, it gets substantially more complex when real-time performance is a factor. Two key features behind the real-time performance of static NeRFs are found in InstantNGP's implementation: a faster model (compared to previous NeRF implementations), and a volumetric "mask" in the form of a binary voxel-based occupancy grid (which determines whether the NeRF occupies a particular region of space). The first feature is straightforward to translate to a non-static NeRF, but the occupancy grid is where things could get out of hand.</p>

<h3>Where is the NeRF?</h3>
<p>Obviously, an occupancy grid works fine while the NeRF is static, but it fails to function once the NeRF deforms, as the NeRF is no longer where the occupancy grid expects it to be. The inverse deformation model could be sampled at every step until a ray reaches the static occupancy grid in the canonical volume, but this largely forfeits the speed-up of using the occupancy grid as a sampling mask. So why not just create another occupancy grid for the deformed volume? Caching occupancies for all deformed frames can certainly work for a single deformation, or even a short time-tracked animation: define a frame index to render, raymarch with that frame's deformed occupancy grid, and only sample the inverse deformation model and the NeRF when rays pass through the deformed grid. But what about expanding past fixed-frame-count animations, such as animating based on a combination of parameters? Anything from pose animation with a skeleton, to face animation via blendshapes, to secondary effects such as muscle or hair simulation driven by those stricter parameters. A 128^3 binary occupancy grid may only be a fraction of a megabyte, but pre-calculating and storing millions of them to cover every scenario would be a poor use of compute and memory.</p>
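<p>For scale, the bitfield arithmetic behind that size figure, plus the kind of packed occupancy lookup it implies, can be sketched as follows. The class layout is my own illustration of the idea, not InstantNGP's actual data structure.</p>

```python
class OccupancyGrid:
    """A binary voxel occupancy grid packed one bit per voxel."""

    def __init__(self, res=128):
        self.res = res
        self.cells = bytearray(res ** 3 // 8)  # 128^3 bits = 262144 bytes (~0.26 MB)

    def _index(self, x, y, z):
        # x, y, z are normalized to [0, 1) over the volume's bounding box.
        r = self.res
        return (int(x * r) * r + int(y * r)) * r + int(z * r)

    def set(self, x, y, z):
        i = self._index(x, y, z)
        self.cells[i // 8] |= 1 << (i % 8)

    def occupied(self, x, y, z):
        """Sample mask: skip NeRF evaluation entirely when this is False."""
        i = self._index(x, y, z)
        return bool(self.cells[i // 8] >> (i % 8) & 1)
```

<p>A single grid is cheap; the problem described above is needing millions of them.</p>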
<p>So the problem becomes: how can this be avoided? One strategy is to use a simple, direct point-to-point model for the inverse deformation, while a separate, more complex model predicts the required occupancy grid in real time. That is an interesting approach I would like to test in the future, but this project takes a different route.</p>

<h2>Implementation of the Inverse Deformation Model</h2>
<p>
There are two aspects I want to preserve in a real-time-focused implementation of an animatable NeRF: keep an occupancy grid for sample masking, and avoid multiple inverse deformation model samples per ray before the occupancy grid is reached. In other words, I want the full deformed ray from a single pass of the inverse deformation model, allowing much quicker sampling before the occupancy grid is reached, with no round trips between the CPU and GPU after each inverse deformation sample to decide which points should start sampling the NeRF model and which are still in empty space. Knowing that the ray sampled in the canonical space is a warped/curved version of the ray in the deformed volume, I figured replicating that curve with parametric curves was the way to go.
</p>
<h3>Model Progression</h3>
<p>
Before hopping directly into a parametric curve implementation, I worked my way through progressively more complex/abstract models. In order of testing, they are:
</p>

<h4>1. Direct Point Correlation Model</h4>
<ul>
  <li>For any point (<em>x</em>, <em>y</em>) in the deformed volume, output the correlated point (<em>x’</em>, <em>y’</em>) in the canonical volume</li>
</ul>
<h4>2. Point Offsets Model</h4>
<ul>
  <li>For any point (<em>x</em>, <em>y</em>) in the deformed volume, output the offsets (<em>i</em>, <em>j</em>) that make (<em>x + i</em>, <em>y + j</em>) correlate to the correct point in the canonical volume</li>
</ul>
<h4>3. Ray Origin Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>) and distance <em>t</em> along the ray, output the offsets (<em>i</em>, <em>j</em>) that make (<em>x + i</em>, <em>y + j</em>) correlate to the correct point in the canonical volume</li>
</ul>
<h4>4. Batch Ray Origin Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>), output an array of offsets &lt; <em>i</em>, <em>j</em> &gt; along the entire ray, such that each (<em>x + i</em>, <em>y + j</em>) at a given distance <em>t</em> along the ray correlates to the correct point in the canonical volume</li>
</ul>
<h4>5. Akima Spline Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>) and ray angle, output a set of Akima spline knot parameters &lt; <em>i</em>, <em>j</em>, <em>t</em> &gt; to define parameterized offset splines for each axis: &lt; <em>i</em>, <em>t</em> &gt; for the <em>x</em> axis, and &lt; <em>j</em>, <em>t</em> &gt; for the <em>y</em> axis. The predicted <em>t</em> values are shared between the two axis splines. Use the <em>t</em> value for the current camera ray sample to evaluate the standard Akima spline functions and combine the result with the camera sample (<em>x</em>, <em>y</em>) to get the correlated point in the canonical volume</li>
</ul>
<h4>6. Entry/Exit Akima Spline Model</h4>
<ul>
  <li>For a given entry point into the deformed volume’s bounding box (<em>x_entry</em>, <em>y_entry</em>) and a given exit point of the deformed volume’s bounding box (<em>x_exit</em>, <em>y_exit</em>), output the same splines as the Akima Spline Model</li>
</ul>
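<p>The last two models hinge on evaluating an Akima spline from predicted knots. A minimal, dependency-free scalar version of that evaluation (my own sketch of the standard algorithm, not the project's code) looks like this:</p>

```python
import bisect

def akima_eval(xs, ys, x):
    """Evaluate an Akima spline through knots (xs, ys) at x (xs ascending, >= 3 knots)."""
    n = len(xs) - 1  # number of segments
    m = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n)]
    # Two phantom slopes on each end, per Akima's extrapolation rule.
    e = ([3 * m[0] - 2 * m[1], 2 * m[0] - m[1]] + m
         + [2 * m[-1] - m[-2], 3 * m[-1] - 2 * m[-2]])
    d = []  # derivative at each knot, weighted by neighboring slope changes
    for i in range(n + 1):
        w1, w2 = abs(e[i + 3] - e[i + 2]), abs(e[i + 1] - e[i])
        if w1 + w2 == 0:  # locally linear: fall back to the mean slope
            d.append((e[i + 1] + e[i + 2]) / 2)
        else:
            d.append((w1 * e[i + 1] + w2 * e[i + 2]) / (w1 + w2))
    # Cubic Hermite interpolation on the segment containing x.
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
    h = xs[i + 1] - xs[i]
    s = (x - xs[i]) / h
    h00, h10 = 2 * s**3 - 3 * s**2 + 1, s**3 - 2 * s**2 + s
    h01, h11 = -2 * s**3 + 3 * s**2, s**3 - s**2
    return h00 * ys[i] + h * h10 * d[i] + h01 * ys[i + 1] + h * h11 * d[i + 1]
```

<p>In the 2D setup above, two such splines share the same predicted knot t values: one mapping t to the x-axis offset i, and one mapping t to the y-axis offset j.</p>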

<h2>Training Environment</h2>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D_2.gif" /></p>

<p>The majority of the values (deformation parameters, camera position/rotation/resolution) are configurable. The gif above shows how the ray count can be changed on the fly. In fact, since the inverse deformation model's input/output only considers one ray, the "resolution" can be changed without retraining the model.</p>
<p>The visualization for the environment is as follows:</p>

<ul>
  <li>The red area is the canonical space</li>
  <li>The green area is the deformed space</li>
  <li>The black dots represent the canonical object when in the canonical space and the correlated deformed points in the deformed space</li>
  <li>The yellow rays represent the camera rays</li>
  <li>The pink dots are the spline knots</li>
  <li>The red dots are the predicted values (calculated using the spline knots)</li>
  <li>The blue dots are the ground truth samples</li>
</ul>

<p>The training data is a jittered sample of camera positions along an orbiting path.</p>
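<p>That sampling scheme is simple to reproduce; a sketch (the parameter values here are illustrative, not the ones used in training):</p>

```python
import math, random

def orbit_camera_samples(n, radius=3.0, jitter=0.05, seed=0):
    """Camera positions on a circular orbit, jittered in angle and distance."""
    rng = random.Random(seed)
    samples = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n + rng.uniform(-jitter, jitter)
        r = radius * (1.0 + rng.uniform(-jitter, jitter))
        samples.append((r * math.cos(theta), r * math.sin(theta)))
    return samples
```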

<h2>Results</h2>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D.gif" /></p>
<p>
The model does a decent job of tracking the deformation throughout a camera orbit, but there are some noticeable areas for improvement. Overall accuracy could of course improve, but there is also a "wobble" between very small camera movements. I have tried various loss-calculation improvements, such as accounting for ray smoothness, ray neighbor consistency, and ray reprojection straightness. They have not resulted in significant improvements for the orbital camera path alone, but more testing is needed to find the optimal weightings for these loss functions.
Also, there was no significant difference between using the camera ray origins and using the entry/exit points; however, I believe this is mostly due to the training data being limited to a mostly fixed-distance orbit around the object. Entry/exit points should help the model generalize to free camera movement.
</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A framework for developing and comparing volume deformation models, visualized with a raymarching camera. These models can then be translated to use for real-time animated NeRFs.]]></summary></entry><entry><title type="html">Optimized Centers of Rotation Skinning</title><link href="https://jred1.github.io/OptCoR/" rel="alternate" type="text/html" title="Optimized Centers of Rotation Skinning" /><published>2025-10-17T00:00:00+00:00</published><updated>2025-10-17T00:00:00+00:00</updated><id>https://jred1.github.io/OptCoR</id><content type="html" xml:base="https://jred1.github.io/OptCoR/"><![CDATA[<p><img style="width: 500px;margin: 10px;" src="/images/Opt CoR Comparison.gif" /></p>
<div class="clear">
Implementing the optimized centers of rotation skinning proposed by <a href="https://la.disneyresearch.com/publication/skinning-with-optimized-cors/">this</a> paper in Unity3D. This project includes custom implementations of linear blend skinning (LBS), dual quaternion skinning (DQS), and optimized centers of rotation skinning (Opt. CoR).
</div>

<h2 class="clear">Results</h2>
<p>There are two aspects of performance relevant to this code: pre-processing and real-time.</p>
<h3 id="pre-processing-times">Pre-processing times</h3>
<p>For pre-processing, the bone weights, bone IDs, and optimized CoRs are baked into the UVs of the mesh. The runtime of this preprocessing depends on the size of the mesh and the number of bones. However, by utilizing a compute shader for the compute-heavy task of finding the optimized CoR for each vertex, the entire baking process takes fractions of a second. In my testing, on my PC, baking a fully rigged character took 300-350 milliseconds on average.</p>
<h3 id="real-time-usage-and-bottleneck">Real-time usage and bottleneck</h3>
<p>In a realistic setting, meaning a realistic number of characters with close or medium distance LODs, this alternate skinning is perfectly viable. For example, even with hundreds of characters using Opt. CoR skinning, the game still runs at hundreds of fps. However, compared to the built-in skinning, there certainly is a drop in performance. This is due to a bottleneck created from needing to transfer the bone transform data from the CPU to the GPU every frame. For a future implementation, I would like to explore the use of ECS to alleviate this bottleneck.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Implementing the optimized centers of rotation skinning proposed by this paper in Unity3D. This project includes custom implementations of linear blend skinning (LBS), dual quaternion skinning (DQS), and optimized centers of rotation skinning (Opt. CoR).]]></summary></entry><entry><title type="html">CUDA Cloth Simulation</title><link href="https://jred1.github.io/Cuda-Cloth/" rel="alternate" type="text/html" title="CUDA Cloth Simulation" /><published>2025-10-04T00:00:00+00:00</published><updated>2025-10-04T00:00:00+00:00</updated><id>https://jred1.github.io/Cuda-Cloth</id><content type="html" xml:base="https://jred1.github.io/Cuda-Cloth/"><![CDATA[<video height="250px" autoplay="" controls="" muted="" loop="" playsinline="">
    <source src="/videos/Cloth Sim Video.mp4" type="video/mp4" />
    Your browser does not support the video tag.
  </video>
<div class="clear">
A CUDA-based GPU cloth simulation (mass-spring system) with self collision, object collision, particle pinning, gravity, and wind forces. The goal of this project is to develop and optimize a cloth simulation so that it runs adequately in real time by utilizing GPU computation.
</div>
<h2 class="clear">Structure</h2>
<p>The cloth simulation is built on a mass-spring system: a collection of particles (point masses) connected by a series of springs. The specifics of how the springs connect to the masses are what allow the system to represent a particular type of simulation. In the case of a cloth simulation, the masses represent the vertices of a cloth mesh. To focus on real-time performance rather than on building a mass-spring system for arbitrary meshes, a simple subdivided plane was used. This simplification allows for more directly scalable computation and is still representative, since the subdivided plane can be seen as a patch of a more complex mesh: the majority of any piece of cloth will reflect the properties of the patch because of how the springs connect to each mass.</p>
<p>
<img style="float:right;height: 150px;margin: 10x 0px 10px 10px;" src="/images/spring connections.png" />
The layout of the springs is portrayed in the figure to the right. Each 
dot represents neighboring particles, and the hollow dot in the center 
represents the particle of interest. If the particle of interest is located at 
coordinates (  i  ,  j  ) in a 2D arrangement of particles,  the springs attached to it 
are as follows:
</p>
<p class="clear">
<ul>
<li>stretch springs (blue): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+1, <em>j</em>), (<em>i</em>-1, <em>j</em>), (<em>i</em>, <em>j</em>+1), and (<em>i</em>, <em>j</em>-1) </li>
<li>shear springs (green): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+1, <em>j</em>+1), (<em>i</em>-1, <em>j</em>+1), (<em>i</em>-1, <em>j</em>+1), and (<em>i</em>-1, <em>j</em>-1)</li>
<li>bend springs (red): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+2, <em>j</em>), (<em>i</em>-2, <em>j</em>), (<em>i</em>, <em>j</em>+2), and (<em>i</em>, <em>j</em>-2)</li>
</ul>
</p>
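<p>The three spring families can be generated mechanically from those offsets. A sketch that emits each spring exactly once (by walking only the positive directions):</p>

```python
def grid_springs(n):
    """Unique stretch/shear/bend spring pairs for an n-by-n particle grid."""
    offsets = {
        "stretch": [(1, 0), (0, 1)],   # axis-aligned neighbors
        "shear":   [(1, 1), (1, -1)],  # diagonal neighbors
        "bend":    [(2, 0), (0, 2)],   # skip-one neighbors
    }
    springs = {kind: [] for kind in offsets}
    for i in range(n):
        for j in range(n):
            for kind, offs in offsets.items():
                for di, dj in offs:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        # Store 1D particle ids, as the simulation does.
                        springs[kind].append((i * n + j, ni * n + nj))
    return springs
```

<p>For a 5x5 patch this yields 40 stretch, 32 shear, and 30 bend springs, and every interior particle touches the full 12 springs mentioned later.</p>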
<p>
Beyond understanding the basic structure of the system, the particles do not actually need to be referenced in a 2D manner; referring to the set of particles simply as a 1D array allows for a more generic and streamlined system. What does benefit from a 2D arrangement, however, is the collection of springs connecting those particles, because one spring can theoretically connect any two particles in the system. As such, the spring collection is laid out as a 2D matrix with dimensions <em>n</em> by <em>n</em>, where <em>n</em> is the total particle count. This results in a sparse matrix in which each occupied element stores the information for the spring connecting two particular particles. 
</p>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/sparse matrix.png" /></p>
<p>
An example is seen in the figure above, 
which displays the sparse spring matrix for a mass-spring system with 25 particles (a 5x5 patch). To 
prevent clutter, the filled elements are only showing one piece of data (the rest length of the spring). There 
are a few key features of this sparse matrix that are important to note. 
</p>
<p>
One feature is that the matrix is diagonally symmetric. This follows from the aforementioned facts that the matrix's dimensions are the total particle count in each dimension, and that a spring connects two particles: if a particle with ID <em>x</em> is connected to a particle with ID <em>y</em>, the same spring is located at both (<em>x</em>, <em>y</em>) and (<em>y</em>, <em>x</em>). 
</p>
<p>
The most important feature to note builds off of the previous feature. Every spring that connects 
to a particle can be found in its respective row or column. That means when calculating the spring forces 
for each particle, only one row (or column) needs to be accessed to find the total spring force directly 
affecting the particle. 
</p>
<p>
However, the sparse matrix is still very wasteful to use directly in computation. With how the springs and mesh are defined, there is a maximum of 12 springs per particle, but there may be tens of thousands of particles, meaning only 12 elements out of a row of tens of thousands are actually needed for computation. The maximum spring count could differ somewhat if a more irregular mesh were used, or if different springs (like sewing springs) were added to the calculation, but it will still be significantly less than the total number of particles, especially at the sizes considered for GPU utilization. 
</p>
<p>
Considering how sparse the matrix is, it is wise to use a sparse matrix storage format. Because of the diagonal nature of the matrix, the first thought may be the DIA format, but it would not be beneficial here. The elements lie fairly close to the diagonal when the particle count is low, but they spread out as the particle count increases, which is not ideal for DIA. Accessing per diagonal would also turn the calculation into a scatter implementation rather than a gather implementation. To keep the main calculation as generic as possible (allowing for any piece, or pieces, of cloth as long as some spring matrix is defined for them), a compressed sparse row (CSR) format is used. With CSR, a single compressed row can be used per particle in the calculation of spring forces. 
</p>
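<p>As a small illustration of that layout (my own sketch, not the project's CUDA code), packing a symmetric spring matrix into CSR and gathering one particle's row:</p>

```python
def to_csr(springs, n):
    """Pack a sparse n-by-n spring matrix {(row, col): rest_length} into CSR."""
    indptr, indices, data = [0], [], []
    for row in range(n):
        for col in sorted(c for (r, c) in springs if r == row):
            indices.append(col)
            data.append(springs[(row, col)])
        indptr.append(len(indices))
    return indptr, indices, data

def attached_springs(indptr, indices, data, p):
    """All springs affecting particle p live in one contiguous row slice."""
    lo, hi = indptr[p], indptr[p + 1]
    return list(zip(indices[lo:hi], data[lo:hi]))
```

<p>Each thread (or loop iteration) reads exactly one <code>indptr[p]:indptr[p+1]</code> slice: a gather, with no writes outside the particle it owns.</p>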
<p>
The main spring system is now fully defined, but there are also some extra features including 
colliders, external forces, and pinning. Most are either single constants or small arrays, so they are not 
worth being thoroughly discussed in this section. 
</p>
<h2>Calculation</h2>
<p>
With the structure of the program defined and initialized, the calculation of the simulation can begin. The calculation is based on explicit integration, which presumes the velocity and acceleration of each particle are constant during a time step and uses those values to solve for the next time step. This is an intuitive and fast way to calculate the forces in a simulation, but it is prone to instability if the time step is too large. A goal of 30 fps alone would result in far too large a time step, so the simulation runs several times per frame: at the chosen simulation rate of 1000 Hz (a 1 ms time step), the simulation runs 34 times per frame, with the last time step of each frame shortened to compensate for the fractional step. 
</p>
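<p>That per-frame bookkeeping amounts to splitting each frame into fixed substeps plus a remainder; a sketch:</p>

```python
def substeps(frame_dt=1.0 / 30.0, sim_dt=1.0 / 1000.0):
    """Split one frame into fixed simulation steps plus a short final step."""
    steps = []
    remaining = frame_dt
    while remaining > 0.0:
        steps.append(min(sim_dt, remaining))
        remaining -= steps[-1]
    return steps
```

<p>With a 1 ms step, a 30 fps frame gets 33 full steps plus one ~0.33 ms remainder, i.e. the 34 runs per frame mentioned above.</p>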
<p>
In each time step, a loop runs through every particle in the simulation (this loop is replaced by a kernel in the GPU implementation), so one iteration of the loop computes the next time step for one particle. The data for the particle of interest is copied to a local variable (since the particle data is considered constant during the time step). The local particle (<em>p</em>) and a local variable tracking the total force on that particle (<em>f</em>) are the two focus variables. The calculations performed on those variables are as follows: 
</p>

<ol>
  <li>Loop through the corresponding row of the CSR and calculate the spring and damping force (per element in that CSR row)
    <ul>
      <li>Spring force using Hooke’s Law</li>
      <li>Damping force: force applied to oppose a fraction of the motion</li>
    </ul>
  </li>
  <li>Add wind/gravity force vector</li>
  <li>Check collision with objects
    <ul>
      <li>Proximity check (per object)</li>
      <li>If colliding, cancel out the velocity and force component in the direction of the object normal</li>
    </ul>
  </li>
  <li>Check collision with the floor
    <ul>
      <li>Proximity check</li>
      <li>If colliding, cancel out the velocity and force in the z direction</li>
    </ul>
  </li>
  <li>Check collision with other particles
    <ul>
      <li>Proximity check (per particle)</li>
      <li>If colliding, apply elastic collision</li>
    </ul>
  </li>
  <li>Use the new force and velocity to find the velocity and position for the next time step and assign it to the local particle.</li>
  <li>Set the corresponding return array element to the updated local particle</li>
</ol>
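<p>Steps 1, 2, and 6 for a single particle can be sketched as follows. The spring constant, damping factor, and mass are illustrative values, and the real implementation runs this body per thread on the GPU.</p>

```python
import math

def step_particle(pos, vel, springs, dt, k=500.0, damp=0.5,
                  gravity=(0.0, 0.0, -9.81), mass=0.01):
    """One explicit time step for one particle; springs = [(pos, vel, rest), ...]."""
    force = [mass * g for g in gravity]            # step 2: external forces
    for q_pos, q_vel, rest in springs:             # step 1: one CSR row
        delta = [b - a for a, b in zip(pos, q_pos)]
        dist = math.sqrt(sum(d * d for d in delta))
        unit = [d / dist for d in delta]
        stretch = k * (dist - rest)                # Hooke's law
        # Damping opposes the relative velocity along the spring axis.
        rel = sum((qv - pv) * u for pv, qv, u in zip(vel, q_vel, unit))
        for a in range(3):
            force[a] += (stretch + damp * rel) * unit[a]
    # Step 6: integrate (new velocity first, then position from it).
    new_vel = [v + f / mass * dt for v, f in zip(vel, force)]
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]
    return new_pos, new_vel
```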

<h2>Suitability for GPU Acceleration</h2>
<p>
For the GPU implementations, the main focus was the real-time aspect of the code. The creation of the CSR involves a lot of divergence, is mostly sequential, and only runs once for a short period of time, so it would be a poor fit for the GPU. On the other hand, everything that is done per frame (aside from launching the kernels the correct number of times per frame) can be transferred onto the GPU and stay there. 
</p>
<p>
The explicit integration implementation used for this project is well suited to GPU acceleration because, as long as the time step is suitably low, the calculation for a particle's new data relies only on its own current data and the current data of the particles directly connected to it. Thus, the spring forces can be gathered from the connected particles and assigned to an individual particle per thread. Also, since only one particle is written per thread, and the particle array is a 1D array that can grow rather quickly with more detailed pieces of cloth, the blocks on the GPU will be filled and occupied entirely, except for the very last block of the array.
</p>
<h2>GPU Implementations</h2>
<p>
There were a number of iterative implementations of the GPU version of the code, all of which are described in this section. 
</p>
<p>
In the initial naive implementation, the code was mostly changed only enough to run on the GPU. Each thread was assigned a particle, and the updated particle was written to a separate but identical buffer at the end of the kernel. To prevent race conditions, the particle at the current index was copied to a register. The particle data was the only non-local data that strictly required this for the kernel to run properly, but any global data read multiple times within the kernel was also copied to a register. 
</p>
<p>
The next implementation was a small change to see how constants would benefit the execution. 
Every argument that was not being written to was given the const keyword. This was not expected to 
make a huge difference since the only variables able to fit in constant memory were scalar variables and 
the fairly small object array. 
</p>
<p>
The next implementation was focused on seeing how optimizing the math during the computation 
could affect performance. There were a couple of small changes made, but the one expected to make the 
largest difference was the switch from using distance to the squared distance to determine proximity for 
collisions. The square root operation is expensive, so omitting it and squaring the threshold value would 
result in the same comparison being made. 
</p>
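<p>The squared-distance trick is worth showing because it is so small: for a non-negative radius, <code>d &lt; r</code> and <code>d*d &lt; r*r</code> are the same test, so the square root can simply be dropped.</p>

```python
import math

def colliding(p, center, radius):
    """Sphere proximity check with no square root: compare squared distances."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, center))
    return d2 < radius * radius
```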
<p>
It was unclear if this next implementation would cause a speedup, but it was worth trying. As 
mentioned in the calculation, there are two particle buffers being passed in: the input and the result 
buffers. After the kernel is done running, the result buffer is copied back to the input buffer so the next 
time step has the updated data for the particles. This implementation changed the CPU logic for running 
each kernel per time step. Rather than there being explicitly one input buffer and one result buffer, the two 
buffers swapped roles each time step. For example, after a single time step, the result buffer is holding the 
updated information while the input buffer is now outdated. For the kernel in the next time step, the result 
buffer is passed as the input buffer and the outdated input buffer is passed as the result buffer so the old 
data can be overwritten. This avoids the need to copy back the buffer but adds some CPU logic. 
</p>
<p>
The next implementation was a big shift for the self-collision section of the code. Rather than 
checking the current particle against every other particle for a potential collision, binning was used to 
limit the number of comparisons to only those close to the current particle. This in and of itself changes 
the scaling of the problem since the speed of the self-collision is no longer directly dependent on the total 
particle count. In this step, the bin update kernel was kept naive and foolproof to ensure the binning was working properly: each thread was assigned a bin and checked against all particles to find those contained inside it. To avoid unnecessary calculation, the update kernel could be launched once per frame rather than every time step, because the neighboring particles were unlikely to change significantly between time steps. 
</p>
<p>
The next and final implementation was focused on improving the efficiency of the bin update 
kernel. Rather than the bin checking against all particles, only the particles in the neighboring bins (a 
3x3x3 area surrounding the current bin) were considered. It is very unlikely that a particle will skip an 
entire bin over the course of a frame. The simulation would break before that could happen. So this is a 
way to significantly reduce the number of particles checked, and prevent the bin update kernel from 
scaling based on the total particle count. 
</p>
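<p>Both binning stages, assigning particles to bins and restricting checks to the surrounding 3x3x3 block, can be sketched with a hash map of cells. The actual kernels use flat GPU arrays, so this captures only the logic, not the memory layout.</p>

```python
def bin_index(pos, cell_size):
    """Map a 3D position to its uniform-grid bin coordinate."""
    return tuple(int(c // cell_size) for c in pos)

def build_bins(positions, cell_size):
    """Assign every particle id to the bin containing it."""
    bins = {}
    for pid, pos in enumerate(positions):
        bins.setdefault(bin_index(pos, cell_size), []).append(pid)
    return bins

def nearby(pid, positions, bins, cell_size):
    """Candidate collision partners from the 3x3x3 block of surrounding bins."""
    bi, bj, bk = bin_index(positions[pid], cell_size)
    candidates = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for dk in (-1, 0, 1):
                for q in bins.get((bi + di, bj + dj, bk + dk), ()):
                    if q != pid:
                        candidates.append(q)
    return candidates
```

<p>Collision work now scales with local particle density rather than total particle count.</p>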
<h2>Results</h2>
<video style="float:center;" height="250px" autoplay="" controls="" muted="" loop="" playsinline="">
    <source src="/videos/Cloth Sim Video.mp4" type="video/mp4" />
    Your browser does not support the video tag.
  </video>

<h3 id="speedup-over-cpu-at-256k-particles">Speedup Over CPU at 25.6k particles</h3>
<p>Performance metrics were taken with an AMD Ryzen 7 3750H and an NVIDIA GTX 1660 Ti<br />
<em>For a more recent benchmark, the “Binned Neighbors” implementation runs at ~5 ms frame time (~200 fps) using an RTX 4070</em></p>
<hr />

<p><img src="/images/performance graph.png" width="500px" /></p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Speedup over CPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CPU</td>
      <td>1.00x</td>
    </tr>
    <tr>
      <td>Naive GPU</td>
      <td>108.51x</td>
    </tr>
    <tr>
      <td>Constants</td>
      <td>110.88x</td>
    </tr>
    <tr>
      <td>Math</td>
      <td>154.90x</td>
    </tr>
    <tr>
      <td>Buffer</td>
      <td>154.68x</td>
    </tr>
    <tr>
      <td>Binned Refresh</td>
      <td>156.70x</td>
    </tr>
    <tr>
      <td>Binned Neighbors</td>
      <td>1,635.58x</td>
    </tr>
  </tbody>
</table>

<hr />

<p>
A series of different-sized meshes was used to test each implementation at various particle levels. The sizes 5x5, 10x10, 20x20, 40x40, 80x80, 120x120, and 160x160 result in particle counts of 25, 100, 400, 1600, 6400, 14400, and 25600, respectively. The graph displays the averaged frame time for each implementation as the particle count increases; log scales are used for a clearer comparison. As expected, the CPU has better utilization at very low particle counts, where the GPU is underutilized in all of the implementations. However, as the particle count increases, the GPU implementations easily beat the CPU implementation. 
</p>

<h2>References</h2>
<div>Stuyck, Tuur. <em>Cloth Simulation for Computer Graphics</em>. Morgan &amp; Claypool Publishers, 2018.</div>
<div>Shiraishi. (2015). <em>simpleGL</em>. Retrieved from 
https://github.com/zchee/cuda-sample/blob/master/2_Graphics/simpleGL</div>]]></content><author><name></name></author><summary type="html"><![CDATA[A CUDA-based GPU cloth simulation (mass-spring system) with self collision, object collision, particle pinning, gravity, and wind forces. The goal of this project is developing and optimizing a cloth simulation to be able to adequately run in real-time by utilizing GPU computation.]]></summary></entry></feed>