<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://jred1.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://jred1.github.io/" rel="alternate" type="text/html" /><updated>2025-11-05T22:57:58+00:00</updated><id>https://jred1.github.io/feed.xml</id><title type="html">Jared Cascino</title><subtitle>Software Engineer</subtitle><entry><title type="html">Animatable Nerf Preparation</title><link href="https://jred1.github.io/Ani-Nerf/" rel="alternate" type="text/html" title="Animatable Nerf Preparation" /><published>2025-10-25T00:00:00+00:00</published><updated>2025-10-25T00:00:00+00:00</updated><id>https://jred1.github.io/Ani-Nerf</id><content type="html" xml:base="https://jred1.github.io/Ani-Nerf/"><![CDATA[<p><img style="margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D.gif" /></p>
<div class="clear">
A framework for developing and comparing volume deformation models, visualized with a raymarching camera. These models can then be translated for use in real-time animatable NeRFs.
</div>
<h2>The Problem</h2>
<p>I want to decouple NeRF rendering from deformation to create a framework for animatable NeRFs. This would allow training a model on arbitrary volume deformations, rather than relying on input videos or physics simulations, to view the NeRF as if it were deformed. Implementation requires addressing several factors within the render pipeline.</p>
<h3>Rendering a NeRF</h3>
<p>The main step in rendering a NeRF is raymarching: cast rays from the camera into a field, sample the NeRF model along those rays, and composite the samples into a final image. To deform the NeRF, I represent the scene with two separate volumes: a "deformed" volume, where the camera resides, and a "canonical" volume, where the static NeRF resides. The camera casts straight rays through the deformed volume, since the deformed NeRF is what I want to view at a given instant of the animation. Then, to get the density and color of the NeRF at a particular ray sample in the deformed volume, I need a model or method that correlates that sample back to a point in the canonical volume. I will refer to this correlation as the "inverse deformation". In essence, I map straight rays in the deformed volume to learned, curved rays within the canonical volume. These curved rays sample the static NeRF, relaying its data back to the corresponding camera samples in order to render a deformed NeRF.</p>
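<p>As a heavily simplified sketch of that pipeline, compositing one ray might look like the following. Here <code>inverse_deform</code> stands in for whatever learned model maps deformed-volume samples back to the canonical volume, and <code>nerf</code> for the canonical radiance field; both are hypothetical placeholders, not the project's actual interfaces.</p>

```python
import math

def render_ray(origin, direction, nerf, inverse_deform, n_samples=64, t_far=4.0):
    """Composite one straight camera ray cast through the deformed volume."""
    dt = t_far / n_samples
    color, transmittance = 0.0, 1.0
    for k in range(n_samples):
        t = (k + 0.5) * dt
        # Straight sample in the deformed volume, where the camera resides...
        p_deformed = tuple(o + t * d for o, d in zip(origin, direction))
        # ...mapped back to the canonical volume, where the static NeRF resides.
        p_canonical = inverse_deform(p_deformed)
        density, radiance = nerf(p_canonical)
        alpha = 1.0 - math.exp(-density * dt)
        color += transmittance * alpha * radiance
        transmittance *= 1.0 - alpha
    return color
```

<p>The deformation is hidden entirely inside <code>inverse_deform</code>; everything else is the standard volume rendering quadrature.</p>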
<p>The inverse deformation problem is fundamentally simple to understand; however, it gets substantially more complex when real-time performance is a factor. Two key features behind the real-time performance of static NeRFs are found in InstantNGP's implementation: a faster model (compared to previous NeRF implementations), and a volumetric "mask" in the form of a binary voxel-based occupancy grid (which determines whether the NeRF occupies a particular region of space). The first feature is straightforward to translate to a non-static NeRF, but the occupancy grid is where things could get out of hand.</p>

<h3>Where is the NeRF?</h3>
<p>Obviously, an occupancy grid works fine while the NeRF is static, but it fails to function once the NeRF deforms, as the NeRF is no longer where the occupancy grid expects it to be. The inverse deformation model could be sampled at every step until a ray reaches the static occupancy grid in the canonical volume, but this largely forfeits the speed-up of using the occupancy grid as a sampling mask. So why not just create another occupancy grid for the deformed volume? Caching occupancies for all deformed frames can certainly work for a single deformation, or even a short time-tracked animation: define a frame index to render, raymarch with that frame's deformed occupancy grid, and only sample the inverse deformation model and the NeRF when rays pass through the deformed grid. But what about expanding past fixed-frame-count animations, such as animating based on a combination of parameters? Anything from pose animation with a skeleton, to face animation via blendshapes, to secondary effects such as muscle or hair simulation driven by those stricter parameters. A 128^3 binary occupancy grid may only be a fraction of a megabyte, but pre-calculating and storing millions of them to cover every scenario would be a poor use of compute and memory.</p>
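<p>For scale, the bitfield arithmetic behind that size figure, plus the kind of packed occupancy lookup it implies, can be sketched as follows. The class layout is my own illustration of the idea, not InstantNGP's actual data structure.</p>

```python
class OccupancyGrid:
    """A binary voxel occupancy grid packed one bit per voxel."""

    def __init__(self, res=128):
        self.res = res
        self.cells = bytearray(res ** 3 // 8)  # 128^3 bits = 262144 bytes (~0.26 MB)

    def _index(self, x, y, z):
        # x, y, z are normalized to [0, 1) over the volume's bounding box.
        r = self.res
        return (int(x * r) * r + int(y * r)) * r + int(z * r)

    def set(self, x, y, z):
        i = self._index(x, y, z)
        self.cells[i // 8] |= 1 << (i % 8)

    def occupied(self, x, y, z):
        """Sample mask: skip NeRF evaluation entirely when this is False."""
        i = self._index(x, y, z)
        return bool(self.cells[i // 8] >> (i % 8) & 1)
```

<p>A single grid is cheap; the problem described above is needing millions of them.</p>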
<p>So the problem becomes: how can this be avoided? One strategy is to use a simple, direct point-to-point model for the inverse deformation, while a separate, more complex model predicts the required occupancy grid in real time. That is an interesting approach I would like to test in the future, but this project takes a different route.</p>

<h2>Implementation of the Inverse Deformation Model</h2>
<p>
There are two aspects I want to preserve in a real-time-focused implementation of an animatable NeRF: keep an occupancy grid for sample masking, and avoid multiple inverse deformation model samples per ray before the occupancy grid is reached. In other words, I want the full deformed ray from a single pass of the inverse deformation model, allowing much quicker sampling before the occupancy grid is reached, with no round trips between the CPU and GPU after each inverse deformation sample to decide which points should start sampling the NeRF model and which are still in empty space. Knowing that the ray sampled in the canonical space is a warped/curved version of the ray in the deformed volume, I figured replicating that curve with parametric curves was the way to go.
</p>
<h3>Model Progression</h3>
<p>
Before hopping directly into a parametric curve implementation, I worked my way through progressively more complex/abstract models. In order of testing, they are:
</p>

<h4>1. Direct Point Correlation Model</h4>
<ul>
  <li>For any point (<em>x</em>, <em>y</em>) in the deformed volume, output the correlated point (<em>x’</em>, <em>y’</em>) in the canonical volume</li>
</ul>
<h4>2. Point Offsets Model</h4>
<ul>
  <li>For any point (<em>x</em>, <em>y</em>) in the deformed volume, output the offsets (<em>i</em>, <em>j</em>) that make (<em>x + i</em>, <em>y + j</em>) correlate to the correct point in the canonical volume</li>
</ul>
<h4>3. Ray Origin Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>) and distance <em>t</em> along the ray, output the offsets (<em>i</em>, <em>j</em>) that make (<em>x + i</em>, <em>y + j</em>) correlate to the correct point in the canonical volume</li>
</ul>
<h4>4. Batch Ray Origin Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>), output an array of offsets &lt; <em>i</em>, <em>j</em> &gt; along the entire ray, such that each (<em>x + i</em>, <em>y + j</em>) at a given distance <em>t</em> along the ray correlates to the correct point in the canonical volume</li>
</ul>
<h4>5. Akima Spline Model</h4>
<ul>
  <li>For a given camera-ray origin (<em>x</em>, <em>y</em>) and ray angle, output a set of Akima spline knot parameters &lt; <em>i</em>, <em>j</em>, <em>t</em> &gt; to define parameterized offset splines for each axis: &lt; <em>i</em>, <em>t</em> &gt; for the <em>x</em> axis, and &lt; <em>j</em>, <em>t</em> &gt; for the <em>y</em> axis. The predicted <em>t</em> values are shared between the two axis splines. Use the <em>t</em> value for the current camera ray sample to evaluate the standard Akima spline functions and combine the result with the camera sample (<em>x</em>, <em>y</em>) to get the correlated point in the canonical volume</li>
</ul>
<h4>6. Entry/Exit Akima Spline Model</h4>
<ul>
  <li>For a given entry point into the deformed volume’s bounding box (<em>x_entry</em>, <em>y_entry</em>) and a given exit point of the deformed volume’s bounding box (<em>x_exit</em>, <em>y_exit</em>), output the same splines as the Akima Spline Model</li>
</ul>
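<p>The last two models hinge on evaluating an Akima spline from predicted knots. A minimal, dependency-free scalar version of that evaluation (my own sketch of the standard algorithm, not the project's code) looks like this:</p>

```python
import bisect

def akima_eval(xs, ys, x):
    """Evaluate an Akima spline through knots (xs, ys) at x (xs ascending, >= 3 knots)."""
    n = len(xs) - 1  # number of segments
    m = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n)]
    # Two phantom slopes on each end, per Akima's extrapolation rule.
    e = ([3 * m[0] - 2 * m[1], 2 * m[0] - m[1]] + m
         + [2 * m[-1] - m[-2], 3 * m[-1] - 2 * m[-2]])
    d = []  # derivative at each knot, weighted by neighboring slope changes
    for i in range(n + 1):
        w1, w2 = abs(e[i + 3] - e[i + 2]), abs(e[i + 1] - e[i])
        if w1 + w2 == 0:  # locally linear: fall back to the mean slope
            d.append((e[i + 1] + e[i + 2]) / 2)
        else:
            d.append((w1 * e[i + 1] + w2 * e[i + 2]) / (w1 + w2))
    # Cubic Hermite interpolation on the segment containing x.
    i = min(max(bisect.bisect_right(xs, x) - 1, 0), n - 1)
    h = xs[i + 1] - xs[i]
    s = (x - xs[i]) / h
    h00, h10 = 2 * s**3 - 3 * s**2 + 1, s**3 - 2 * s**2 + s
    h01, h11 = -2 * s**3 + 3 * s**2, s**3 - s**2
    return h00 * ys[i] + h * h10 * d[i] + h01 * ys[i + 1] + h * h11 * d[i + 1]
```

<p>In the 2D setup above, two such splines share the same predicted knot t values: one mapping t to the x-axis offset i, and one mapping t to the y-axis offset j.</p>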

<h2>Training Environment</h2>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D_2.gif" /></p>

<p>The majority of the values (deformation parameters, camera position/rotation/resolution) are configurable. The gif above shows how the ray count can be changed on the fly. In fact, since the inverse deformation model's input/output only considers one ray, the "resolution" can be changed without retraining the model.</p>
<p>The visualization for the environment is as follows:</p>

<ul>
  <li>The red area is the canonical space</li>
  <li>The green area is the deformed space</li>
  <li>The black dots represent the canonical object when in the canonical space and the correlated deformed points in the deformed space</li>
  <li>The yellow rays represent the camera rays</li>
  <li>The pink dots are the spline knots</li>
  <li>The red dots are the predicted values (calculated using the spline knots)</li>
  <li>The blue dots are the ground truth samples</li>
</ul>

<p>The training data is a jittered sample of camera positions along an orbiting path.</p>
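<p>That sampling scheme is simple to reproduce; a sketch (the parameter values here are illustrative, not the ones used in training):</p>

```python
import math, random

def orbit_camera_samples(n, radius=3.0, jitter=0.05, seed=0):
    """Camera positions on a circular orbit, jittered in angle and distance."""
    rng = random.Random(seed)
    samples = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n + rng.uniform(-jitter, jitter)
        r = radius * (1.0 + rng.uniform(-jitter, jitter))
        samples.append((r * math.cos(theta), r * math.sin(theta)))
    return samples
```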

<h2>Results</h2>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/ani_nerf_2D.gif" /></p>
<p>
The model does a decent job of tracking the deformation throughout a camera orbit, but there are some noticeable areas for improvement. Overall accuracy could of course improve, but there is also a "wobble" between very small camera movements. I have tried various loss-calculation improvements, such as accounting for ray smoothness, ray neighbor consistency, and ray reprojection straightness. They have not resulted in significant improvements for the orbital camera path alone, but more testing is needed to find the optimal weightings for these loss functions.
Also, there was no significant difference between using the camera ray origins and using the entry/exit points; however, I believe this is mostly due to the training data being limited to a mostly fixed-distance orbit around the object. Entry/exit points should help the model generalize to free camera movement.
</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A framework for developing and comparing volume deformation models, visualized with a raymarching camera. These models can then be translated to use for real-time animated NeRFs.]]></summary></entry><entry><title type="html">Optimized Centers of Rotation Skinning</title><link href="https://jred1.github.io/OptCoR/" rel="alternate" type="text/html" title="Optimized Centers of Rotation Skinning" /><published>2025-10-17T00:00:00+00:00</published><updated>2025-10-17T00:00:00+00:00</updated><id>https://jred1.github.io/OptCoR</id><content type="html" xml:base="https://jred1.github.io/OptCoR/"><![CDATA[<p><img style="width: 500px;margin: 10px;" src="/images/Opt CoR Comparison.gif" /></p>
<div class="clear">
Implementing the optimized centers of rotation skinning proposed by <a href="https://la.disneyresearch.com/publication/skinning-with-optimized-cors/">this</a> paper in Unity3D. This project includes custom implementations of linear blend skinning (LBS), dual quaternion skinning (DQS), and optimized centers of rotation skinning (Opt. CoR).
</div>

<h2 class="clear">Results</h2>
<p>There are two aspects of performance relevant to this code: pre-processing and real-time.</p>
<h3 id="pre-processing-times">Pre-processing times</h3>
<p>For pre-processing, the bone weights, bone IDs, and optimized CoRs are baked into the UVs of the mesh. The runtime of this preprocessing depends on the size of the mesh and the number of bones. However, by utilizing a compute shader for the compute-heavy task of finding the optimized CoR for each vertex, the entire baking process takes fractions of a second. In my testing, on my PC, baking a fully rigged character took 300-350 milliseconds on average.</p>
<h3 id="real-time-usage-and-bottleneck">Real-time usage and bottleneck</h3>
<p>In a realistic setting, meaning a realistic number of characters with close or medium distance LODs, this alternate skinning is perfectly viable. For example, even with hundreds of characters using Opt. CoR skinning, the game still runs at hundreds of fps. However, compared to the built-in skinning, there certainly is a drop in performance. This is due to a bottleneck created from needing to transfer the bone transform data from the CPU to the GPU every frame. For a future implementation, I would like to explore the use of ECS to alleviate this bottleneck.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Implementing the optimized centers of rotation skinning proposed by this paper in Unity3D. This project includes custom implementations of linear blend skinning (LBS), dual quaternion skinning (DQS), and optimized centers of rotation skinning (Opt. CoR).]]></summary></entry><entry><title type="html">CUDA Cloth Simulation</title><link href="https://jred1.github.io/Cuda-Cloth/" rel="alternate" type="text/html" title="CUDA Cloth Simulation" /><published>2025-10-04T00:00:00+00:00</published><updated>2025-10-04T00:00:00+00:00</updated><id>https://jred1.github.io/Cuda-Cloth</id><content type="html" xml:base="https://jred1.github.io/Cuda-Cloth/"><![CDATA[<video height="250px" autoplay="" controls="" muted="" loop="" playsinline="">
    <source src="/videos/Cloth Sim Video.mp4" type="video/mp4" />
    Your browser does not support the video tag.
  </video>
<div class="clear">
A CUDA-based GPU cloth simulation (mass-spring system) with self collision, object collision, particle pinning, gravity, and wind forces. The goal of this project is to develop and optimize a cloth simulation so that it runs adequately in real time by utilizing GPU computation.
</div>
<h2 class="clear">Structure</h2>
<p>The cloth simulation is built on a mass-spring system: a collection of particles (point masses) connected by a series of springs. The specifics of how the springs connect to the masses are what allow the system to represent a particular type of simulation. In the case of a cloth simulation, the masses represent the vertices of a cloth mesh. To focus on real-time performance rather than on building a mass-spring system for arbitrary meshes, a simple subdivided plane was used. This simplification allows for more directly scalable computation and is still representative, since the subdivided plane can be seen as a patch of a more complex mesh: the majority of any piece of cloth will reflect the properties of the patch because of how the springs connect to each mass.</p>
<p>
<img style="float:right;height: 150px;margin: 10x 0px 10px 10px;" src="/images/spring connections.png" />
The layout of the springs is portrayed in the figure to the right. Each 
dot represents neighboring particles, and the hollow dot in the center 
represents the particle of interest. If the particle of interest is located at 
coordinates (  i  ,  j  ) in a 2D arrangement of particles,  the springs attached to it 
are as follows:
</p>
<p class="clear">
<ul>
<li>stretch springs (blue): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+1, <em>j</em>), (<em>i</em>-1, <em>j</em>), (<em>i</em>, <em>j</em>+1), and (<em>i</em>, <em>j</em>-1) </li>
<li>shear springs (green): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+1, <em>j</em>+1), (<em>i</em>-1, <em>j</em>+1), (<em>i</em>-1, <em>j</em>+1), and (<em>i</em>-1, <em>j</em>-1)</li>
<li>bend springs (red): (<em>i</em>, <em>j</em>) connects to each (<em>i</em>+2, <em>j</em>), (<em>i</em>-2, <em>j</em>), (<em>i</em>, <em>j</em>+2), and (<em>i</em>, <em>j</em>-2)</li>
</ul>
</p>
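<p>The three spring families can be generated mechanically from those offsets. A sketch that emits each spring exactly once (by walking only the positive directions):</p>

```python
def grid_springs(n):
    """Unique stretch/shear/bend spring pairs for an n-by-n particle grid."""
    offsets = {
        "stretch": [(1, 0), (0, 1)],   # axis-aligned neighbors
        "shear":   [(1, 1), (1, -1)],  # diagonal neighbors
        "bend":    [(2, 0), (0, 2)],   # skip-one neighbors
    }
    springs = {kind: [] for kind in offsets}
    for i in range(n):
        for j in range(n):
            for kind, offs in offsets.items():
                for di, dj in offs:
                    ni, nj = i + di, j + dj
                    if 0 <= ni < n and 0 <= nj < n:
                        # Store 1D particle ids, as the simulation does.
                        springs[kind].append((i * n + j, ni * n + nj))
    return springs
```

<p>For a 5x5 patch this yields 40 stretch, 32 shear, and 30 bend springs, and every interior particle touches the full 12 springs mentioned later.</p>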
<p>
Beyond understanding the basic structure of the system, the particles do not actually need to be referenced in a 2D manner; referring to the set of particles simply as a 1D array allows for a more generic and streamlined system. What does benefit from a 2D arrangement, however, is the collection of springs connecting those particles, because one spring can theoretically connect any two particles in the system. As such, the spring collection is laid out as a 2D matrix with dimensions <em>n</em> by <em>n</em>, where <em>n</em> is the total particle count. This results in a sparse matrix in which each occupied element stores the information for the spring connecting two particular particles. 
</p>
<p><img style="float:center;margin: 10x 0px 10px 0px;" src="/images/sparse matrix.png" /></p>
<p>
An example is seen in the figure above, 
which displays the sparse spring matrix for a mass-spring system with 25 particles (a 5x5 patch). To 
prevent clutter, the filled elements are only showing one piece of data (the rest length of the spring). There 
are a few key features of this sparse matrix that are important to note. 
</p>
<p>
One feature is that the matrix is diagonally symmetric. This follows from the aforementioned facts that the matrix's dimensions are the total particle count in each dimension, and that a spring connects two particles: if a particle with ID <em>x</em> is connected to a particle with ID <em>y</em>, the same spring is located at both (<em>x</em>, <em>y</em>) and (<em>y</em>, <em>x</em>). 
</p>
<p>
The most important feature to note builds off of the previous feature. Every spring that connects 
to a particle can be found in its respective row or column. That means when calculating the spring forces 
for each particle, only one row (or column) needs to be accessed to find the total spring force directly 
affecting the particle. 
</p>
<p>
However, the sparse matrix is still very wasteful to use directly in computation. With how the springs and mesh are defined, there is a maximum of 12 springs per particle, but there may be tens of thousands of particles, meaning only 12 elements out of a row of tens of thousands are actually needed for computation. The maximum spring count could differ somewhat if a more irregular mesh were used, or if different springs (like sewing springs) were added to the calculation, but it will still be significantly less than the total number of particles, especially at the sizes considered for GPU utilization. 
</p>
<p>
Considering how sparse the matrix is, it is wise to use a sparse matrix storage format. Because of the diagonal nature of the matrix, the first thought may be the DIA format, but it would not be beneficial here. The elements lie fairly close to the diagonal when the particle count is low, but they spread out as the particle count increases, which is not ideal for DIA. Accessing per diagonal would also turn the calculation into a scatter implementation rather than a gather implementation. To keep the main calculation as generic as possible (allowing for any piece, or pieces, of cloth as long as some spring matrix is defined for them), a compressed sparse row (CSR) format is used. With CSR, a single compressed row can be used per particle in the calculation of spring forces. 
</p>
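<p>As a small illustration of that layout (my own sketch, not the project's CUDA code), packing a symmetric spring matrix into CSR and gathering one particle's row:</p>

```python
def to_csr(springs, n):
    """Pack a sparse n-by-n spring matrix {(row, col): rest_length} into CSR."""
    indptr, indices, data = [0], [], []
    for row in range(n):
        for col in sorted(c for (r, c) in springs if r == row):
            indices.append(col)
            data.append(springs[(row, col)])
        indptr.append(len(indices))
    return indptr, indices, data

def attached_springs(indptr, indices, data, p):
    """All springs affecting particle p live in one contiguous row slice."""
    lo, hi = indptr[p], indptr[p + 1]
    return list(zip(indices[lo:hi], data[lo:hi]))
```

<p>Each thread (or loop iteration) reads exactly one <code>indptr[p]:indptr[p+1]</code> slice: a gather, with no writes outside the particle it owns.</p>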
<p>
The main spring system is now fully defined, but there are also some extra features including 
colliders, external forces, and pinning. Most are either single constants or small arrays, so they are not 
worth being thoroughly discussed in this section. 
</p>
<h2>Calculation</h2>
<p>
With the structure of the program defined and initialized, the calculation of the simulation can begin. The calculation is based on explicit integration, which presumes the velocity and acceleration of each particle are constant during a time step and uses those values to solve for the next time step. This is an intuitive and fast way to calculate the forces in a simulation, but it is prone to instability if the time step is too large. A goal of 30 fps alone would result in far too large a time step, so the simulation runs several times per frame: at the chosen simulation rate of 1000 Hz (a 1 ms time step), the simulation runs 34 times per frame, with the last time step of each frame shortened to compensate for the fractional step. 
</p>
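<p>That per-frame bookkeeping amounts to splitting each frame into fixed substeps plus a remainder; a sketch:</p>

```python
def substeps(frame_dt=1.0 / 30.0, sim_dt=1.0 / 1000.0):
    """Split one frame into fixed simulation steps plus a short final step."""
    steps = []
    remaining = frame_dt
    while remaining > 0.0:
        steps.append(min(sim_dt, remaining))
        remaining -= steps[-1]
    return steps
```

<p>With a 1 ms step, a 30 fps frame gets 33 full steps plus one ~0.33 ms remainder, i.e. the 34 runs per frame mentioned above.</p>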
<p>
In each time step, a loop runs through every particle in the simulation (this loop is replaced by a kernel in the GPU implementation), so one iteration of the loop computes the next time step for one particle. The data for the particle of interest is copied to a local variable (since the particle data is considered constant during the time step). The local particle (<em>p</em>) and a local variable tracking the total force on that particle (<em>f</em>) are the two focus variables. The calculations performed on those variables are as follows: 
</p>

<ol>
  <li>Loop through the corresponding row of the CSR and calculate the spring and damping force (per element in that CSR row)
    <ul>
      <li>Spring force using Hooke’s Law</li>
      <li>Damping force: force applied to oppose a fraction of the motion</li>
    </ul>
  </li>
  <li>Add wind/gravity force vector</li>
  <li>Check collision with objects
    <ul>
      <li>Proximity check (per object)</li>
      <li>If colliding, cancel out the velocity and force component in the direction of the object normal</li>
    </ul>
  </li>
  <li>Check collision with the floor
    <ul>
      <li>Proximity check</li>
      <li>If colliding, cancel out the velocity and force in the z direction</li>
    </ul>
  </li>
  <li>Check collision with other particles
    <ul>
      <li>Proximity check (per particle)</li>
      <li>If colliding, apply elastic collision</li>
    </ul>
  </li>
  <li>Use the new force and velocity to find the velocity and position for the next time step and assign it to the local particle.</li>
  <li>Set the corresponding return array element to the updated local particle</li>
</ol>
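<p>Steps 1, 2, and 6 for a single particle can be sketched as follows. The spring constant, damping factor, and mass are illustrative values, and the real implementation runs this body per thread on the GPU.</p>

```python
import math

def step_particle(pos, vel, springs, dt, k=500.0, damp=0.5,
                  gravity=(0.0, 0.0, -9.81), mass=0.01):
    """One explicit time step for one particle; springs = [(pos, vel, rest), ...]."""
    force = [mass * g for g in gravity]            # step 2: external forces
    for q_pos, q_vel, rest in springs:             # step 1: one CSR row
        delta = [b - a for a, b in zip(pos, q_pos)]
        dist = math.sqrt(sum(d * d for d in delta))
        unit = [d / dist for d in delta]
        stretch = k * (dist - rest)                # Hooke's law
        # Damping opposes the relative velocity along the spring axis.
        rel = sum((qv - pv) * u for pv, qv, u in zip(vel, q_vel, unit))
        for a in range(3):
            force[a] += (stretch + damp * rel) * unit[a]
    # Step 6: integrate (new velocity first, then position from it).
    new_vel = [v + f / mass * dt for v, f in zip(vel, force)]
    new_pos = [p + v * dt for p, v in zip(pos, new_vel)]
    return new_pos, new_vel
```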

<h2>Suitability for GPU Acceleration</h2>
<p>
For the GPU implementations, the main focus was the real-time aspect of the code. The creation of the CSR involves a lot of divergence, is mostly sequential, and only runs once for a short period of time, so it would be a poor fit for the GPU. On the other hand, everything that is done per frame (aside from launching the kernels the correct number of times per frame) can be transferred onto the GPU and stay there. 
</p>
<p>
The explicit integration implementation used for this project is well suited to GPU acceleration because, as long as the time step is suitably low, the calculation for a particle's new data relies only on its own current data and the current data of the particles directly connected to it. Thus, the spring forces can be gathered from the connected particles and assigned to an individual particle per thread. Also, since only one particle is written per thread, and the particle array is a 1D array that can grow rather quickly with more detailed pieces of cloth, the blocks on the GPU will be filled and occupied entirely, except for the very last block of the array.
</p>
<h2>GPU Implementations</h2>
<p>
There were a number of iterative implementations of the GPU version of the code, all of which are described in this section. 
</p>
<p>
In the initial naive implementation, the code was mostly changed only enough to run on the GPU. Each thread was assigned a particle, and the updated particle was written to a separate but identical buffer at the end of the kernel. To prevent race conditions, the particle at the current index was copied to a register. The particle data was the only non-local data that strictly required this for the kernel to run properly, but any global data read multiple times within the kernel was also copied to a register. 
</p>
<p>
The next implementation was a small change to see how constants would benefit the execution. 
Every argument that was not being written to was given the const keyword. This was not expected to 
make a huge difference since the only variables able to fit in constant memory were scalar variables and 
the fairly small object array. 
</p>
<p>
The next implementation was focused on seeing how optimizing the math during the computation 
could affect performance. There were a couple of small changes made, but the one expected to make the 
largest difference was the switch from using distance to the squared distance to determine proximity for 
collisions. The square root operation is expensive, so omitting it and squaring the threshold value would 
result in the same comparison being made. 
</p>
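<p>The squared-distance trick is worth showing because it is so small: for a non-negative radius, <code>d &lt; r</code> and <code>d*d &lt; r*r</code> are the same test, so the square root can simply be dropped.</p>

```python
import math

def colliding(p, center, radius):
    """Sphere proximity check with no square root: compare squared distances."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, center))
    return d2 < radius * radius
```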
<p>
It was unclear if this next implementation would cause a speedup, but it was worth trying. As 
mentioned in the calculation, there are two particle buffers being passed in: the input and the result 
buffers. After the kernel is done running, the result buffer is copied back to the input buffer so the next 
time step has the updated data for the particles. This implementation changed the CPU logic for running 
each kernel per time step. Rather than there being explicitly one input buffer and one result buffer, the two 
buffers swapped roles each time step. For example, after a single time step, the result buffer is holding the 
updated information while the input buffer is now outdated. For the kernel in the next time step, the result 
buffer is passed as the input buffer and the outdated input buffer is passed as the result buffer so the old 
data can be overwritten. This avoids the need to copy back the buffer but adds some CPU logic. 
</p>
<p>
The next implementation was a big shift for the self-collision section of the code. Rather than 
checking the current particle against every other particle for a potential collision, binning was used to 
limit the number of comparisons to only those close to the current particle. This in and of itself changes 
the scaling of the problem since the speed of the self-collision is no longer directly dependent on the total 
particle count. In this step, the bin update kernel was kept naive and foolproof to ensure the binning was working properly: each thread was assigned a bin and checked against all particles to find those contained inside it. To avoid unnecessary calculation, the update kernel could be launched once per frame rather than every time step, because the neighboring particles were unlikely to change significantly between time steps. 
</p>
<p>
The next and final implementation was focused on improving the efficiency of the bin update 
kernel. Rather than the bin checking against all particles, only the particles in the neighboring bins (a 
3x3x3 area surrounding the current bin) were considered. It is very unlikely that a particle will skip an 
entire bin over the course of a frame. The simulation would break before that could happen. So this is a 
way to significantly reduce the number of particles checked, and prevent the bin update kernel from 
scaling based on the total particle count. 
</p>
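<p>Both binning stages, assigning particles to bins and restricting checks to the surrounding 3x3x3 block, can be sketched with a hash map of cells. The actual kernels use flat GPU arrays, so this captures only the logic, not the memory layout.</p>

```python
def bin_index(pos, cell_size):
    """Map a 3D position to its uniform-grid bin coordinate."""
    return tuple(int(c // cell_size) for c in pos)

def build_bins(positions, cell_size):
    """Assign every particle id to the bin containing it."""
    bins = {}
    for pid, pos in enumerate(positions):
        bins.setdefault(bin_index(pos, cell_size), []).append(pid)
    return bins

def nearby(pid, positions, bins, cell_size):
    """Candidate collision partners from the 3x3x3 block of surrounding bins."""
    bi, bj, bk = bin_index(positions[pid], cell_size)
    candidates = []
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            for dk in (-1, 0, 1):
                for q in bins.get((bi + di, bj + dj, bk + dk), ()):
                    if q != pid:
                        candidates.append(q)
    return candidates
```

<p>Collision work now scales with local particle density rather than total particle count.</p>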
<h2>Results</h2>
<video style="float:center;" height="250px" autoplay="" controls="" muted="" loop="" playsinline="">
    <source src="/videos/Cloth Sim Video.mp4" type="video/mp4" />
    Your browser does not support the video tag.
  </video>

<h3 id="speedup-over-cpu-at-256k-particles">Speedup Over CPU at 25.6k particles</h3>
<p>Performance metrics were taken with an AMD Ryzen 7 3750H and an NVIDIA GTX 1660 Ti<br />
<em>For a more recent benchmark, the “Binned Neighbors” implementation runs at ~5 ms frame time (~200 fps) using an RTX 4070</em></p>
<hr />

<p><img src="/images/performance graph.png" width="500px" /></p>

<table>
  <thead>
    <tr>
      <th>Technique</th>
      <th>Speedup over CPU</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>CPU</td>
      <td>1.00x</td>
    </tr>
    <tr>
      <td>Naive GPU</td>
      <td>108.51x</td>
    </tr>
    <tr>
      <td>Constants</td>
      <td>110.88x</td>
    </tr>
    <tr>
      <td>Math</td>
      <td>154.90x</td>
    </tr>
    <tr>
      <td>Buffer</td>
      <td>154.68x</td>
    </tr>
    <tr>
      <td>Binned Refresh</td>
      <td>156.70x</td>
    </tr>
    <tr>
      <td>Binned Neighbors</td>
      <td>1,635.58x</td>
    </tr>
  </tbody>
</table>

<hr />

<p>
A series of different-sized meshes was used to test each implementation at various particle levels. The sizes 5x5, 10x10, 20x20, 40x40, 80x80, 120x120, and 160x160 result in particle counts of 25, 100, 400, 1600, 6400, 14400, and 25600, respectively. The graph displays the averaged frame time for each implementation as the particle count increases; log scales are used for a clearer comparison. As expected, the CPU has better utilization at very low particle counts, where the GPU is underutilized in all of the implementations. However, as the particle count increases, the GPU implementations easily beat the CPU implementation. 
</p>

<h2>References</h2>
<div>Stuyck, Tuur. <em>Cloth Simulation for Computer Graphics</em>. Morgan &amp; Claypool Publishers, 2018.</div>
<div>Shiraishi. (2015). <em>simpleGL</em>. Retrieved from 
https://github.com/zchee/cuda-sample/blob/master/2_Graphics/simpleGL</div>]]></content><author><name></name></author><summary type="html"><![CDATA[A CUDA-based GPU cloth simulation (mass-spring system) with self collision, object collision, particle pinning, gravity, and wind forces. The goal of this project is developing and optimizing a cloth simulation to be able to adequately run in real-time by utilizing GPU computation.]]></summary></entry></feed>