Can particle effects be created with CUDA?


I have been tasked at work with finding the best way to do multiple complex real-time particle effects to simulate explosions. I really know very little about CUDA at present, and I was wondering if this is the type thing that CUDA would be well-suited for doing. I haven’t really come across anything that would affirm or deny the concept, so I was hoping that someone could help me.


Manipulating particles is inherently very parallel, so I’d say CUDA is very suited to do it. Every thread would do a particle, for example.

Yes, CUDA is well suited to particle simulation; in fact, there will be a couple of examples of this in the next release of the SDK.

Definitely CUDA. My colleague got a 10x speed up after switching from geometry shader to CUDA.

Thanks for the replies.

Forgive me, but I’m still trying to switch gears, so to speak, from traditional OpenGL programming. For instance, wumpus states that every thread could do a particle. I’ve read the documentation and at least conceptually understand what is going on, but I am having trouble grasping how one would go about assigning a particle to a thread.

Also, when a thread is computing a particle’s position, is the thread acting on an existing primitive, or is it solely working with positional data? If the former, to what degree are primitives supported? If the latter, is the CUDA kernel responsible for doing all the 3D matrix operations to determine the particle’s position and/or whether it has been clipped? And if it is responsible for calculating this information, how is that information later translated into something (vertex list, vertex buffer object, etc.) that OpenGL can use?


Since CUDA is more or less just C, you frequently write code that operates on arrays. You could, for example, make an array of particle positions and trajectories. A really, really simple example would be:

#define THREAD_PER_BLOCK 128

__global__ void propagate(float delta_t, float3 *positions, float3 *velocities)
{
  int offset = blockIdx.x * THREAD_PER_BLOCK + threadIdx.x;

  // Copy particle properties from global memory
  float3 pos = positions[offset];
  float3 vel = velocities[offset];

  // Compute new position (you could overload the operators * and + to make this shorter)
  pos.x += delta_t * vel.x;
  pos.y += delta_t * vel.y;
  pos.z += delta_t * vel.z;

  // Store new position back to global memory
  positions[offset] = pos;
}
When you execute this kernel, the assignment of particles to particular threads is implicit in the calculation of offset. Each thread sees a different combination of blockIdx and threadIdx, which is how you can tell them apart and allow each one to operate on different data.

Unfortunately, this doesn’t quite answer your second question about OpenGL interoperability. Moving data between OpenGL and CUDA contexts without copying back through the host requires some initial setup. See the section in the programming guide for some description of that. (I don’t know anything about the OpenGL stuff…)

Thanks for the code example. Things are becoming clearer.

In regard to the code, obviously the position and velocity arrays would be of the same size. What is unclear to me is what size restrictions (if any) are placed on certain types of memory. I am assuming that the storage for the two arrays would be global memory. Is this memory allocated on the GPU, or is it host memory? If it is host memory, is there a large performance hit when passing it to the GPU?

Also, I am aware that when calling this function, the following format is used:

propagate <<< Dg, Db, Ns >>> (time, pPosition, pVelocity);

where Dg is the number of blocks, Db is the number of threads per block, and Ns is the number of bytes in shared memory that is dynamically allocated (typically 0). What isn’t clear is how these numbers are chosen.

If I had 5000 positions and velocities to be calculated, would I be correct in assuming that I would need

5000 / 128 threads per block

which equals 40 blocks after rounding up? Thus, the function would be written

propagate <<< 40, 128, 0 >>> (time, pPosition, pVelocity);

I’ve seen a lot mentioned in the documentation about bank conflicts. In the above scenario, would this cause many bank conflicts?


Global memory is on the device, and is pretty fast if you read it coalesced.
For the number of blocks and threads per block, yes: 40 blocks of 128 threads.
Bank conflicts only apply to shared memory, and you shouldn’t pay too much attention to them. Coalescing is orders of magnitude more important.

Be sure to note that Dg and Db are of type dim3 (which is a struct with x,y,z integer fields) and not just integers. You don’t need to use all three dimensions, of course, and can leave y=z=1 for a one-dimensional problem. (The dimensionality of the block and the grid is just an accounting convenience for you. Many problems naturally break up into 2D or 3D computations.)

Selecting the right size of Dg and Db is a tricky thing. The sizes are sometimes constrained by the algorithm you are trying to implement. Threads in the same block can collaborate via fast shared memory, whereas direct communication between blocks should be avoided.

There are some heuristics mentioned in the guide and other places:

  • You want the number of threads per block to be a multiple of 32, since this is the size of a “warp,” the unit of parallel execution on a multiprocessor.

  • If possible, making the number of threads per block be a multiple of 64 helps the instruction scheduler more.

  • You want as many blocks as there are multiprocessors on the GPU. For the GTX, this number is 16, but future products will have more.

  • If possible, you want many more blocks than just 16 to increase multiprocessor utilization. If the register and shared memory usage of your block is low enough (less than half of the available resources on a multiprocessor), then multiple blocks can execute on the same multiprocessor. This allows one block to do some arithmetic while another is waiting for reads from global memory, which can take 200-300 clock cycles.

But probably the best advice is to design your program so that the block and grid size are adjustable constants so you can benchmark different configurations.

I no doubt have a lot to learn regarding the most efficient methods of writing a CUDA program, but I want to thank you for all the help. A lot of things have been cleared up for me.

One thing that I’ve read which isn’t made entirely clear in the documentation is this concept of “coalesced memory.” The docs refer to making sure that memory access is coalesced for more efficient programs, but it doesn’t (that I have found) go into much detail as to what “coalesced” means or how to go about implementing a method to utilize the concept. Can you please help me out with this?


Coalescing will be done if

  • Each thread writes or reads a 4, 8 or 16 (SIZE) byte value
  • AND thread n within a half-warp (16 threads) writes data[n]
  • AND data is aligned to 16*SIZE in device memory

Usually, you do this by having each thread read or write a 4-byte integer or floating-point value, making sure your data structures and strides are suitably aligned.

The book “GPU Gems 3” includes a worked example of an N-body simulation, complete with a Visual Studio project. It’s just a matter of copying the files to your PC, building the project, and running the executable.