Best Approach for Collecting Particle Flux on a Triangulated Surface?

Hi there,

I’m currently developing an OptiX application to compute the flux of particles onto a triangulated surface using ray tracing and Monte Carlo sampling. The core idea is to simulate particles generated randomly on a source plane, trace them to the surface, and assign some weight to the triangle they hit. Afterward, the particle’s weight is reduced, and it is reflected off the surface. This process repeats until the particle’s weight falls below a defined threshold or it misses the surface entirely.

So far, I’ve successfully implemented this approach using a global GPU buffer with one entry per surface triangle. In the __closesthit__ program, I use atomicAdd to increment the entry of the hit triangle. Here’s a minimal example of the __closesthit__ program:

extern "C" __global__ void __closesthit__particle()
{
    PerRayData *prd = getPRD<PerRayData>();

    // Deposit the particle's current weight on the triangle it hit.
    const unsigned int primID = optixGetPrimitiveIndex();
    atomicAdd(&params.resultBuffer[primID], prd->rayWeight);

    // Reduce the weight by the sticking probability and reflect diffusely.
    prd->rayWeight *= 1.0f - params.sticking;
    diffuseReflection(prd);
}

While researching this topic, I came across a similar idea in this forum: accumulating particle hits in a per-particle buffer and performing a reduction in a separate kernel to avoid relying on atomicAdd.

This leads me to my question:

  • Is using atomicAdd a poor choice for this kind of task due to performance concerns or potential drawbacks in scalability?
  • Would a per-particle buffer with subsequent reduction make more sense in this scenario?
  • Are there other efficient methods to avoid or optimize the use of atomicAdd in this context?

Thank you in advance for your insights!

Best regards,
Tobias

Hi @tob.re, and welcome.

Interesting problem, similar to tracing rays from light sources. Anyway:

atomicAdd() is only a bad idea if it makes your performance poor compared to your system’s peak; you need to measure that first and then decide what to do. The usual alternatives are: allocate private memory for each thread and run a reduction afterwards, use warp-level primitives to aggregate values before issuing a single atomic per warp, and so on.

A quick way to estimate the peak performance is to replace the atomicAdd() with a plain non-atomic add: the results will be wrong, but the timing gives you an upper bound to compare against.

Memory: if only a few triangles are ever hit, allocating an entry for every triangle might be wasteful, but I assume that’s not an issue for now.

Given your description, calling diffuseReflection() even when the weight has already dropped below your termination threshold looks like wasted work; check the threshold in the closest-hit program before reflecting.