When I started working on a raytracing simulation, I followed the recommendation to use the OptiX payload values to store PerRayData (PRD). I use 26 payload slots (4xfloat3, 1xfloat4, 4xuint, 1xcuRAND state with 6 uint).
Out of curiosity I implemented the PRD as a local struct in the __raygen_rg program and used two payload values to pass a pointer to it. To my surprise I found this method to be about 10% faster.
I use an iterative path tracing approach and simulate on average 150 intersections for 100’000’000 rays.
Do I risk something with this approach? Could I pay a performance penalty on other systems (currently I run the sim on an RTX 3090)?
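For readers unfamiliar with the pointer approach: a 64-bit pointer does not fit in one 32-bit payload slot, so it is split across two. Below is a minimal sketch in plain C++ so it can be checked on the host. The PerRayData fields and the helper names packPointer/unpackPointer are assumptions for illustration, not the poster's actual code. In device code the same helpers would typically be marked __device__, and the two halves would be read back with optixGetPayload_0() and optixGetPayload_1().

```cpp
#include <cstdint>

// Hypothetical per-ray data; the real struct in the post holds 26 words.
struct PerRayData {
    float        origin[3];
    float        direction[3];
    unsigned int depth;
};

// Split a 64-bit pointer into two 32-bit values (one per payload slot).
inline void packPointer(const void* ptr, uint32_t& p0, uint32_t& p1) {
    const uint64_t bits = static_cast<uint64_t>(reinterpret_cast<uintptr_t>(ptr));
    p0 = static_cast<uint32_t>(bits >> 32);          // high half
    p1 = static_cast<uint32_t>(bits & 0xFFFFFFFFu);  // low half
}

// Rebuild the pointer from the two 32-bit values.
inline void* unpackPointer(uint32_t p0, uint32_t p1) {
    const uint64_t bits = (static_cast<uint64_t>(p0) << 32) | static_cast<uint64_t>(p1);
    return reinterpret_cast<void*>(static_cast<uintptr_t>(bits));
}
```

With this layout the hit programs only receive two payload registers and reach the rest of the PRD through memory, which is the trade-off discussed in the replies.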
There is always the possibility that the balance could change on different GPU models. I don’t know for sure, but my guess is that with 26 payload slots the register pressure gets a little too high and leads to more spilling and/or lower occupancy than with the pointer version. Since the memory configurations vary across models and architectures, it is plausible that the perf gap between these two options could widen, narrow, or reverse on a different GPU.
Having both paths around might be reasonable just in case you find reasons to switch or specialize your kernel depending on GPU model. I would also recommend using Nsight Compute to investigate the performance of each setup. If there is additional spilling in the 26 payload slots case, that should be visible in the SASS of your kernel (as additional stores at the beginning, and additional loads near the end), and you should also be able to see it in the memory metrics for the kernel. If my assumptions are wrong, maybe you’ll see the extra time spent somewhere else.
Another thing to consider is whether you can squeeze your payload down to a smaller size. The OptiX samples use a smaller RNG state than cuRAND. If you can tolerate a worse random number generator, that would reduce the payload by 5 I think? Is the 4x3 float a matrix, and can that be constant folded into the scene or anything? Or reduced to a quaternion perhaps, if it’s a rigid body mat? The other 8 slots might be harder, but if you have the option to pack any pairs of floats as half2 types, or pairs of ints as 16 bit shorts, that could also save some payload slots.
This balance indeed seems to change even for different examples on my GPU. The scene described above is a sample with a few hundred thousand primitive spheres. For the same scene made with icospheres, the approach with the payload is slightly faster.
In this case the next step will be to reduce the size of the payload. Unfortunately, I found the random number generator in the examples to be insufficient. Changing to cuRAND gave me about a 40% performance hit. There might be a possibility to use cuRAND only in the __raygen program and then use the cheaper rng in the __closest_hit programs, but this is a last resort measure.
As for the other payload values:
The float4 vector is the Stokes vector storing the polarization state. I might get away with half precision here. The float3 vectors are the ray_origin and ray_dir (used to return the source of the next ray) and scattering_plane (reference frame for the Stokes vector).
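To make the half-precision idea concrete, here is a rough host-side sketch that squeezes four floats into two 32-bit payload words. Everything in it is an assumption for illustration: the helper names are made up, the conversion truncates instead of rounding, and it flushes half-precision subnormals (magnitudes below roughly 6.1e-5) to zero. On the GPU one would more likely rely on the conversions in cuda_fp16.h (for example __floats2half2_rn) than on hand-written bit manipulation.

```cpp
#include <cstdint>
#include <cstring>
#include <cmath>

// Convert a float to IEEE half bits (truncating; subnormals flushed to zero;
// overflow, infinity and NaN all map to infinity in this simplified sketch).
inline uint16_t floatToHalfBits(float f) {
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    const uint32_t sign = (x >> 16) & 0x8000u;
    const int32_t  exp  = static_cast<int32_t>((x >> 23) & 0xFFu) - 127 + 15;
    const uint32_t mant = x & 0x7FFFFFu;
    if (exp <= 0)  return static_cast<uint16_t>(sign);            // too small
    if (exp >= 31) return static_cast<uint16_t>(sign | 0x7C00u);  // too large
    return static_cast<uint16_t>(sign | (static_cast<uint32_t>(exp) << 10) | (mant >> 13));
}

// Convert IEEE half bits back to a float (half subnormals are read as zero).
inline float halfBitsToFloat(uint16_t h) {
    const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    const uint32_t mant = h & 0x3FFu;
    uint32_t x;
    if (exp == 0)       x = sign;
    else if (exp == 31) x = sign | 0x7F800000u | (mant << 13);
    else                x = sign | ((exp + 112u) << 23) | (mant << 13);
    float f;
    std::memcpy(&f, &x, sizeof(f));
    return f;
}

// Pack a 4-component vector into two 32-bit payload words.
inline void packStokes(const float s[4], uint32_t& p0, uint32_t& p1) {
    p0 = static_cast<uint32_t>(floatToHalfBits(s[0])) | (static_cast<uint32_t>(floatToHalfBits(s[1])) << 16);
    p1 = static_cast<uint32_t>(floatToHalfBits(s[2])) | (static_cast<uint32_t>(floatToHalfBits(s[3])) << 16);
}

// Unpack the two payload words back into four floats.
inline void unpackStokes(uint32_t p0, uint32_t p1, float s[4]) {
    s[0] = halfBitsToFloat(static_cast<uint16_t>(p0 & 0xFFFFu));
    s[1] = halfBitsToFloat(static_cast<uint16_t>(p0 >> 16));
    s[2] = halfBitsToFloat(static_cast<uint16_t>(p1 & 0xFFFFu));
    s[3] = halfBitsToFloat(static_cast<uint16_t>(p1 >> 16));
}
```

Half precision keeps about three significant decimal digits, so whether this is acceptable over roughly 150 scattering events per ray is something only a comparison against the float4 version can tell.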
The unsigned int values can definitely be better packed! Some of them as bool and some as short.
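As a sketch of what such packing could look like, the snippet below stores a 16-bit counter and two boolean flags in a single 32-bit payload word. The field names (depth, done, inside) and the bit layout are hypothetical; the post does not say what the four unsigned ints actually hold.

```cpp
#include <cstdint>

// Hypothetical layout of one 32-bit payload word:
//   bits 0..15 : bounce depth (fits a short)
//   bit  16    : "path terminated" flag
//   bit  17    : "ray is inside a medium" flag
constexpr uint32_t kDoneBit   = 1u << 16;
constexpr uint32_t kInsideBit = 1u << 17;

// Combine the counter and the two flags into one word.
inline uint32_t packState(uint16_t depth, bool done, bool inside) {
    return static_cast<uint32_t>(depth)
         | (done   ? kDoneBit   : 0u)
         | (inside ? kInsideBit : 0u);
}

// Accessors that extract the individual fields again.
inline uint16_t depthOf(uint32_t state)  { return static_cast<uint16_t>(state & 0xFFFFu); }
inline bool     isDone(uint32_t state)   { return (state & kDoneBit) != 0u; }
inline bool     isInside(uint32_t state) { return (state & kInsideBit) != 0u; }
```

The masking costs a few integer instructions per access, which is usually cheap compared with an extra live register per ray, but that is worth confirming in a profile.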
Let’s see how this helps.
Edit: With a reduction to 20 payload slots, the payload approach outperforms the pointer based approach already.
With a little more time on my hands I will investigate with Nsight Compute.
With a reduction to 20 payload slots, the payload approach outperforms the pointer based approach already.
Yes!! Very glad to hear that. I’m keen to hear how much faster you can make it still. ;)
The float3 vectors are the ray_origin and ray_dir (used to return the source of the next ray) and scattering_plane (reference frame for the Stokes vector).
Oh! Okay, so something I forgot to mention. If you haven’t looked into the payload semantics API, it might be a good time to check. For example, if you can mark some payload slots as read-only and others as write-only, the compiler can let those slots overlap in space (e.g. share registers), since their lifetimes might not overlap in time, if that makes sense. The same might be true if you only need some payload values for closest-hit and others only for, say, miss. Using the payload semantics is another way to potentially reduce the register pressure associated with payload slots.
I found the random number generator in the examples to be insufficient.
Okay, I’m not too surprised. ;) The one in the samples is indeed not high quality. cuRAND does sound expensive though. One possibility might be to use 2 different simple generators, and mix the results?
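One way to read "mix two simple generators" is sketched below: a 32-bit LCG and a 32-bit xorshift are stepped side by side and their outputs are XORed, giving a state of 2 uints instead of the 6 used for cuRAND in the original post. This is only an illustration. The struct and function names are invented, the constants are the commonly cited Numerical Recipes LCG parameters and Marsaglia's 13/17/5 xorshift shifts, and nothing here says the result is good enough for this simulation; it would need to be checked against the cuRAND results.

```cpp
#include <cstdint>

// Two small generator states, one payload slot each.
struct MixedRng {
    uint32_t lcg;  // linear congruential generator state
    uint32_t xs;   // xorshift32 state (must never be zero)
};

// Seed both generators from one value; the xorshift state is kept nonzero.
inline MixedRng seedMixedRng(uint32_t seed) {
    MixedRng r;
    r.lcg = seed;
    r.xs  = seed ^ 0x9E3779B9u;
    if (r.xs == 0u) r.xs = 1u;
    return r;
}

// Advance both generators and combine their outputs with XOR.
inline uint32_t nextU32(MixedRng& r) {
    r.lcg = r.lcg * 1664525u + 1013904223u;  // LCG step (wraps mod 2^32)
    r.xs ^= r.xs << 13;                      // xorshift32 step
    r.xs ^= r.xs >> 17;
    r.xs ^= r.xs << 5;
    return r.lcg ^ r.xs;
}

// Uniform float in [0, 1) built from the top 24 bits.
inline float nextFloat(MixedRng& r) {
    return static_cast<float>(nextU32(r) >> 8) * (1.0f / 16777216.0f);
}
```

The idea is that the weaknesses of the two generators differ, so the XOR of the streams tends to hide the patterns of either one alone, at the cost of a few extra integer operations per sample.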