PerRayData in local struct more performant than in OptiX payload

rafael.ottersberg · November 6, 2024, 11:54am

Hi everyone

When I started working on a ray tracing simulation, I followed the recommendation to use the OptiX payload values to store the PerRayData (PRD). I use 26 payload slots (4xfloat3, 1xfloat4, 4xuint, and 1 cuRAND state with 6 uint).

Out of curiosity I implemented the PRD as a local struct in the __raygen__rg program and used two payload values to pass a pointer to it. To my surprise, I found this method to be about 10% faster.
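For readers unfamiliar with the pointer approach, here is a minimal sketch of how a 64-bit pointer can be split across two 32-bit payload slots. It is written as plain C++ so it compiles on the host; the `PerRayData` fields and the helper names `packPointer` / `unpackPointer` are illustrative assumptions, not the code used in this simulation.

```cpp
#include <cstdint>

// Hypothetical stand-in for the per-ray data struct; field names are illustrative only.
struct PerRayData {
    float        stokes[4];
    unsigned int depth;
};

// Split a 64-bit pointer into two 32-bit values, one per payload slot.
inline void packPointer(void* ptr, uint32_t& hi, uint32_t& lo) {
    const uint64_t bits = reinterpret_cast<uintptr_t>(ptr);
    hi = static_cast<uint32_t>(bits >> 32);
    lo = static_cast<uint32_t>(bits & 0xffffffffull);
}

// Rebuild the pointer from the two 32-bit payload values.
inline void* unpackPointer(uint32_t hi, uint32_t lo) {
    const uint64_t bits = (static_cast<uint64_t>(hi) << 32) | lo;
    return reinterpret_cast<void*>(static_cast<uintptr_t>(bits));
}
```

In device code the helpers would be marked `__device__`, the two values would be passed as payload arguments to `optixTrace`, and the hit and miss programs would read them back with `optixGetPayload_0()` and `optixGetPayload_1()` before calling the unpack helper.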

I use an iterative path tracing approach and simulate on average 150 intersections for 100’000’000 rays.

Do I risk anything with this approach? Could I pay a performance penalty on other systems (currently I run the simulation on an RTX 3090)?

Best,
Rafael

dhart · November 6, 2024, 6:46pm

Hi @rafael.ottersberg,

There certainly is always the possibility that the balance could change on different GPU models. I don’t know, but I might speculate that the issue with 26 payload slots is that register pressure maybe gets a little too high and leads to either more spilling and/or lower occupancy than with the pointer version. Since the memory configurations vary across models and architectures, it is plausible that the perf discrepancy of these two options could possibly widen, or narrow, or reverse on a different GPU.

Having both paths around might be reasonable just in case you find reasons to switch or specialize your kernel depending on GPU model. I would also recommend using Nsight Compute to investigate the performance of each setup. If there is additional spilling in the 26 payload slots case, that should be visible in the SASS of your kernel (as additional stores at the beginning, and additional loads near the end), and you should also be able to see it in the memory metrics for the kernel. If my assumptions are wrong, maybe you’ll see the extra time spent somewhere else.

Another thing to consider is whether you can squeeze your payload down to a smaller size. The OptiX samples use a smaller RNG state than cuRAND. If you can tolerate a worse random number generator, that would reduce the payload by 5 I think? Is the 4x3 float a matrix, and can that be constant folded into the scene or anything? Or reduced to a quaternion perhaps, if it’s a rigid body mat? The other 8 slots might be harder, but if you have the option to pack any pairs of float as half2 types, or pairs of ints as 16 bit shorts, that could also save some payload slots.

–
David.

rafael.ottersberg · November 6, 2024, 8:45pm

Hi David

Thanks for your suggestions!

This balance indeed seems to change even for different examples on my GPU. The scene described above is a sample with a few hundred thousand primitive spheres. For the same scene made with icospheres, the approach with the payload is slightly faster.

In this case the next step will be to reduce the size of the payload. Unfortunately, I found the random number generator in the examples to be insufficient. Changing to cuRAND gave me about a 40% performance hit. There might be a possibility to use cuRAND only in the __raygen__ program and then use the cheaper RNG in the __closesthit__ programs, but this is a last resort measure.
As for the other payload values:
The float4 vector is the stokes vector storing the polarization state. I might get away with half precision here. The float3 vectors are the ray_origin and ray_dir (used to return the source of the next ray) and scattering_plane (reference frame for the stokes vector).

The unsigned int values can definitely be better packed! Some of them as bool and some as short.
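As an illustration of this kind of packing, here is a minimal sketch that stores two boolean flags and one 16-bit counter in a single 32-bit payload slot. The bit layout and the names (`packState`, `terminated`, `inside`, `bounces`) are hypothetical and chosen only for the example; they are not the actual fields of this simulation.

```cpp
#include <cstdint>

// Hypothetical layout of one 32-bit payload slot:
//   bit 31      : "terminated" flag
//   bit 30      : "inside medium" flag
//   bits 16..29 : unused
//   bits 0..15  : 16-bit bounce counter
inline uint32_t packState(bool terminated, bool inside, uint16_t bounces) {
    return (static_cast<uint32_t>(terminated) << 31) |
           (static_cast<uint32_t>(inside) << 30) |
           static_cast<uint32_t>(bounces);
}

inline bool     isTerminated(uint32_t s) { return ((s >> 31) & 1u) != 0u; }
inline bool     isInside(uint32_t s)     { return ((s >> 30) & 1u) != 0u; }
inline uint16_t bounceCount(uint32_t s)  { return static_cast<uint16_t>(s & 0xffffu); }
```

With a layout like this, three of the original unsigned int slots would collapse into one, at the cost of a few shift and mask instructions on each access.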

Let’s see how this helps.
Edit: With a reduction to 20 payload slots, the payload approach outperforms the pointer based approach already.

When I have a little more time on my hands, I will investigate with Nsight Compute.

Best,
Rafael

dhart · November 6, 2024, 11:18pm

With a reduction to 20 payload slots, the payload approach outperforms the pointer based approach already.

Yes!! Very glad to hear that. I’m keen to hear how much faster you can make it still. ;)

The float3 vectors are the ray_origin and ray_dir (used to return the source of the next ray) and scattering_plane (reference frame for the stokes vector).

Oh! Okay so something I forgot to mention. If you haven’t looked into the payload semantics API, it might be a good time to check. One example is that if you can mark some payload slots as read only, and others as write only, then the compiler can allow those slots to overlap in space (e.g. which registers are used) since they might not intersect in time, if that makes sense. The same might be true if you only need some payload values for closest-hit and others only for, say, miss. Using the payload semantics is another way to potentially reduce the register pressure associated with payload slots.

https://raytracing-docs.nvidia.com/optix8/guide/index.html#payload#payload

I found the random number generator in the examples to be insufficient.

Okay, I’m not too surprised. ;) The one in the samples is indeed not high quality. cuRAND does sound expensive though. One possibility might be to use 2 different simple generators, and mix the results?
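To make the idea of mixing two simple generators concrete, here is a minimal sketch that combines a 32-bit LCG with a xorshift32 generator by XOR-ing their outputs. The state takes two payload slots instead of the six used by the cuRAND state. The struct and function names are hypothetical, the combination rule is just one possible choice, and the statistical quality is an open question that would have to be validated against the needs of the simulation.

```cpp
#include <cstdint>

// Two 32-bit words of state: one per generator.
struct MixedRng {
    uint32_t lcg;
    uint32_t xs;
};

inline MixedRng seedRng(uint32_t seed) {
    MixedRng r;
    r.lcg = seed;
    r.xs  = seed ^ 0x9e3779b9u;
    if (r.xs == 0u) r.xs = 1u;  // xorshift state must never be zero
    return r;
}

inline uint32_t nextUint(MixedRng& r) {
    // Linear congruential step (Numerical Recipes constants).
    r.lcg = r.lcg * 1664525u + 1013904223u;
    // Marsaglia xorshift32 step.
    r.xs ^= r.xs << 13;
    r.xs ^= r.xs >> 17;
    r.xs ^= r.xs << 5;
    // Mix the two streams.
    return r.lcg ^ r.xs;
}

// Uniform float in [0, 1) built from the top 24 bits.
inline float nextFloat(MixedRng& r) {
    return static_cast<float>(nextUint(r) >> 8) * (1.0f / 16777216.0f);
}
```

Each ray would get its own seed (for example derived from the launch index), and the two state words would travel in the payload in place of the larger cuRAND state.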

–
David.
