Ray payload vs payload buffer

Hi all,
My system info:
OptiX Version:[6.1.2] Branch:[r421_00] Build Number:[26511982] ABI Version:[19] CUDA Version:[cuda100] 64-bit. Display driver: 430.26. Ubuntu 18.04.1. gcc 7.4.0. Cuda 10.0. 2x RTX 2080 Ti

I have been checking the recommendations in the GTC talk: https://developer.download.nvidia.com/video/gputechconf/gtc/2019/presentation/s9768-new-features-in-optix-6.pdf

It suggests minimizing payload size. I am simulating electromagnetic propagation, so I use relatively large payloads (around 100 bytes) that may get larger in the future. I do not use recursive tracing, just iterative tracing, so I was wondering whether I would get any benefit from replacing the ray payload with a small dummy payload and using a buffer to store and update the real payloads in the closest-hit program.

I have tested it, tracing around 6.5 million rays per launch on a small scene (a few small meshes), and got no conclusive performance difference, just a slight decrease.

So this is the question: is it advisable to replace large ray payloads with buffers in some cases?

I wouldn’t call 100 bytes a big payload, but you can always shrink the rtPayload variable to a single 64-bit pointer: define your payload as a local struct inside the ray generation program and make a pointer to that structure the only member of your rtPayload variable.
That incurs an indirection, but it’s the smallest you can get for arbitrary payload sizes.

Make sure to put all structure members at their natural alignment offsets to avoid illegal access errors, or at least to minimize the padding added by the compiler.
I normally order them by alignment size, for example from top to bottom (type, alignment in bytes):
float4 (16) > float2, pointers (8) > float3, float (4) > short (2) > char (1).
Slot the other data types in similarly. See Table 3, Alignment Requirements, here:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#vector-types__alignment-requirements-in-device-code

I would not use buffers for that. It would require input_output buffers, you would need to code explicitly for multi-GPU configurations to avoid going over the PCI-E bus to pinned memory on every access, and the amount of required memory grows with the launch size (at 6.5 million rays times ~100 bytes, that is already roughly 650 MB per launch).

OK, thanks a lot for the recommendations.