OptiX 7: Passing multiple ray data to __closesthit__ program

Good Evening,

I have an OptiX 7 question about passing multiple per-ray data structures from one program (say a __raygen__ program) to another program.

Say I have created two different structs:

struct RayDataRadiance {
    float4 result;
    int depth;
    uint4 random;
};

struct RayDataShade {
    float attenuation;
};

Suppose in my __closesthit__ program I need both these Ray data values.

Can I retrieve these two different ray data structs from the OptiX 7 registers within the single __closesthit__ program? If so, how?

Simple code example if possible would be nice.

Thank you in advance for any help.

Hi @peterSteele123, welcome!

In OptiX 7, you get 8 32-bit register-based payload slots to use any way you want. Since pointers are 64 bits, they normally take 2 payload slots. So you can easily pass 4 separate pointers if you like. Or you can pass the data directly in the payload slots. It’s up to you how you want to allocate your payload registers.

In this case, passing your 4 int & float values will also take 4 payload slots. The added benefit of passing the data directly, rather than passing pointers, is you get to avoid the pointer indirections inside your closest-hit program, which could end up saving memory bandwidth.

Another option here is to put your two structs inside of a third container struct and pass a single pointer in 2 payload slots. What is best here depends on your own code structure and how large you expect your payloads to become over time. If you know you’ll eventually have more than 32 bytes in your payload, then using one or more struct pointers is reasonable. If you think you can keep your payload to 32 bytes or less, then our default advice is to try to use the payload slots directly rather than passing any struct pointers at all.
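As a rough illustration of the container-struct option, here is a host-side C++ sketch (not device code). The name RayDataAll, the simplified array fields standing in for float4/uint4, and the helper shadeFromPayload are assumptions made for this example; the pack/unpack helpers follow the usual high-word/low-word split for getting a 64-bit pointer through two 32-bit payload values.

```cpp
#include <cstdint>

// Host-side stand-ins for the two structs from the question (hypothetical,
// simplified: plain arrays replace CUDA's float4/uint4).
struct RayDataRadiance { float result[4]; unsigned int random[4]; int depth; };
struct RayDataShade    { float attenuation; };

// Hypothetical container: a single pointer to this reaches both structs.
struct RayDataAll
{
    RayDataRadiance radiance;
    RayDataShade    shade;
};

// Split a 64-bit pointer into two 32-bit payload values (high word first).
inline void packPointer(void* ptr, std::uint32_t& p0, std::uint32_t& p1)
{
    const std::uint64_t bits = reinterpret_cast<std::uintptr_t>(ptr);
    p0 = static_cast<std::uint32_t>(bits >> 32);
    p1 = static_cast<std::uint32_t>(bits & 0xffffffffu);
}

// Rebuild the pointer from the two payload values.
inline void* unpackPointer(std::uint32_t p0, std::uint32_t p1)
{
    const std::uint64_t bits = (static_cast<std::uint64_t>(p0) << 32) | p1;
    return reinterpret_cast<void*>(static_cast<std::uintptr_t>(bits));
}

// Roughly what a closest-hit program would do after reading its two payload
// slots: recover the container, then use members of both structs.
inline float shadeFromPayload(std::uint32_t p0, std::uint32_t p1)
{
    RayDataAll* prd = static_cast<RayDataAll*>(unpackPointer(p0, p1));
    prd->radiance.depth += 1;       // touch the radiance part
    return prd->shade.attenuation;  // read the shade part
}
```

On the device, p0 and p1 would be the two payload arguments given to optixTrace and read back with optixGetPayload_0() and optixGetPayload_1(); the container itself would typically be a local variable of the program that calls optixTrace.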

Keep in mind that using more registers can lead to lower performance, so there is a good reason to minimize the number of payload slots you use (and similarly minimize the number and scope of local variables you use as well).

The sample code you should study for passing a struct pointer is the OptiX SDK sample called “optixPathTracer”. It demonstrates how to pass a struct via the payload slots. It doesn’t show 2 structs, but the extension from 1 pointer to 2 pointers is trivial. To see an example of passing payload values directly, you could play with the “optixRaycasting” sample.


David.

Here is a link to the relevant documentation in our Programming Guide:

https://raytracing-docs.nvidia.com/optix7/guide/index.html#device_side_functions#payload-access


David.

Also maybe look into these recent related threads with more links to example code:
https://forums.developer.nvidia.com/t/make-ray-using-optix-7-0-0/158577
https://forums.developer.nvidia.com/t/global-payload/159415
https://forums.developer.nvidia.com/t/mesh-artifacts-when-using-anyhit-for-transparency-optix-7/156600

I would reorder the fields in your struct RayDataRadiance by CUDA data type alignment restrictions.
The way you have it now requires the compiler to insert three 32-bit words of padding after the int depth to make the uint4 random 16-byte aligned.

To expand on David’s explanation of how to combine both per ray data structures into one and use only two 32-bit payload registers on optixTrace(), that could look like this:

struct PerRayData
{
    // 16 byte aligned
    float4 result;
    uint4 random;

    // 8 byte aligned data like float2 would go here.

    // 4 byte aligned
    int depth;
    float attenuation; // Optionally put this here and use the same per ray data for both ray types.
};
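The alignment argument can be checked on the host with a small sketch. Float4 and UInt4 below are hypothetical stand-ins that only mimic the 16-byte alignment of CUDA's float4 and uint4, and the numbers assume 4-byte int and float. Under those assumptions, the original field order leaves 12 bytes of padding in the middle of the struct, while the reordered layout moves the padding to the tail, where further 4-byte fields such as attenuation fit without growing the struct.

```cpp
#include <cstddef>

// Hypothetical host stand-ins with the 16-byte alignment of CUDA's float4/uint4.
struct alignas(16) Float4 { float x, y, z, w; };
struct alignas(16) UInt4  { unsigned int x, y, z, w; };

// Field order from the question: int sits between the two 16-byte members.
struct RadianceOriginal { Float4 result; int depth; UInt4 random; };

// Same order with the shadow attenuation appended at the end.
struct OriginalPlusShade { Float4 result; int depth; UInt4 random; float attenuation; };

// Reordered layout: 16-byte members first, 4-byte members last.
struct PerRayData { Float4 result; UInt4 random; int depth; float attenuation; };

// depth occupies bytes 16..19, random starts at 32: three 32-bit words of padding.
static_assert(offsetof(RadianceOriginal, depth) == 16, "depth follows result");
static_assert(offsetof(RadianceOriginal, random) == 32, "12 bytes of padding before random");
static_assert(sizeof(RadianceOriginal) == 48, "original struct size");

// Appending a float to the original order costs another 16 bytes.
static_assert(sizeof(OriginalPlusShade) == 64, "attenuation starts a new 16-byte row");

// In the reordered struct, depth and attenuation share the last 16-byte row.
static_assert(offsetof(PerRayData, depth) == 32, "depth after the 16-byte members");
static_assert(offsetof(PerRayData, attenuation) == 36, "attenuation fills the tail");
static_assert(sizeof(PerRayData) == 48, "reordered struct size");
```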

Again, it’s faster to keep the single float attenuation for the shadow ray separate. Hold it in a local variable around the optixTrace() call for the shadow ray and encode that float into a single payload register for that ray type only. This works if you only need the result temporarily, to attenuate the lighting result directly after the trace for the shadow ray returns.
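A minimal host-side sketch of encoding one float into one 32-bit payload value follows. The helper names floatToPayload and payloadToFloat are made up for this example; in device code the CUDA intrinsics __float_as_uint() and __uint_as_float() do the same bit reinterpretation.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "float must fit one payload slot");

// Reinterpret the bits of a float as a 32-bit payload value
// (host equivalent of CUDA's __float_as_uint).
inline std::uint32_t floatToPayload(float f)
{
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Reinterpret a 32-bit payload value as a float
// (host equivalent of CUDA's __uint_as_float).
inline float payloadToFloat(std::uint32_t u)
{
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
```

In a shadow-ray setup, the caller would initialize a local unsigned int with floatToPayload of the starting attenuation, pass it as the only payload argument to optixTrace, and decode it after the call returns.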


Thank you @dhart for the reply. The link especially is a great help.

Thanks @droettger for the response. The links are a good resource - the splitPointer and mergePointer are good design.

Another question along these lines.

What if, for argument’s sake, I needed to define multiple payload structures, and in my __closesthit__ program I had a condition where I would call optixTrace(…) with different structures? For instance, in one case I would need to pass a ‘shade’ payload and in another it would be ‘radiance’. Would it be best to build separate splitPointer and mergePointer operations for each structure type?

Thanks,

The splitPointer and mergePointer operations are only there to get a 64-bit pointer value into and out of two 32-bit payload slots, nothing more. There’s no reason to customize those operations and duplicate the code for different structures, unless you really want to. The analogous functions in the optixPathTracer sample are slightly more generic, agnostic to the payload structure type, but that means you have to do a little type-casting in your code. Since Detlef’s version is a union of a specific type, he’s avoiding the type casting, but hard-coding the type. Below is the pointer packing code from the optixPathTracer sample.

But – to emphasize what I said earlier, what you might want to do is pack your entire structure into the payload slots, rather than packing a pointer to your structure. Think about this as being conceptually similar to passing your struct by value instead of passing by reference. The reason to do this is because it can be faster than using a pointer (just like accessing a single int passed directly into a function can be faster than passing and then de-referencing a pointer to an int). Which way is faster for you needs to be tested & measured, just be aware that using the payload slots in this “pass by value” fashion is easier for us to optimize, so potentially faster for you. In that case, you would indeed need to use a different packing function for each payload struct type (if you want to write packing functions). In your closest hit, you can unpack the struct locally, which might get optimized away if you’re lucky, or alternatively you could set the individual items directly using calls to optixSetPayload_0,1,2,.... Setting the individual items directly has the potential to be faster than unpacking your struct into memory, but it’s a little more work to manage a shader that is written that way.

static __forceinline__ __device__ void* unpackPointer( unsigned int i0, unsigned int i1 )
{
    const unsigned long long uptr = static_cast<unsigned long long>( i0 ) << 32 | i1;
    void*           ptr = reinterpret_cast<void*>( uptr );
    return ptr;
}

static __forceinline__ __device__ void  packPointer( void* ptr, unsigned int& i0, unsigned int& i1 )
{
    const unsigned long long uptr = reinterpret_cast<unsigned long long>( ptr );
    i0 = uptr >> 32;
    i1 = uptr & 0x00000000ffffffff;
}

static __forceinline__ __device__ RadiancePRD* getPRD()
{
    const unsigned int u0 = optixGetPayload_0();
    const unsigned int u1 = optixGetPayload_1();
    return reinterpret_cast<RadiancePRD*>( unpackPointer( u0, u1 ) );
}

static __forceinline__ __device__ void traceRadiance( ..., RadiancePRD* prd )
{
    unsigned int u0, u1;
    packPointer( prd, u0, u1 );
    optixTrace( ..., u0, u1 );
}
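For comparison, the "pass by value" alternative described above could be sketched on the host as follows. ShadePayload, kShadeSlots, packShade and unpackShade are hypothetical names for this example, and the two-field layout is an assumption. In device code, each element of slots would be one payload argument of optixTrace, read and written inside the hit programs with optixGetPayload_N() and optixSetPayload_N().

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical small payload, chosen so it fits in two 32-bit slots.
struct ShadePayload
{
    float attenuation;
    int   depth;
};

// Number of 32-bit payload slots this struct occupies.
constexpr int kShadeSlots = sizeof(ShadePayload) / sizeof(std::uint32_t);

static_assert(sizeof(ShadePayload) % sizeof(std::uint32_t) == 0, "payload must be whole slots");
static_assert(kShadeSlots <= 8, "must fit the 8 payload slots mentioned in this thread");

// Copy the struct's bits into individual slot values (one per payload register).
inline void packShade(const ShadePayload& prd, std::uint32_t (&slots)[kShadeSlots])
{
    std::memcpy(slots, &prd, sizeof prd);
}

// Rebuild the struct from the slot values.
inline ShadePayload unpackShade(const std::uint32_t (&slots)[kShadeSlots])
{
    ShadePayload prd;
    std::memcpy(&prd, slots, sizeof prd);
    return prd;
}
```

A packing function like this is tied to one struct type, which is why this approach needs one pair of functions per payload type, unlike the pointer helpers above.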


David

Thanks @dhart for the very detailed and professional answer. It is a lot of help for someone learning the technology and I will definitely take your advice in consideration.

Another couple of quick newbie questions, if that is okay.

If I am in an OptiX program (say __closesthit__) and I have a conditional that determines whether I employ a defined payload structure (say RadiancePRD) and later another condition that employs a different payload structure (say ShadePRD), does calling optixTrace( ) effectively synchronize so that the payload structure(s) don’t overwrite? For example (using code analogous to code you presented):

extern "C" __global__ void __closesthit__ch()
{
    // do some stuff
    if (conditionA == true) {
        RadiancePRD* radprd = getPRD();
        // do some stuff with radiance
        traceRadiance( ..., radprd ); // does optixTrace synchronize ??
        radprd = getPRD();
        // do some stuff with radiance result
    }
    if (conditionB == true) {
        ShadePRD* shdprd = getPRD(); // defined for Shade payload
        // do some stuff with shade
        traceShade( ..., shdprd ); // does optixTrace synchronize
        shdprd = getPRD(); // will this data be consistent with shade or radiance ?
        // do some stuff with shade result
    }
} // __closesthit__ch()
I apologize for the poor formatting of above code - couldn’t get it to work quite right.

Also, do the optixGetAttributes_0() and optixSetAttributes_0() employ the same register(s) as optixGetPayload_0() and optixSetPayload_0() ? I guess I am concerned that my payload structs will become overwritten.

Once again. Thank you for the detailed and patient response to what I am sure is an easy question for an expert.

The attribute slots and the payload slots are completely independent. You don’t need to worry about which registers they use; they will never overwrite each other. The compiler takes care of register allocation, just as it does with local variables, so you never need to think about it.

In this case it doesn’t matter whether optixTrace synchronizes(*). If you use different payload types conditionally, your payloads cannot be overwritten by other threads or by subsequent trace calls in the same thread (unless you have bugs or confuse which payload type you’re writing to in your own code). The code sample you wrote will work just fine with no problems, there will be no crossover or possibility of getting the wrong payload type, as long as you manage your payload type correctly, and you tell your shaders which payload type you’ve passed. OptiX is a single thread programming model, meaning you aren’t required to consider what other threads are doing, and you don’t need to think about thread synchronization in the CUDA sense.

Think about your code just like you would think about host code. The two optixTrace calls in your sample are completely independent, the scope of the payload slots is limited to the trace call, so they don’t overlap. Just think of the payload slots as arguments to optixTrace; once the call returns, the payload slots belonging to it no longer exist. There is no possibility for your shdprd to be confused for a radprd by OptiX.

(*) If conditionA is evaluated at run-time, then the likely outcome is threads in a warp that evaluate conditionA to false will be idle while other threads run the first block with traceRadiance. There is some synchronizing under the hood intended to improve performance, but that doesn’t affect your payloads in any way. Whatever synchronization happens is not relevant to the question of which payload is being used.

Does that make sense? Maybe we’ve made it look a little scary or confusing at first glance, but I guess the payload usage is easier and safer than what you’re thinking. I’d recommend taking some time to peruse through the programming guide and play with the SDK samples, it will become clearer.


David.


Thank you @dhart, now everything makes perfect sense. I was stuck thinking that the OptiX system needed to be synchronized and treated like CUDA code within a given program. What you said cleared it all up.

Thanks David, I couldn’t have said it better. That’s exactly why I did it that way.

You can of course change the union and function arguments from PerRayData* to void* and have exactly the same functionality with more flexibility if you really need to encode pointers to different structures.

All that example code is there to give ideas to learn from and not at all the only way things can be done.

I know that my method of setting the shadow flag inside the PerRayData (PRD) is not the optimal way in that suite of examples. It comes from a much more advanced renderer that can calculate cutout opacity from procedural expressions, which requires more state stored inside the PRD struct. I left the shadow flag method as it was purely for code continuity, in case I ever improve the example renderer’s capabilities in the future.