Global payload

Hello all,

I am building some PTX shaders to be employed by Optix 7 and am trying to build an __anyhit__ function that sets an attenuation value to zero and then terminates the Ray. The code I have follows:
system_param.h:

struct PerRayDataShade {
  float attenuation;
};
struct SystemParam {
  PerRayDataShade prdShade;
};

myShader.cu:

#include "system_param.h"
extern "C" __device__ SystemParam param;`
extern "C" __global__ void __anyhit_shadow() {
   param.prdShade.attenuation = 0.0f;
   optixTerminateRay();
 }

The code works but I don’t trust the setting of attenuation to zero in the ‘__anyhit_shadow()’ function. Can anyone tell me the more standard way of executing this type of operation using Optix 7?

The standard way of initializing your ray payload is to do it before calling optixTrace(). For this case, where you’re asking about shadow rays, you can put the initialization wherever the shadow ray is cast, it might be in a closest-hit shader, or in raygen. For primary rays, payload initialization would always go in the raygen program immediately prior to calling optixTrace().

Any-hit will be called before closest-hit, so your code might work for you in that sense. But, don’t forget that any-hit shaders are not expected to be called in depth order, so if you are initializing something in your payload for use in any-hit shaders, it won’t work. Also calling optixTerminateRay() can work for opaque shadows, but if you need to accumulate your attenuation with transparent objects, you will need something more sophisticated in order to sort your any-hit intersections in depth order and composite them properly.

Bigger picture here is that if you are casting shadow rays against RTX enabled triangle meshes, you should consider disabling your any-hit shader completely (OPTIX_GEOMETRY_FLAG_DISABLE_ANYHIT or OPTIX_INSTANCE_FLAG_DISABLE_ANYHIT or OPTIX_RAY_FLAG_DISABLE_ANYHIT), and instead use closest-hit shaders and the the flag OPTIX_RAY_FLAG_TERMINATE_ON_FIRST_HIT.

It used to be the case in OptiX 5 and earlier, before there was hardware acceleration, that using any-hit programs for shadows and terminating the ray was the most performant way to do shadowing. With the hardware acceleration, now the recommended approach is to avoid using any-hit programs completely, if you can.

–
David.

Thank you @dhart for the very informative post. I will do what I can to avoid any-hit programs for performance sake.

1 Like

To be clear, the single declaration of extern "C" __device__ SystemParam param; with the PerRayDataShade struct inside is not going to function at all because that completely ignores the fact that there are thousands of CUDA threads running simultaneously which would all hammer onto that same memory which breaks your intended per-ray attenuation value.

If you need anything to happen per ray, it needs to be controlled via the payload registers inside the optixTrace calls.

That also means there isn’t actually a need to have a payload structure with a single float value in your case, but you should just use one of the maximum 8 payload registers which would be faster than encoding a 64-bit pointer to some local payload structure memory.
You would write to that register with the optixSetPayload_0 device function (or whichever register index you use) before calling optixTerminateRay() inside your anyhit program.

But as David explained, if you do not have any geometry which needs to call optixIgnoreIntersection() in an anyhit program for cutout opacity etc., then the shadow/visibility ray doesn’t need an anyhit program at all, because you can handle that quicker in OptiX 7 by using the aforementioned instance and ray flags.

Check the traceOcclusion() functions used in various OptiX SDK 7.2.0 examples.

Thank you for the information @droettger, it is very helpful. I will lose the extern "C" __device__ SystemParam param; call.

I found the following online:

extern "C" __global__ void __miss__ms() {
  MissData* miss_data =  reinterpret_cast<MissData*>( optixGetSbtDataPointer() );
  setPayload( miss_data->bg_color );
}

Is this an example of passing data to a program during OptiX execution?

Thank you again for all the assistance to a newbie.

Yes, as long as you can fit your payload into the available 8 unsigned int payload registers, use them.
See these device functions: optixTrace() overloads, optixSetPayload_*(), optixGetPayload_*()

I found the following online

That’s literally code from the OptiX SDK examples.
I would recommend to do a “Find in Files” over all *.h; *.cpp; *.cu source files with your favorite text editor inside the OptiX SDK, and then read the documentation about the OptiX device functions used inside the CUDA code.

If you need more data you would need to declare a local payload structure inside your ray generation program and encode the 64-bit pointer to it into two of the payload registers.
Here is my ray generation program using split and merge pointers
(The SDK example use a different implementation, search for packPointer and unpackPointer.)

Even then it still makes sense to keep the most frequently accessed fields in that inside payload registers for performance reasons. Avoid as many memory accesses as you can.
See this thread for related comments.

1 Like

Thanks again for all the assistance @droettger. I really like your split and merge operations.

I assume the following snippet would be okay with regards to properly using these kernels from inside an OptiX program (e.g. __raygen_rg()):

float4 totalResult = make_float4(0.0f, 0.0f, 0.0f, 0.0f);
// do some stuff
PerRayData prd;  // defined per-ray data struct 
uint2 payload = splitPointer(&prd);
optixTrace(..., payload.x, payload.y);
totalResult += prd.result;  // where result is field in prd

Thanks again.

Yes, that’s exactly how it works.

Similar in my raygeneration.cu files if you look for radiance and prd.radiance.

I use my vector_math.h header which contains some convenient overloads for the CUDA vector types which allow to write this shorter:
float4 totalResult = make_float4(0.0f);
The original vec_math.h inside the OptiX SDK does that as well, but I think the lerp(), bilerp(), and roundup() functions do not belong into the initial #ifdef(__CUDACC__) section, but only the min() and max() functions. That’s why I have my own header which added some more overloads as well.

Cool.

Thanks again @droettger