Function pointers in rg()

Hello!
I couldn’t find anything regarding this topic on the OptiX forum, and I am not even sure if this is the smartest way to do this.
I am using OptiX 7.1 with CUDA 11.2.

My problem:
For each primitive in the scene, I run the ray generation program rg() to sample the surfaces of the different (custom) primitives.
To avoid switch/case and to reduce comparisons in the loop, I wanted to pass a structure with pointers to the corresponding sampling functions.

in rg():

typedef float3(* Ori) (float, float, const GeometryData::paramQuadData* d);
typedef float3(* Dir) (float, float, const float3*, const GeometryData::paramQuadData* d);
typedef float3(* Nor) (const float3*, const GeometryData::paramQuadData* d);

struct fun {
	Ori ori;
	Nor nor;
	Dir dir;
};

fun sample;

switch (data->type) {
	case quadType::disc:
		sample.ori = &sampleDiscPoint;
		sample.dir = &sampleDiscDir;
		sample.nor = &normalFromPointDisc;
		break;

	case quadType::sphere:
		sample.ori = &sampleSpherePoint;
		sample.dir = &sampleSphereDir;
		sample.nor = &normalFromPointSphere;
		break;
}
for (n rays)
{
	sample.ori(vals);
	sample.dir(vals);
	sample.nor(vals);
}

My sample functions are declared as __device__ and contain a call to an inline function.

Sadly, CUDA complains about a broken PTX file, and it seems that function pointers on the GPU come with quite an overhead.
Part of the error message:
Taking the address of functions is illegal because indirect function calls are illegal. Address of function_Z15sampleDiscPointffPKN12GeometryData13paramQuadDataE is taken and used in function __raygen__therm
Is there a smarter way to switch to the correct functions without having to write a separate loop for each type?

Thank you for your help in advance!
Kind regards,
Martin

The error message says it all. This way of function pointer handling is not supported in OptiX.

If you would like something like that you could use callable functions.
OptiX offers direct callable and continuation callable functions. The difference is that the latter allows calling optixTrace.
Please search the OptiX 7 programming guide and this forum for callable:
https://raytracing-docs.nvidia.com/optix7/guide/index.html

The caveat is that callable functions add, uhm, calling overhead. That means the original switch-case code with inlined functions is most likely faster.

Now, before going into details with direct callable functions, which would be suitable for your case, I’m questioning the partitioning of your geometry sampling into three functions. I see no necessity for that. Normally the data required to sample position, direction and normal are not independent of each other and can be calculated in a single function. (I’m thinking of a light sampling function for example or a surface sample for light baking.)
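To make that concrete, here is a minimal host-compilable sketch of such a combined sampling function for a disc. All type and function names below (DiscData, SurfaceSample, sampleDisc) are hypothetical stand-ins, not OptiX or CUDA API; in device code this would be one __forceinline__ __device__ function using CUDA's float3:

```cpp
#include <cmath>

// Host-side stand-ins; in device code float3 is CUDA's vector type and
// DiscData would be (a subset of) GeometryData::paramQuadData.
struct float3 { float x, y, z; };

struct DiscData          // hypothetical layout for illustration only
{
  float3 center;
  float3 normal;         // unit normal of the disc plane
  float  radius;
};

struct SurfaceSample
{
  float3 ori;            // sampled point on the surface
  float3 dir;            // direction associated with the sample
  float3 nor;            // geometric normal at the sampled point
};

// Position, direction, and normal are computed together, so the basis
// vectors and polar offsets are calculated once and stay in registers.
SurfaceSample sampleDisc(float u0, float u1, const DiscData& d)
{
  // Build an orthonormal basis around the disc normal.
  const float3 n = d.normal;
  const float3 t = (std::fabs(n.x) > 0.5f) ? float3{0.0f, 1.0f, 0.0f}
                                           : float3{1.0f, 0.0f, 0.0f};
  float3 b{n.y * t.z - n.z * t.y,          // b = normalize(cross(n, t))
           n.z * t.x - n.x * t.z,
           n.x * t.y - n.y * t.x};
  const float len = std::sqrt(b.x * b.x + b.y * b.y + b.z * b.z);
  b = float3{b.x / len, b.y / len, b.z / len};
  const float3 tg{b.y * n.z - b.z * n.y,   // tangent = cross(b, n)
                  b.z * n.x - b.x * n.z,
                  b.x * n.y - b.y * n.x};

  // Map the two random numbers to polar coordinates on the disc.
  const float r   = d.radius * std::sqrt(u0);
  const float phi = 2.0f * 3.14159265358979f * u1;
  const float px  = r * std::cos(phi);
  const float py  = r * std::sin(phi);

  SurfaceSample s;
  s.ori = float3{d.center.x + px * tg.x + py * b.x,
                 d.center.y + px * tg.y + py * b.y,
                 d.center.z + px * tg.z + py * b.z};
  s.nor = n;  // a disc has one constant normal, no extra work needed
  s.dir = n;  // placeholder: a real sampler would compute a proper direction
  return s;
}
```

Because origin, direction, and normal come out of one function, none of the intermediate values need to be recomputed or passed through memory.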

I’m using direct callable functions in my OptiX examples for most compact and elegant code, but as said that is not the highest performance option, especially not in your case when calling three functions where one should be enough.

The following links would be an example of generating a light sample point for different light types.
The last one __direct_callable__light_parallelogram would be the most interesting one for you since that samples a custom primitive. (Inside the scene the geometry of that is represented as two triangles though.)
There is no problem adding more light types in the same way.

Explicit light sample functions as direct callables:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/light_sample.cu
optixDirectCall() of those inside the closest_hit program:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/closesthit.cu#L265
Setup on host side (go up from there to see program and pipeline generation)
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/src/Device.cpp#L880

For performance reasons, these light sampling routines do not take the LightSample structure as pointer/reference but return it to let the compiler keep more values in registers when possible.

There are more of these for example to switch between different lens shaders inside the ray generation program:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/nvlink_shared/shaders/raygeneration.cu#L233

Again, I would bite the bullet and implement your case as an inlined function (singular!) per geometric shape first, and then build the same application with direct callable programs and compare performance. Elegant code is not everything.
The results will also depend on what else happens inside your ray generation program, especially how many calls to optixTrace it contains. The goal is to have the fewest optixTrace calls inside the code, at the outermost scope.

Thanks for your reply.
I will use individual loops for each primitive type, as that is the easiest to implement, and then (if time allows) try it with direct callables.

As you pointed out, always using three function pointers might be quite slow. Does this also apply to the case of three inlined functions, as you seem to suggest?
Is it in general good practice to reduce function calls in device code?

I will rewrite the sampling routines into one single function.
You mention returning the LightSample structure for performance reasons in the callables example.
Would this also apply to my case, i.e. returning a struct with three float3 for dir/ori/nor?
I thought this was quite costly.

Kind regards,
Martin

As you pointed out, always using three function pointers might be quite slow.
Does this also apply to the case of three inlined functions, as you seem to suggest?

With inlined functions there will be no calls. All code is folded into the code at the caller.
I always use __forceinline__ __device__ for everything which is not a program. I’ve seen cases where the CUDA compiler did not inline calls when using only __inline__, which is merely a hint. That happened when either the function body or the number of arguments was above some compiler-internal threshold.

Even then, I would not implement three different functions when generating a position, direction, and normal per sample.
Not sure what exactly you’re doing, but let’s say you calculate a sample point on a triangle. You would need the three vertex positions and a 2D random sample which is converted to barycentrics, then interpolate the vertex positions and transform the result into world space if necessary. The direction from or to that point needs the position you just calculated, which is most likely still in registers. And depending on which normal you calculate, you’d need the vertex positions again for the cross product giving the face normal, or the vertex normals and barycentrics to interpolate the shading normal. In either case, most of that data is already in registers after calculating the position, so it makes a lot of sense to reuse it for all three calculations.
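Sketched as host-compilable code (hypothetical names; on the device this would be a single __forceinline__ __device__ function using CUDA's float3), that triangle example could look like:

```cpp
// Host-side stand-in for CUDA's float3 vector type.
struct float3 { float x, y, z; };

static float3 sub(const float3& a, const float3& b)
{
  return float3{a.x - b.x, a.y - b.y, a.z - b.z};
}

static float3 cross3(const float3& a, const float3& b)
{
  return float3{a.y * b.z - a.z * b.y,
                a.z * b.x - a.x * b.z,
                a.x * b.y - a.y * b.x};
}

struct TriSample
{
  float3 pos;  // sampled point on the triangle
  float3 nor;  // unnormalized face normal
};

TriSample sampleTriangle(float u0, float u1,
                         const float3& v0, const float3& v1, const float3& v2)
{
  // Fold the 2D random sample into valid barycentric coordinates.
  if (u0 + u1 > 1.0f)
  {
    u0 = 1.0f - u0;
    u1 = 1.0f - u1;
  }

  // The edges are needed for BOTH the position and the face normal,
  // so computing them once keeps the values in registers.
  const float3 e1 = sub(v1, v0);
  const float3 e2 = sub(v2, v0);

  TriSample s;
  s.pos = float3{v0.x + u0 * e1.x + u1 * e2.x,
                 v0.y + u0 * e1.y + u1 * e2.y,
                 v0.z + u0 * e1.z + u1 * e2.z};
  s.nor = cross3(e1, e2);  // normalize as needed; a shading normal would
                           // reuse the same barycentrics instead
  return s;
}
```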

Well, the compiler will probably figure that out after inlining everything and then there are multiple assembler passes optimizing stuff again, but I try helping the compiler with more optimal code as much as possible to start with.

I don’t know what your functions do exactly to say if any of that makes sense. I simply wouldn’t use three sampling functions per geometric primitive. Makes no sense to me.

Is it in general good practice to reduce function calls in device code?

Think of OptiX program domains (rg, is, ah, ch, ms, ex, dc, cc) as real “calls” (that is an overly simple abstraction of the internal scheduling). Everything else should be __forceinline__ and is not actually a call inside the device code. Other than that, always strive for good code structure in host and device code.

You mention returning the LightSample structure for performance reasons in the callables example.
Would this also apply to my case, i.e. returning a struct with three float3 for dir/ori/nor?
I thought this was quite costly.

If the functions are inlined it doesn’t matter!

When using direct callable programs, I expect that to be faster than, for example, having three references or the struct as arguments.
Example code (replace “origin”, “direction”, and “normal” with the whole code calculating these):

struct MySample
{
  float3 o;
  float3 d;
  float3 n;
};

// These arguments are four 64-bit pointers and the references are allocated by the caller somewhere in global memory.
extern "C" __device__ void __direct_callable__sampleQuad(const GeometryData::paramQuadData* data, float3& o, float3& d, float3& n)
{
   o = origin; // These are all global memory accesses
   d = direction; 
   n = normal;
}

// Only two 64-bit pointers as arguments, but the MySample struct has been allocated by the caller and is also not held in registers.
extern "C" __device__ void __direct_callable__sampleQuad(const GeometryData::paramQuadData* d, MySample& sample)
{
  sample.o = origin; // These are all global memory accesses
  sample.d = direction;
  sample.n = normal;
}

// Only one 64-bit pointer and the local structure is most likely all registers. 
extern "C" __device__ MySample __direct_callable__sampleQuad(const GeometryData::paramQuadData* d)
{
  MySample sample; // This can be held inside registers.

  sample.o = origin; // Everything can be in registers here.
  sample.d = direction;
  sample.n = normal;

  return sample;
}

I’m not sure if the compiler can also keep the return structure in registers for the final assignment of the struct inside the caller. In any case, the number of arguments and potential optimizations of global memory accesses are worth a try.

Another thing to keep in mind when accessing data: there are only vectorized load and store instructions for 2- and 4-component vectors in CUDA. That means the float3 above are handled as three individual floats and therefore have an alignment requirement of only 4 bytes. It’s actually faster to load and store a float4 (alignment requirement 16 bytes) than a float3.
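As a small host-side illustration of that difference (the structs below only mirror the alignment of the CUDA vector types, they are not the CUDA headers):

```cpp
// float3 is just three scalars: 4-byte alignment, three separate loads/stores.
// float4 is 16-byte aligned and can use a single vectorized 128-bit access.
struct float3 { float x, y, z; };                 // alignof == 4,  sizeof == 12
struct alignas(16) float4 { float x, y, z, w; };  // alignof == 16, sizeof == 16
```

Padding a float3 payload out to a float4 costs 4 bytes per element, but lets the hardware use one vectorized access instead of three scalar ones.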