As you pointed out, always using three function pointers might be quite slow.
Does this also apply for the case with three inlined functions, as you seem to suggest?
With inlined functions there will be no calls; all code is folded into the caller's code.
I always use __forceinline__ __device__ for everything which is not a program. I've seen cases where the CUDA compiler did not inline calls when using only __inline__, which is just a hint. That happened when either the function body or the number of arguments exceeded some compiler-internal threshold.
Even then, I would not implement three different functions when generating a position, direction, and normal per sample.
Not sure what exactly you're doing, but let's say you calculate a sample point on a triangle. You would need the three vertex positions and a 2D random unit sample which gets converted to barycentric coordinates, then you interpolate the vertex positions and transform the result into world space if necessary. The direction from or to that sample needs that position, which you just calculated and which is most likely still in registers. Depending on which normal you calculate, you'd need the vertex positions again for the cross product giving the face normal, or the vertex normals plus the barycentrics to interpolate the shading normal. In either case most of that data is already in registers from calculating the position, so it makes a lot of sense to reuse it for all three calculations.
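A host-side C++ sketch of that single-function approach (hypothetical helper and type names; float3 here is a stand-in for CUDA's vector type, and on the device this would be one __forceinline__ __device__ function):

```cpp
#include <cmath>
#include <cassert>

// Minimal float3 stand-in so the sketch compiles on the host;
// on the device you would use CUDA's float3 and helper operators.
struct float3 { float x, y, z; };

static float3 sub(float3 a, float3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static float3 cross(float3 a, float3 b)
{
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}
static float3 normalize(float3 v)
{
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    return {v.x / len, v.y / len, v.z / len};
}

struct TriangleSample { float3 position, direction, normal; };

// One function computes position, direction, and face normal per sample.
// v0..v2 are the triangle vertices, u0/u1 the 2D random unit sample,
// reference is the point the direction should point away from.
TriangleSample sampleTriangle(float3 v0, float3 v1, float3 v2,
                              float u0, float u1, float3 reference)
{
    // Map the uniform 2D sample to barycentrics (uniform over the triangle).
    if (u0 + u1 > 1.0f) { u0 = 1.0f - u0; u1 = 1.0f - u1; }
    const float alpha = 1.0f - u0 - u1;

    TriangleSample s;
    // Interpolate the vertex positions; the vertices stay live in registers...
    s.position = { alpha * v0.x + u0 * v1.x + u1 * v2.x,
                   alpha * v0.y + u0 * v1.y + u1 * v2.y,
                   alpha * v0.z + u0 * v1.z + u1 * v2.z };
    // ...and are reused immediately for the face normal (cross product of two edges).
    s.normal = normalize(cross(sub(v1, v0), sub(v2, v0)));
    // The direction reuses the position that was just computed and is still in registers.
    s.direction = normalize(sub(s.position, reference));
    return s;
}
```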
Well, the compiler will probably figure that out after inlining everything, and there are multiple assembler passes optimizing things again afterwards, but I try to help the compiler by starting with more optimal code as much as possible.
I don’t know what your functions do exactly, so I can’t say whether any of that applies. I simply wouldn’t use three sampling functions per geometric primitive; that makes no sense to me.
Is this a generally good practice, to reduce function calls in device code?
Think of OptiX program domains (rg, is, ah, ch, ms, ex, dc, cc) as the real “calls” (that is an overly simple abstraction of the internal scheduling). Everything else should be __forceinline__ and is then not actually a call inside the device code. Other than that, always strive for good code structure in host and device code.
You mention returning the LightSample structure due to performance reasons in the callables example.
Would this also apply to my case, so I would return a struct with three float3 for dir/ori/nor?
Thought this was quite costly.
If the functions are inlined it doesn’t matter!
When using direct callable programs, I would expect that to be faster than, for example, passing three references or the struct as arguments.
Example code (replace “origin”, “direction”, and “normal” with the whole code calculating these):
// These arguments are four 64-bit pointers and the references are allocated by the caller somewhere in global memory.
extern "C" __device__ void __direct_callable__sampleQuad(const GeometryData::paramQuadData* data, float3& o, float3& d, float3& n)
{
  o = origin;    // These are all global memory accesses.
  d = direction;
  n = normal;
}
// Only two 64-bit pointers as arguments, but the MySample struct has been allocated by the caller and is also not held in registers.
extern "C" __device__ void __direct_callable__sampleQuad(const GeometryData::paramQuadData* data, MySample& sample)
{
  sample.o = origin;    // These are all global memory accesses.
  sample.d = direction;
  sample.n = normal;
}
// Only one 64-bit pointer as argument, and the local structure is most likely all registers.
extern "C" __device__ MySample __direct_callable__sampleQuad(const GeometryData::paramQuadData* data)
{
  MySample sample;      // This can be held inside registers.
  sample.o = origin;    // Everything can be in registers here.
  sample.d = direction;
  sample.n = normal;
  return sample;
}
I’m not sure if the compiler can also keep the returned structure in registers for the final assignment of the struct inside the caller. In any case, the reduced number of arguments and the potential to optimize away global memory accesses are worth a try.
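As a host-side C++ sketch of that return-by-value pattern (with hypothetical constant values standing in for the real origin/direction/normal computation), the struct never needs a caller-provided address, so nothing forces it into memory:

```cpp
#include <cassert>

struct float3   { float x, y, z; };
struct MySample { float3 o, d, n; };

// Return by value: the struct can live entirely in registers, and
// (named) return value optimization lets the compiler construct it
// directly in the caller's result slot without a copy.
MySample sampleQuad()
{
    MySample sample;
    sample.o = {0.0f, 0.0f, 0.0f}; // origin    (placeholder values)
    sample.d = {0.0f, 0.0f, 1.0f}; // direction (placeholder values)
    sample.n = {0.0f, 1.0f, 0.0f}; // normal    (placeholder values)
    return sample;
}
```

The caller then simply writes `MySample sample = sampleQuad();` and reads the members directly.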
Another thing to keep in mind when accessing data is that there are only vectorized load and store instructions for 2- and 4-component vectors in CUDA. That means the float3 above is handled as three individual floats and therefore has an alignment requirement of only 4 bytes. It’s actually faster to load and store a float4 (alignment requirement 16 bytes) than to load and store a float3.
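The alignment difference can be illustrated with host-side stand-ins (hypothetical names mirroring CUDA’s float3 and float4, where float4 is declared with 16-byte alignment):

```cpp
// float3 only has the alignment of a single float, so loads and
// stores become three scalar 32-bit accesses.
struct MyFloat3 { float x, y, z; };

// CUDA's float4 is 16-byte aligned, which is what allows the
// compiler to emit a single vectorized 128-bit load/store.
struct alignas(16) MyFloat4 { float x, y, z, w; };
```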