Recursion of a __device__ function.

I’m trying to call a function like this:

__device__ float calculate_NGEN(char* procedure, float x, float y, float z) {
	PROTO_STRUCT* self = (PROTO_STRUCT*)(procedure);
	switch (self->__type) {
	case 3:
		return AddModule_FUNC(procedure, x, y, z);
	case 4:
		return MultiplyModule_FUNC(procedure, x, y, z);
	case 5:
		return ConstantModule_FUNC(procedure, x, y, z);
	case 6:
		return SelectModule_FUNC(procedure, x, y, z);
	default:
		return 0.0f;
	}
}

Each of the *Module_FUNC functions may or may not call calculate_NGEN. I encountered this error while loading the compiled PTX file:

Unknown error (Details: Function “_rtProgramCreateFromPTXString” caught exception: Compile Error: Found recursive call to _Z14calculate_NGENPcfff)

I couldn’t find much about the error in the documentation. Is recursion not allowed at all? If that’s the case, what could be a possible workaround?


Yes, recursion in device code is not supported in OptiX.
The background for that restriction is that recursion makes it impossible to inline the function or to calculate the required stack space up-front.

A possible solution is to use bindless callable programs for that, which requires that you explicitly tell OptiX 6 the maximum possible callable program recursion depth up-front with rtContextSetMaxCallableProgramDepth().

Function tables like in your case can be implemented with buffers of bindless callable program IDs. No need for the switch-case statement; it’s just a direct index into the buffer.
Example code implementing such function tables (not using recursions) can be found in the OptiX Introduction examples for lens shaders, light sampling, BSDF sampling and evaluation.
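To illustrate the idea outside of OptiX, here is a plain C++ sketch. All names here are hypothetical, and a function-pointer table stands in for the buffer of bindless callable program IDs; the point is that the node’s type field becomes a direct index instead of a switch-case:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical header every node struct starts with (mirrors PROTO_STRUCT from the question).
struct ProtoHeader { int type; };

using NodeFunc = float (*)(const char* procedure, float x, float y, float z);

// Stand-ins for the *Module_FUNC implementations.
static float addModule(const char*, float x, float y, float z) { return x + y + z; }
static float constantModule(const char*, float, float, float)  { return 42.0f; }

// The "function table": the node's type field indexes directly into it,
// exactly like indexing a buffer of bindless callable program IDs in OptiX.
static NodeFunc g_table[] = { addModule, constantModule };

static float calculate(const char* procedure, float x, float y, float z) {
    ProtoHeader header;
    std::memcpy(&header, procedure, sizeof(header)); // avoid a potentially misaligned cast
    return g_table[header.type](procedure, x, y, z);
}
```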

Another approach is to store input connections as callable program ID variables at the program scope itself, which requires individual program objects per graph node. That means your “type” becomes a variable on the program scope. This is kind of expensive in the number of program objects.

If that’s about a shader network, using bindless callable programs is the common approach for maximum flexibility, but it’s going to become expensive on the shading side, which in turn will limit the maximum performance as actual ray tracing gets cheaper, as with Turing RTX GPUs. See this thread:

I understand. I’m working on switching my code to bindless programs. But I’m having trouble with the recursion again:

rtBuffer< rtCallableProgramId<float(char* procedure, float x, float y, float z)> > NGEN_CALLABLE_PROGRAMS;

RT_CALLABLE_PROGRAM float calculate_NGEN(char* procedure, float x, float y, float z) {
	PROTO_STRUCT* self = (PROTO_STRUCT*)(procedure);
	float temp = NGEN_CALLABLE_PROGRAMS[0](procedure, x, y, z);
	return 0.0f;
}

This code throws an exception:

Variable not found (Details: Function “_rtContextValidate” caught exception: Variable “Unresolved reference to variable NGEN_CALLABLE_PROGRAMS from _Z14calculate_NGENPcfff” not found in scope)

Index 0 of NGEN_CALLABLE_PROGRAMS is the calculate_NGEN function itself.

When I take out the line computing temp, it proceeds as expected without exceptions.

Another question: can a device function call a bindless program?


The given code excerpt should have had the call computing temp completely removed as dead code.
Did you try with something which does actual work?
Wait, that shouldn’t happen with a non-const pointer argument though.
How did you set that up on the host?

I have not tried that approach myself before. This would require a minimal complete reproducer to analyze.

Also be aware that char pointers have a weaker CUDA alignment requirement than the elements in your PROTO_STRUCT may have, and that a reinterpret_cast can result in illegal accesses when the original data is not always perfectly aligned.

Note that you must have --keep-device-functions or --relocatable-device-code=true set in the NVCC compile options (NVRTC doesn’t support --keep-device-functions), or bindless callable programs are eliminated as dead code since CUDA 8.0, because there is no call to these functions in the resulting PTX code.

Bindless callable programs can be called from anywhere where you have their ID available. In the end it’s called from some OptiX program domain.
Their variable scope is only the program itself and the context.

It’s recommended to qualify any device function which is not an RT_PROGRAM or RT_CALLABLE_PROGRAM with __forceinline__ __device__.
I’m using an RT_FUNCTION define for that to make the code look consistent.
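A minimal sketch of such a define (the device branch mirrors rt_function.h from the OptiX Introduction samples; the host-side fallback is mine, only so the snippet compiles as plain C++):

```cpp
#include <cassert>

// On the device compilation path, force-inline every helper so the
// compiler never emits a PTX call for it.
#if defined(__CUDACC__)
#define RT_FUNCTION __forceinline__ __device__
#else
#define RT_FUNCTION inline // host-side fallback for illustration only
#endif

// Example helper qualified with the define.
RT_FUNCTION float saturate01(float v) {
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}
```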

I set it up like this:

optix::Program calculate_NGEN = context->createProgramFromPTXString(ptx_source, "calculate_NGEN"); // ptx_source: the module’s PTX string
p_space["calculate_NGEN"] = calculate_NGEN;
optix::Buffer buffer = context->createBuffer(RT_BUFFER_INPUT, RT_FORMAT_PROGRAM_ID, 1);

int* buffer_map = (int*) buffer->map(0, RT_BUFFER_MAP_WRITE_DISCARD);
buffer_map[0] = p_space["calculate_NGEN"]->getId();
buffer->unmap();

I do have --keep-device-functions in my arguments; otherwise createProgramFromPTXString fails to find calculate_NGEN and the function is nowhere to be seen in the generated PTX file.

Does this mean that I have to interpret a buffer as it is? I pass a buffer to my closest hit program like this:

rtBuffer<char> color_graph;

I cast this buffer like this:

NGEN_CALLABLE_PROGRAMS[0](&color_graph[0], n_.x, n_.y, n_.z);

Is there a way to ensure the alignment of that buffer? I have around 12 different types of structs that reside in a hierarchy in that char buffer. Each struct has its own function, and they need to execute depending on the hierarchy, which may or may not change. I can’t pass the hierarchy in at compile time. Can I pass an rtBuffer to a function?

But isn’t it rather odd that I get an unresolved reference if that is the case?

I did go through the intro tutorials and I saw the definition in rt_function.h. I use the __device__ qualifier on all of my other functions. But could the unresolved reference be caused by the functions not being inlined?

Also, thank you so much Detlef. Through your other recommendations I got my program working. I am trying to make it even better though, trying to utilize all of the processing power in my card. For reference this is what I’m building:

It’s a procedurally generated planet with an adjustable generation graph. The detail increases as you get closer. With my code above I’m trying to generate the textures on the GPU and set them in a buffer so I can increase the vertex complexity and get better shadows. (I’m not passing a normal buffer as of yet, that’s why the shadows look like triangles.)

Strictly speaking, your buffer handling is not legal in OptiX.

You declare an rtBuffer, take the address of its first element, and then cast it to some struct.

First, OptiX doesn’t actually define pointer arithmetic on buffers. The only legal way to access a single element, of whatever type that is, including user-defined structs, is operator[].

You cannot pass buffer variables around as argument. Only bindless buffer IDs would allow that.
Getting addresses of rtDeclareVariable variables is not allowed either in OptiX. The rtPayload variable is an exception.

This means you’ve been lucky that the CUDA memory allocations are aligned to what you needed and that OptiX still handles some pointer arithmetic on buffers.

The correct approach would be to define a type as a union of all possible structs you’re casting to, and make sure that the alignment of the individual members is correct: e.g. float4 is 16-byte aligned, float2 and pointers are 8-byte aligned, float and float3 are 4-byte aligned, shorts are 2-byte aligned, and chars are byte-aligned.
This means if you order the struct members by alignment from big to small, they will automatically fall onto the proper offsets.
When using more than one struct in that buffer, make sure the whole union is padded to the biggest alignment requirement among its members.
Then use that union type as the buffer’s template type, and each function can access its own structure elements directly by name from the proper fields.
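A plain C++ sketch of that union layout (hypothetical node structs; all members here happen to share 4-byte alignment, so no padding appears, but with mixed alignments you would order members big to small as described):

```cpp
#include <cassert>
#include <cstddef>

// Two hypothetical node variants. Both keep the type tag at offset 0
// so dispatch code can always read it, whichever variant is stored.
struct AddNode      { int type; float lhsWeight; float rhsWeight; };
struct ConstantNode { int type; float value; };

// One buffer element type: every element has the same size and the
// strictest alignment of any variant.
union Node {
    AddNode      add;
    ConstantNode constant;
};

static_assert(offsetof(AddNode, type) == 0 && offsetof(ConstantNode, type) == 0,
              "type tag lives at a common offset in every variant");
static_assert(sizeof(Node) == sizeof(AddNode),
              "the union element is as big as the largest variant");
```

With this as the rtBuffer template type, each function reads its own fields by name instead of casting raw char pointers.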

Similar to what I’m doing with MaterialParameters in the OptiX Introduction examples.
If I didn’t modulate the albedo with the texture there, I could also simply have used the parMaterialIndex as an argument to the sysSampleBSDF and sysEvalBSDF bindless callable programs later and let them fetch their individual parameters. Then each of them could have a completely different layout in a structured buffer element, just not a different element size.

Similarly, your hierarchy of functions could just be indices into that global buffer you’d pass around, and each bindless callable can fetch its own parameters. That’s also faster, because an int argument is only 4 bytes vs. 8 bytes for a pointer.
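A plain C++ sketch of that index-based parameter fetch (hypothetical parameter struct; a std::vector stands in for the structured rtBuffer):

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-node parameter block living in one structured buffer.
struct NodeParams {
    int   type;
    float value;
};

// Stands in for an rtBuffer<NodeParams> visible to every callable.
static std::vector<NodeParams> g_paramBuffer;

// Each callable receives only a 4-byte element index and fetches its own
// parameters from the shared buffer, instead of receiving an 8-byte
// pointer into a raw char buffer.
static float evalConstant(int nodeIndex) {
    const NodeParams& p = g_paramBuffer[nodeIndex];
    return p.value;
}
```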

I’ve seen cases in the past where even an inline-defined device function wasn’t inlined by the CUDA compiler because the number of arguments was too high or the function body was too big. inline is just a hint.
That would have been compiled to an actual PTX call instruction, and initial OptiX versions didn’t handle that at all and just bailed out. Since then I have always used __forceinline__ and never looked back.
OptiX has gotten a lot smarter about that and inlines almost everything itself nowadays.