Callable program error on some GPUs

I have a callable program that I call from my miss program. On my Tesla K40, it works as expected. However, on my Quadro K4000, I get an RT_EXCEPTION_PROGRAM_ID_INVALID error. I did some digging and it seems that while my buffer of program IDs is populated with non-zero values on the CPU when the buffer is unmapped, by the time it gets to the miss program, the value in the buffer has become zero. It’s as if the buffer’s memory had been overwritten.

I first encountered this problem when I started working with a model that requires much more per-ray memory in OptiX. It doesn’t happen with my other models. I’m aware that some of this memory must be spilling into global memory, but shouldn’t the buffer of program IDs be protected from this?

I just upgraded from OptiX 3.7 to 3.8 beta, and updated my drivers to the latest versions for both machines, and the problem persists.

Here are the specs for the machine it doesn’t work on:
Quadro K4000, Windows 7, CUDA 6.5, OptiX 3.8 beta, driver 347.88
And the machine on which it works just fine:
2x Tesla K40, Windows 7, CUDA 6.5, OptiX 3.8 beta, driver 341.44

How do you implement different per ray payload sizes per model? There is no dynamic allocation possible.
I’m assuming you’re allocating a fixed amount and some models fill more data in a fixed sized array or the like.

What’s the size of your per ray payload in the failing case?
What stack size did you set?

Could there simply be an issue with the required memory if this only fails with models which need more data?
The Quadro K4000 has “only” 3GB where the Tesla K40 has 12GB.

Could you make sure that your program never writes out of bounds or runs into exceptions before this happens?
To do that, enable all OptiX exceptions. By default only the stack overflow exception is enabled.
Then you could add user exceptions that throw when your per-ray data access would read or write memory out of bounds, e.g. by checking all array indices. It works like an assert: you implement it with a user-defined exception in OptiX and use different values to identify the root cause, which can be decoded and printed inside the exception program.
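To make the "different values to identify the root cause" idea concrete, here is a minimal host-side C++ sketch of such a code scheme. Everything in it is illustrative: `kUserExceptionBase` is only a stand-in for `RT_EXCEPTION_USER` (take the real constant from the OptiX headers), and the three error names are hypothetical assertion sites, not part of OptiX.

```cpp
// Stand-in for RT_EXCEPTION_USER. The value is an assumption for this sketch;
// use the constant from the OptiX headers in real code.
constexpr unsigned int kUserExceptionBase = 0x400;

// Hypothetical application-specific causes, one per assertion site.
enum UserError : unsigned int {
  kScratchIndexOutOfBounds = 1,
  kPayloadIndexOutOfBounds = 2,
  kProgramIdIsZero         = 3,
};

// Builds the value that device code would pass to rtThrow().
constexpr unsigned int makeCode(UserError e) { return kUserExceptionBase + e; }

// True if the code lies in the user-defined range rather than the built-in one.
constexpr bool isUserCode(unsigned int code) { return code >= kUserExceptionBase; }

// Maps a code back to a readable cause, as an exception program could do
// before printing.
constexpr const char* describe(unsigned int code) {
  return !isUserCode(code)                           ? "built-in OptiX exception"
       : code == makeCode(kScratchIndexOutOfBounds)  ? "scratch index out of bounds"
       : code == makeCode(kPayloadIndexOutOfBounds)  ? "payload index out of bounds"
       : code == makeCode(kProgramIdIsZero)          ? "program ID is zero"
                                                     : "unknown user exception";
}
```

With a scheme like this, a check in the miss program that the program ID read from the buffer is non-zero could throw its own code just before the callable program is invoked, which would separate "the ID was already zero" from other failures.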

If this doesn't reveal any defects, you could install the same 341.44 driver on the Quadro K4000 as well, to find out whether the driver difference is the cause.
It can be found here http://www.nvidia.com/Download/Find.aspx?lang=en-us

I originally implemented a large fixed-sized array in the per-ray data. However, I’ve since changed to allocating scratch space in an rtBuffer (see Variable length ray payload). This leads to better performance on my K40, but granted, it doesn’t actually reduce my memory requirement.

Edit: I should clarify that the problem started when I introduced the large fixed-sized array, but I also introduced a new set of input data at the same time, which required the extra memory. The problem persists with that set of input data now that I've switched to allocating scratch space.

The per-ray payload was 636 bytes; it is now 44 bytes after the change mentioned above.

The stack size was 16384 bytes; it is now 4096 bytes after the change mentioned above.

I think I'm well under the 3GB memory limit of the Quadro K4000, but I will run some profiling to check this.

I have all exceptions enabled. How can I tell when I’m out of bounds (other than being outside of my scratch rtBuffer)?

Previously both machines had driver 340 installed, but I updated both to the latest to see if that would fix the problem.

I ran Nsight to trace the application. On my Quadro K4000, the Device Memory Allocated is 1E+08 (I assume bytes). On the machine with 2x Tesla K40, it is just under 5E+07. These seem to be within the 3GB limit.
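As a quick sanity check of those numbers, the arithmetic can be written out as below. This is only a sketch: it assumes the Nsight figures really are bytes, as guessed above, and it treats the 3GB board as 3 GiB. All names are made up for the illustration.

```cpp
// Unit conversion helpers for the Nsight figures quoted above.
constexpr double kBytesPerMiB = 1024.0 * 1024.0;
constexpr double kBytesPerGiB = 1024.0 * kBytesPerMiB;

constexpr double toMiB(double bytes) { return bytes / kBytesPerMiB; }

// Fraction of the board memory that the reported allocation occupies.
constexpr double fractionOfBoard(double allocatedBytes, double boardBytes) {
  return allocatedBytes / boardBytes;
}

// Figures from the thread (assumed to be bytes).
constexpr double kQuadroAllocatedBytes = 1e8;  // "1E+08" on the Quadro K4000
constexpr double kTeslaAllocatedBytes  = 5e7;  // "just under 5E+07" on the K40 machine
constexpr double kQuadroBoardBytes     = 3.0 * kBytesPerGiB;
constexpr double kTeslaBoardBytes      = 12.0 * kBytesPerGiB;
```

Under these assumptions the Quadro figure comes to roughly 95 MiB, about 3% of the board, and the Tesla figure to well under 1%. That only says the reported allocation is small relative to the board; it does not by itself rule out a memory-related cause.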

By asserting manually that you’re inside the arrays and structs you’re writing and reading.

// Check addresses or indices for out-of-bounds access
// (your_data_end is taken here to be the last valid address).
if (address < your_data_begin || your_data_end < address)
{
  rtThrow(RT_EXCEPTION_USER + 1); // Your exception code for an out-of-bounds access.
}

// Exception program: prints the code that was thrown and where.
rtDeclareVariable(uint2, launchIndex, rtLaunchIndex, );

RT_PROGRAM void exception()
{
  const unsigned int code = rtGetExceptionCode();
  rtPrintf("Exception code 0x%X at (%u, %u)\n", code, launchIndex.x, launchIndex.y); // DEBUG ONLY
}
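For scratch space that lives in one shared rtBuffer, an access can stay inside the buffer and still land in another ray's region, which a whole-buffer bounds check will not flag. The host-side C++ sketch below illustrates the per-ray check. It assumes a layout of one fixed-size slice per launch index, which may differ from the actual scheme; all names are hypothetical, and in device code each failure case would become its own `rtThrow(RT_EXCEPTION_USER + n)`.

```cpp
#include <cstddef>

// Outcome of validating one scratch access.
enum class Access { Ok, OutsideSlice, OutsideBuffer };

// First element of the slice owned by the ray at (x, y), assuming one
// fixed-size slice per launch index in row-major order.
constexpr std::size_t sliceBegin(std::size_t x, std::size_t y,
                                 std::size_t launchWidth,
                                 std::size_t slotsPerRay) {
  return (y * launchWidth + x) * slotsPerRay;
}

// Validates access number `local` within the ray's own slice.
// The slice check comes first: an index past the slice may still be inside
// the buffer, where it would silently touch another ray's data.
constexpr Access checkAccess(std::size_t x, std::size_t y,
                             std::size_t launchWidth,
                             std::size_t slotsPerRay,
                             std::size_t bufferSize,
                             std::size_t local) {
  return local >= slotsPerRay
             ? Access::OutsideSlice
             : sliceBegin(x, y, launchWidth, slotsPerRay) + local >= bufferSize
                   ? Access::OutsideBuffer
                   : Access::Ok;
}
```

For example, with 8 slots per ray and a 32-element buffer, local index 8 for the ray at (1, 0) maps to element 16, which is inside the buffer but belongs to the next ray, so it is reported as `OutsideSlice`.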

Other than that, there is not much to analyze with the given information.
If things work with smaller models and fail with bigger models, and only on one configuration, I would start pruning the reproducer down to the absolute minimum (model size and program code) and see whether any error shows up along the way.