Thanks for the reproducer project. I could reproduce the launch failure with the debug target and filed a bug report about it.
The OptiX validation mode reported some warnings that you can easily fix in your source code, but fixing them didn’t change the behavior.
There were warnings about inlined functions which can be solved by adding the respective __forceinline__ define:
Add CUDA_INLINE to void decodeHitPoint().
Add RT_INLINE to all member functions inside class DirectCallableProgramID and class ContinuationCallableProgramID.
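As a rough sketch of that pattern: the macro definitions below are only an assumption about what CUDA_INLINE and RT_INLINE expand to, so check your own headers for the real ones. The function and class are placeholders, not your decodeHitPoint() or callable program ID classes.

```cpp
// Assumed definitions for illustration; your project's macros may differ.
#if defined(__CUDACC__)
#  define CUDA_INLINE __forceinline__ __device__
#  define RT_INLINE   __forceinline__ __device__
#else
// Host-only fallback so the snippet also builds with a plain C++ compiler.
#  define CUDA_INLINE inline
#  define RT_INLINE   inline
#endif

// Placeholder free function marked with the inline macro.
CUDA_INLINE float scaleByTwo(float x)
{
    return 2.0f * x;
}

// Placeholder class: every member function carries the inline macro.
class CallableIdExample
{
public:
    RT_INLINE CallableIdExample(int id) : m_id(id) {}
    RT_INLINE int getId() const { return m_id; }

private:
    int m_id;
};
```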
There was a warning about double precision computations inside optix_kernels.cu, caused by the double-precision M_PI definition.
Changing that to #define M_PIf 3.14159265f and using M_PIf instead keeps those computations in single precision, which will be faster.
There was also a “Warning: Could not instrument function __exception__print for debugging. No suitable insertion point found.” which can be resolved by uncommenting your user-defined exception program for debug targets.