extern "C" __global__ void __closesthit__radiance() {
    // a lot of matrix multiplication and addition which could benefit from parallel execution
    // compute bsdf and set radiance
    prd->radiance += light.emission * weight;
}

I have written the following functions, which perform the matrix operations in a brute-force way (nested for loops). With them, rendering a single frame with the optixPathTracer example from the OptiX 7.3 SDK takes around 20 seconds. For reference, the starter code without these matrix operations takes 50 ms per frame.

static __forceinline__ __device__ Mat mat_add(Mat A, Mat B)
static __forceinline__ __device__ Mat mat_mul(Mat A, Mat B)

To improve upon this, I tried launching the following “nested” CUDA kernel within the CH program, but that gave the same error as in this post.

I also read about the callables mentioned in the same post and was wondering whether they can help with this use case. I am assuming callables will not help, because they cannot facilitate efficient matrix operations. I am still learning OptiX, so please correct me if I am mistaken about something.

You simply cannot launch native CUDA kernels inside OptiX device code.

An OptiX launch already executes a kernel. The arguments to the launch call define the dimensions of that kernel (the number of threads), and all scheduling is completely internal to OptiX.
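To make the launch model a bit more concrete, here is a minimal sketch of a ray generation program in which every thread identifies itself by its launch index. The Params struct, the variable name params, and the program name are assumptions for illustration only; the launch parameter name has to match whatever the host code sets in the pipeline compile options.

```cuda
#include <optix.h>

// Hypothetical launch parameters (not from the original post).
struct Params
{
    float* results;  // device buffer with one slot per launch index
};

extern "C" {
__constant__ Params params;
}

extern "C" __global__ void __raygen__per_thread_work()
{
    // One thread exists per launch index; the width, height and depth passed
    // to optixLaunch on the host determine how many threads there are.
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    // Each thread writes only to its own slot.
    const unsigned int linear = idx.y * dim.x + idx.x;
    params.results[linear] = static_cast<float>(linear);
}
```

Any per-ray work, including small matrix products, runs inside such a thread. There is no place to start a second kernel from here.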

OptiX uses a single-ray programming model. You can of course multiply matrices inside OptiX device code, and the work will happen in parallel according to the launch dimensions.

What I think happened in your case, resulting in 20 seconds per frame, is that you multiplied all matrices in every thread instead of handling each matrix in one thread (that is, per launch index).
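As a rough sketch of what per-thread matrix helpers could look like: the Mat type from the question is not shown, so the fixed-size Mat4 below is a hypothetical stand-in, and the MAT_HD macro is only there so that the same code also compiles as plain C++ on the host. Each call is executed entirely by the calling thread, and the arguments are passed by const reference to avoid copying the matrices on every call.

```cuda
// Hypothetical helper macro: device qualifiers under nvcc, plain inline otherwise.
#if defined(__CUDACC__)
#define MAT_HD static __forceinline__ __host__ __device__
#else
#define MAT_HD static inline
#endif

// Hypothetical 4x4 row-major matrix; the original Mat definition is not shown.
struct Mat4
{
    float m[4][4];
};

// C = A + B, computed by the calling thread only.
MAT_HD Mat4 mat_add(const Mat4& A, const Mat4& B)
{
    Mat4 C;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            C.m[r][c] = A.m[r][c] + B.m[r][c];
    return C;
}

// C = A * B, computed by the calling thread only.
MAT_HD Mat4 mat_mul(const Mat4& A, const Mat4& B)
{
    Mat4 C;
    for (int r = 0; r < 4; ++r)
    {
        for (int c = 0; c < 4; ++c)
        {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += A.m[r][k] * B.m[k][c];
            C.m[r][c] = sum;
        }
    }
    return C;
}
```

With fixed, small dimensions the compiler can unroll these loops. If the matrices are the same for every ray, it is probably cheaper to compute the products once on the host and pass the results in through the launch parameters.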

It’s unclear what matrices you’re trying to apply there.

Usually you’d need to transform object coordinates to world coordinates inside the closest hit program. That is always required when there are any non-identity transformations in the transform list above a hit primitive, that is, when object space is not the same as world space. This is the case when using instance transforms or motion blur transforms.
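For that common case you do not need your own matrix code at all. The sketch below uses the OptiX device function optixTransformNormalFromObjectToWorldSpace, which applies the current transform list for the hit. The program name, the hard-coded object-space normal, and the use of three payload registers are assumptions for illustration; a real program would read the normal from its vertex attributes, and the pipeline must be configured with at least three payload values.

```cuda
#include <optix.h>

extern "C" __global__ void __closesthit__world_space_normal()
{
    // Hypothetical object-space normal; normally fetched from vertex data
    // reachable through the SBT record or the launch parameters.
    const float3 n_obj = make_float3(0.0f, 0.0f, 1.0f);

    // Applies the transform list above the hit primitive
    // (instance and motion transforms).
    float3 n_world = optixTransformNormalFromObjectToWorldSpace(n_obj);

    // The transformed normal is not necessarily unit length, so renormalize.
    const float inv_len = rsqrtf(n_world.x * n_world.x +
                                 n_world.y * n_world.y +
                                 n_world.z * n_world.z);
    n_world.x *= inv_len;
    n_world.y *= inv_len;
    n_world.z *= inv_len;

    // Return the result to the caller of optixTrace through payload registers.
    optixSetPayload_0(__float_as_uint(n_world.x));
    optixSetPayload_1(__float_as_uint(n_world.y));
    optixSetPayload_2(__float_as_uint(n_world.z));
}
```

Points and direction vectors have matching helpers, optixTransformPointFromObjectToWorldSpace and optixTransformVectorFromObjectToWorldSpace.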

There are similar examples inside the OptiX SDK, which you can find by searching for the word transform inside the examples’ *.cu sources.

You can see how the matrices from the transformation hierarchy are concatenated or applied by looking into the OptiX header that contains the respective helper functions, optix_7_device_impl_transformations.h.