CUDA kernel within OptiX programs

OptiX version: 7.3.0

I have a closest hit program as follows:

extern "C" __global__ void __closesthit__radiance() {
    // lot of matrix multiplication and addition which can benefit from parallel execution

    // compute bsdf and set radiance
    prd->radiance += light.emission * weight;
}

I have written the following functions, which perform the matrix operations in a brute-force way (nested for loops). With them, rendering a single frame with the optixPathTracer example from the OptiX 7.3 SDK takes around 20 seconds. For reference, the starter code without these matrix operations takes 50 ms per frame.

static __forceinline__ __device__ Mat mat_add(Mat A, Mat B);
static __forceinline__ __device__ Mat mat_mul(Mat A, Mat B);

To improve on this, I tried launching the following “nested” CUDA kernel inside the closest hit (CH) program, but that gave the same error as in this post.

mat_add_kernel<<<numBlocks, threadsPerBlock>>>(A, B, C);

I also read about the callables mentioned in the same post and was wondering whether they can help with this use case. I assume they will not, because they cannot facilitate efficient matrix operations. I am still learning OptiX, so please correct me if I am mistaken about something.

You simply cannot launch native CUDA kernels inside OptiX device code.

An OptiX launch already executes a kernel. The arguments to the launch call define the dimensions of that kernel (the number of threads), and all scheduling is completely internal to OptiX.

Think of a ray generation program that does not shoot any rays, meaning it never calls optixTrace(), as a CUDA kernel, though with some limitations on the available CUDA features, such as shared memory or specific synchronization calls.
Explained here: https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#programming-model

OptiX uses a single-ray programming model. You can of course multiply matrices inside OptiX device code, and that work runs in parallel across threads according to the launch dimensions.

What I think happened in your case, resulting in 20 seconds per frame, is that every thread multiplied all of the matrices instead of each thread (that is, each launch index) handling only its own matrix.
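To illustrate the difference, here is a minimal sketch of per-thread matrix work. The 4x4 row-major layout of Mat, the helper per_thread_work, and the idea of indexing the inputs by a linearized launch index are all assumptions made for this example; the original Mat type and data layout are not shown in the question. The qualifier shim at the top only exists so the same code compiles with nvcc as device code and with a plain C++ compiler for a quick CPU check.

```cpp
// Shim: under nvcc these are the real CUDA qualifiers; in a plain C++
// compiler they expand to nothing so the math can be checked on the CPU.
#ifndef __CUDACC__
#define __device__
#define __host__
#define __forceinline__ inline
#endif

// Hypothetical 4x4 row-major matrix (the Mat type in the question is not shown).
struct Mat {
    float m[4][4];
};

static __forceinline__ __host__ __device__ Mat mat_add(const Mat& A, const Mat& B)
{
    Mat C;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            C.m[r][c] = A.m[r][c] + B.m[r][c];
    return C;
}

static __forceinline__ __host__ __device__ Mat mat_mul(const Mat& A, const Mat& B)
{
    Mat C;
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c) {
            float sum = 0.0f;
            for (int k = 0; k < 4; ++k)
                sum += A.m[r][k] * B.m[k][c];
            C.m[r][c] = sum;
        }
    return C;
}

// Work for ONE launch index: touch only the matrices that belong to this
// thread, never loop over the whole array. Computes A[i] * B[i] + A[i].
// (Hypothetical helper; linearIndex is assumed to be derived from the
// launch index inside the OptiX program.)
static __forceinline__ __host__ __device__ Mat per_thread_work(const Mat* A,
                                                               const Mat* B,
                                                               unsigned int linearIndex)
{
    return mat_add(mat_mul(A[linearIndex], B[linearIndex]), A[linearIndex]);
}
```

Inside an OptiX program, linearIndex would typically be built from optixGetLaunchIndex() and optixGetLaunchDimensions(), so that each thread processes only its own entry. Passing the matrices by const reference rather than by value also avoids copying 16 floats per argument.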

It’s unclear what matrices you’re trying to apply there.

Usually you’d need to transform object coordinates to world coordinates inside the closest hit program. That is always required when there are non-identity transformations in the transform list above a hit primitive, that is, when object space is not the same as world space. This is the case when using instance transforms or motion blur transforms.

Examples that show how to transform the vertex attributes from object to world space can be found here, for example.
The first example does that hardcoded for the single-level instance case (only one level of transformations in the scene, an IAS -> GAS structure):
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_runtime/shaders/closesthit.cu#L150
The second example shows the fully general case, since it uses additional transforms for the motion blur matrices inside the transformation hierarchy:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/intro_motion_blur/shaders/closesthit.cu#L69

There are similar examples inside the OptiX SDK you can find when searching for the word transform inside the examples’ *.cu sources.

You can see how the matrices from the transformation hierarchy are concatenated or applied when looking into the OptiX header which contains the respective helper functions inside optix_7_device_impl_transformations.h.
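As a rough illustration of what applying such a transform involves, the sketch below transforms a point and a normal with row-major 3x4 matrices stored as float[12], which is the layout OptiX uses for its transforms. Vec3 and the two helper names are hypothetical stand-ins chosen so the example compiles without the CUDA or OptiX headers. In real device code the matrices would typically come from optixGetObjectToWorldTransformMatrix() and optixGetWorldToObjectTransformMatrix(), or you would call the built-in helpers optixTransformPointFromObjectToWorldSpace() and optixTransformNormalFromObjectToWorldSpace() directly.

```cpp
// Shim so the sketch compiles with nvcc (device code) and plain C++ (CPU check).
#ifndef __CUDACC__
#define __device__
#define __host__
#define __forceinline__ inline
#endif

// Hypothetical stand-in for CUDA's float3.
struct Vec3 {
    float x, y, z;
};

// Transform a point with a row-major 3x4 object-to-world matrix.
// The fourth column (m[3], m[7], m[11]) is the translation.
static __forceinline__ __host__ __device__ Vec3 transform_point(const float m[12], Vec3 p)
{
    Vec3 r;
    r.x = m[0] * p.x + m[1] * p.y + m[2]  * p.z + m[3];
    r.y = m[4] * p.x + m[5] * p.y + m[6]  * p.z + m[7];
    r.z = m[8] * p.x + m[9] * p.y + m[10] * p.z + m[11];
    return r;
}

// Transform a normal from object to world space. Normals use the transpose
// of the INVERSE matrix (world-to-object) and ignore the translation.
// The result is not renormalized here; the caller is expected to do that.
static __forceinline__ __host__ __device__ Vec3 transform_normal(const float inv[12], Vec3 n)
{
    Vec3 r;
    r.x = inv[0] * n.x + inv[4] * n.y + inv[8]  * n.z;
    r.y = inv[1] * n.x + inv[5] * n.y + inv[9]  * n.z;
    r.z = inv[2] * n.x + inv[6] * n.y + inv[10] * n.z;
    return r;
}
```

This is a handful of multiply-adds per thread, which is why such transforms are done inline in the closest hit program rather than in a separate kernel.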

Please keep reading through the OptiX 7.4 Programming Guide and API Reference:
https://raytracing-docs.nvidia.com/optix7/index.html

Thank you.
