How to write from closesthit() to a device buffer

I don’t know whether my problem is more related to OptiX 7.0 or to CUDA.
If my question is more related to CUDA, I will look in their forum for an answer.
(I am sadly only an engineer and not a computer scientist, so I hope my question isn’t too trivial.)

I am trying to use OptiX for thermal simulation and want to track how much “energy” (represented by rays hitting it) reaches each triangle.
Later I will also include reflections (with a different “ray energy”); that’s why I set up a float buffer with the size of my triangle count:

CUdeviceptr d_hitcounter = 0;
size_t float_size = sizeof(float);
// one float per triangle, zero-initialized
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_hitcounter), float_size * TRIANGLE_COUNT));
CUDA_CHECK(cudaMemset(reinterpret_cast<void*>(d_hitcounter), 0, float_size * TRIANGLE_COUNT));

Since Nsight doesn’t seem to work for OptiX 7 yet, I don’t know if I set up the buffer correctly.

Now I want my ch program to write the “ray energy” to the d_hitcounter pointer + triangle index. Specifically, how do I pass the pointer to the ch function?
Do I have to pass the pointer to ch() as part of the ray payload? I couldn’t find anything in the documentation, but maybe I have been looking in the wrong place.

Also, how can I prevent two OptiX shaders from writing simultaneously to the same address? Is it efficient to use atomicAdd() to prevent this, and will it work within OptiX? Since this addition has to be performed by every ray with an intersection, I want to make sure it is performed in the fastest way possible.

Additionally, am I correct to assume that it’s easiest to assign the ray energy as a ray payload?

Thank you very much for your time and help in advance!
I am looking forward to your answer.

Hi martinhwehr,

I recommend studying the OptiX 7 example called “optixRaycasting”. This sample is structured to write the ray tracing results to a buffer, which is then processed by a separate CUDA kernel.

This should give you some ideas of how to handle your payload and the mechanics of writing to a buffer. If it doesn’t answer your questions, please write back and we can offer more guidance.

Nsight Compute, Nsight Systems, and Nsight VSE (on Windows) should all work reasonably well with OptiX 7, as long as you’re using a very recent driver. cuda-gdb works on Linux, though it’s not as well supported as Nsight.

You can use atomics to prevent two threads from writing to the same memory address. The best advice for performance is to try hard to avoid needing atomics, but you can use them if you need to. There is a bit more information about what is allowed here:

https://raytracing-docs.nvidia.com/optix7/guide/index.html#program_pipeline_creation#program-input

“The NVIDIA OptiX programming model supports a Multiple Instruction, Multiple Data (MIMD) subset of CUDA. Execution must be assumed to be independent of other threads. Thus, shared memory usage and warp-wide or block-wide synchronization—such as barriers—are not allowed in the input PTX code. Apart from these constraints, all GPU instructions are allowed including math, texture, atomic operations, control flow, and memory loads/stores. Special warp-wide instructions like vote and ballot are allowed, but can yield unexpected results as the locality of threads is not guaranteed and neighboring threads can change during execution, unlike in the full CUDA programming model. Still, warp-wide instructions can be used safely if the algorithm in question is independent of locality, for example, by implementing warp-aggregated atomic adds.”


David.

Hi David,

thank you for your answer!
After checking out the optixRaycasting example, I saw that the params structure can be accessed from the ch program.
I modified the params struct to hold a uint64_t called p_hitcounter.

CUdeviceptr d_hitcounter = 0;
size_t float_size = sizeof(float);
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_hitcounter),float_size*TRIANGLE_COUNT));
CUDA_CHECK(cudaMemset(reinterpret_cast<void*>(d_hitcounter), 0, float_size*TRIANGLE_COUNT));
	
state.params.p_hitcounter = d_hitcounter;

I was then able to print the buffer address from within the ch program. Is this the way it’s supposed to be done?
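
In code, this is roughly what it looks like now (a minimal sketch; I have left out all the other Params fields):

// Shared header between host and device (sketch):
struct Params
{
    // ... other fields ...
    uint64_t p_hitcounter;  // address of the per-triangle energy buffer
};

// In the shader .cu file:
extern "C" __constant__ Params params;

extern "C" __global__ void __closesthit__radiance()
{
    // just checking that the pointer arrives on the device
    printf("hitcounter buffer at %llu\n", static_cast<unsigned long long>(params.p_hitcounter));
}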

As for the use of Nsight (I am using VS19): I am still running on the 436.48 driver. I will update to the latest version and see if I am able to read out my device memory.

Thank you for your help!
Martin

There’s nothing wrong with putting your buffer pointer into your launch params; that is the preferred way to do it if you only have a few buffer pointers.

You can even put small buffers directly in the launch params, but note that the launch params are limited to a total of 64 KB, so no large data will fit in there.

The alternative to putting your pointers in launch params is to put the pointers you need into your hit record, which goes in the shader binding table. You would do that if you have too many pointers to fit into the launch params.
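
For example, something along these lines (just a sketch with made-up names; see optixGetSbtDataPointer() in the API reference):

// Per-hit-group data stored behind the SBT record header (names are made up):
struct HitGroupData
{
    float* hitcounter;
};

struct HitGroupRecord
{
    __align__(OPTIX_SBT_RECORD_ALIGNMENT) char header[OPTIX_SBT_RECORD_HEADER_SIZE];
    HitGroupData data;
};

// Inside the closest hit program:
const HitGroupData* hg = reinterpret_cast<const HitGroupData*>(optixGetSbtDataPointer());
float* buffer = hg->hitcounter;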


David.

With 64 KB I would have to use models with fewer than 3000 unique triangles. Is there a performance downside to using a (I think it’s called global?) memory buffer instead of writing into the params struct?
We might be using bigger models in the future, so if it won’t impact performance we will just use a global buffer and pass the pointer in the params struct.

Below is my thread-UNsafe version of the code.
To make it thread safe I would have to rewrite line 7 with atomicAdd().
How can I achieve this? Do I have to call a CUDA kernel from within OptiX, or what would be the best way?

I am still working on a GT 1030 until my new RTX 2080 Ti arrives; I assume this isn’t the cause of the problem?

extern "C" __global__ void __closesthit__radiance()
{
  const int prim_idx = optixGetPrimitiveIndex();
  double *d_p_hitcounter = reinterpret_cast<double*>(params.p_hitcounter); //todo: move conversion out of ch()
  d_p_hitcounter = d_p_hitcounter + prim_idx;  //todo: have to check if prim_idx needs to multiplied by size_double
  double rayenergy = 1.0;
  *d_p_hitcounter = static_cast<double>(*d_p_hitcounter + rayenergy);
}

Thank you for help!
Martin

You cannot actually write into the launch parameter buffer itself from the device anyway because that is constant memory.
Use CUdeviceptr fields in the launch params which point to memory buffers on the device; those buffers are read/write from shaders.

Mind that double precision atomics do not exist before SM 6.0 (Pascal).
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications

Doubles will generally be slower than floats when used in device code for any calculations.
Most of the time by a lot! (Volta GPUs would slow down the least.)
If possible, it’s recommended to use 32-bit float types instead.

If you’re just counting, use an unsigned integer instead.

Other than that, you should be able to simply write line 7 as

atomicAdd(d_p_hitcounter, rayenergy);
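
With the buffer switched to float, the whole ch program then shrinks to something like this (a sketch, reusing the names from your snippet):

extern "C" __global__ void __closesthit__radiance()
{
  const int prim_idx = optixGetPrimitiveIndex();
  float* d_p_hitcounter = reinterpret_cast<float*>(params.p_hitcounter);
  const float rayenergy = 1.0f;
  atomicAdd(d_p_hitcounter + prim_idx, rayenergy); // float atomicAdd works on all GPUs that OptiX 7 supports
}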

Thank you for your answer!
For increased performance I will revert to floats.

I had already tried to change the addition in the way you suggested before I wrote my last post. Somehow VS19 says that atomicAdd is not defined. I thought that would be due to the 'extern "C"' tag of the ch() program; this is why I asked if I had to call an external CUDA function.

My build fails.

I did some digging, and it seems that I somehow need to set my CUDA_ARCH to 600 or higher. The GT 1030 should have compute capability 6.1 and thus support atomicAdd. Is this issue a question for the CUDA forum, or can you help me with my problem?

I appreciate the help and if this is not a trivial problem I can ask my question again in the CUDA forum.

I run VS19, CUDA 10.1 and the 441.87 driver.

Somehow my error seems to lie somewhere else.
I started from the optixPathTracer example.

Somehow when I try to run my modified program (or the original version from the SDK), I get the following error:

[ 2][ ERROR]: Multi-level graphs are disabled but "maxTraversableGraphDepth" is larger than 2
Caught exception: OPTIX_ERROR_INVALID_VALUE: Optix call 'optixPipelineSetStackSize( state.pipeline, direct_callable_stack_size_from_traversal, direct_callable_stack_size_from_state, contiunation_stacksize, max_traversal_depth )' failed: C:\Program Files (x86)\NVIDIA Corporation\Optix\SDK\Reverse_PathTrace\optixPathTracer.cpp:643)

When I change

const uint32_t max_traversal_depth = 3;

to 2, I get the following error:

[ 2][ ERROR]: Only single gas graphs are enabled but "maxTraversableGraphDepth" is not 1
Caught exception: OPTIX_ERROR_INVALID_VALUE: Optix call 'optixPipelineSetStackSize( state.pipeline, direct_callable_stack_size_from_traversal, direct_callable_stack_size_from_state, contiunation_stacksize, max_traversal_depth )' failed: C:\Program Files (x86)\NVIDIA Corporation\Optix\SDK\Reverse_PathTrace\optixPathTracer.cpp:643)

Setting it to 1 made it compile fine. Also, atomicAdd() now works fine, but it is not recognized by IntelliSense (which isn’t a problem, I guess).

As far as I know, I didn’t change anything except updating the driver from 436.48 to 441.87.
Can you tell me why that happened, and is it even a problem?

Thank you for your help and I hope I am not bothering you too much.

Martin

That should only be a matter of includes inside your OptiX *.cu program source code.

Visual Studio shouldn’t be directly involved in the compilation from CUDA C *.cu files to the input *.ptx code for OptiX. That is done by NVCC in the end. I assume that code line doesn’t appear inside a *.cpp file.

The relevant CUDA atomicAdd() functions are defined inside the CUDA headers device_atomic_functions.h and sm_60_atomic_functions.h.
Those cannot be included directly, nor can device_functions.h for that matter.

In OptiX 7 you probably only need to add #include <cuda_runtime.h> inside that shader.
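
That is, the top of the shader file would look something like this (sketch):

#include <optix.h>
#include <cuda_runtime.h> // indirectly pulls in the atomicAdd() declarations

extern "C" __constant__ Params params;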

That’s an oversight inside the optixPathTracer which was revealed by stricter checks in more recent drivers, after the first one supporting OptiX 7.
https://devtalk.nvidia.com/default/topic/1066486/optix/optix-7-examples-error/post/5401320
https://devtalk.nvidia.com/default/topic/1067032/optix/optix7-optixpathtracer-sample-won-t-run-with-latest-driver-441-20/post/5404549
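
For reference, the traversableGraphFlags in the pipeline compile options and the maxTraversableGraphDepth argument of optixPipelineSetStackSize() must agree. Roughly (a sketch; variable names follow the optixPathTracer sample):

// A scene with a single GAS and no instancing:
pipeline_compile_options.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_GAS;

// ... then maxTraversableGraphDepth must be 1:
const uint32_t max_traversal_depth = 1;
OPTIX_CHECK(optixPipelineSetStackSize(state.pipeline,
                                      direct_callable_stack_size_from_traversal,
                                      direct_callable_stack_size_from_state,
                                      continuation_stack_size,
                                      max_traversal_depth));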

Exactly!

I have another question regarding the optix7course by Ingo Wald.

The mesh viewer in the SDK examples seems to lack a regular OptiX pipeline. At least I couldn’t find any calls for ray tracing in this example. It seems to lack a “device functions” .cu file, and I was wondering if this example really is rendered without any ray tracing.
If this isn’t the case, the second part of my question is obsolete and I only need to understand the ray tracing calls and where the device functions are called, so I can modify them accordingly.

Since I also struggled to figure out a way to read out the number of primitives from the GLTF loader in the SDK examples (it’s printed to the console, but I wasn’t able to trace back which variable is used and how to access it), I decided to switch to ex07_firstRealModel, where I at least manage to read out the number of vertices from an .obj file.

With the number of primitives I want to set up a buffer that accumulates the “ray energy” each primitive receives, still using the same code I already posted earlier:

CUdeviceptr d_hitcounter = 0;
size_t float_size = sizeof(float);
CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_hitcounter),float_size*TRIANGLE_COUNT));
CUDA_CHECK(cudaMemset(reinterpret_cast<void*>(d_hitcounter), 0, float_size*TRIANGLE_COUNT));

I then want to divide the accumulated values by the number of cast rays and thus obtain the absolute energy transferred from the source (and later also from reflections) to each primitive.
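
Roughly like this (a sketch; NUM_RAYS stands for however many rays I launch in total):

std::vector<float> hitcounter(TRIANGLE_COUNT);
CUDA_CHECK(cudaMemcpy(hitcounter.data(), reinterpret_cast<void*>(d_hitcounter),
                      sizeof(float) * TRIANGLE_COUNT, cudaMemcpyDeviceToHost));
for (float& e : hitcounter)
    e /= static_cast<float>(NUM_RAYS); // fraction of the emitted energy that reached this triangle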

Now to my actual problem: I don’t know where to initialize the buffer.
For accessibility reasons I thought it would be best to set it up in the main function and then pass the pointer into the launch params. However, when I add it after the

Model *model = loadOBJ(...)

call, it won’t compile because it doesn’t recognize cudaMalloc.
Am I lacking some #includes?

It also fails when I put it in SampleRenderer.cpp in front of the OptiX initialization/pipeline creation.

Is this even the correct place to set up the buffer? Is my way of trying to achieve my goal even logical/efficient?

I hope you were able to understand where I’m struggling and that I didn’t confuse you too much with my question.
Thank you for your help in advance; your input is always very helpful.

The simplest approach to understanding how any of the examples work would be to compile them as a debug target and then single-step through the code inside the debugger to see exactly which calls are made in what order.

Have the OptiX API Reference open and look at these calls and structures in parallel.
Both the programming guide and the API reference have a search field in the top right.
The programming guide also cross-links to the API documentation.
https://raytracing-docs.nvidia.com/optix7/index.html

Also if you need to find anything inside other people’s source code, “Find In Files” is your friend.
For example, if there is an output message of a program you’re looking for, search for the fixed string in that message over all sources and look at all hits.
With that method you’ll see that the GLTF loader used inside the optixMeshViewer prints the “Num triangles:” string in file “OptiX SDK 7.0.0\SDK\sutil\Scene.cpp” line 198.

If you have trouble getting things to compile and work in your own framework, it helps to copy a working example and change that until you either have what you need or can apply the required concepts to your own application.