The API docs available here: https://raytracing-docs.nvidia.com/optix6/api_6_0/html/index.html now seem to only display the C++ API. Could we get the C API docs back please?
Thanks, I’ll inform our technical writer.
Looks like there is something wrong with the left frame. The functions are still there though.
In the left frame please navigate to NVIDIA OptiX 6.0 API > Files > File List > optix_host.h
The search field in the top-right also finds them.
The OptiX 6.0 API Reference “Modules” section is back again.
Please start at this hub for up-to-date NVIDIA ray tracing documentation: https://raytracing-docs.nvidia.com
Also notice the new OptiX 7.0 documentation on that site.
Yes I noticed the new optix 7 docs. I’m sure I’ll have lots of questions when I work up the courage to rewrite my renderer :D
Is there a good guide to the basics of CUDA streams as required for optix 7 other than the CUDA docs? It’s not obvious to me how to take advantage of the asynchronous nature of the streams. Raygen programs will still use the entire device, won’t they, so multiple launches for sample iterations will still be serial?
Or is it more that I could multi-thread my scene graph updates using multiple streams, then use events to synchronize before launching the raygen program (or just have the raygen program on the default stream so it will automatically sync with the other streams before launching)?
It’s exactly like in CUDA, all functions taking a CUDA stream argument are asynchronous.
When using launches and asynchronous CUDA memory copies, the usual care needs to be taken to parallelize or synchronize the input and output data.
Rendering on multiple devices, maybe via peer-to-peer, adds one more dependency: synchronization then needs to take the other devices into account.
Using multiple streams in a progressive renderer which works on complete sub-frames would be a little involved and possibly not worth it. (Not so much for a bucket renderer producing final tiles.)
That would require a pipeline per stream and different pipeline parameters per optixLaunch(). See the note on concurrent launches here:
The accumulation of full frames from multiple streams into one shared buffer could only use atomicAdd and then your performance is worse. With multiple buffers you’d need a proper initialization strategy because not all streams start at frame 0, and a separate CUDA kernel would need to composite the results in either case. It’s getting complicated.
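As a rough illustration of the multiple-buffer variant, here is a host-side C++ sketch of one possible initialization and compositing strategy. It is an assumption, not something taken from the OptiX SDK: each stream's buffer starts at zero and keeps its own sub-frame count, so it does not matter which global frame index a stream started with. StreamAccum and composite are hypothetical names, and composite stands in for the separate CUDA kernel mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-stream accumulation buffer: a running per-pixel sum plus
// the number of sub-frames this stream has added so far.
struct StreamAccum {
    std::vector<float> sum;
    int frames = 0;

    explicit StreamAccum(std::size_t pixels) : sum(pixels, 0.0f) {}

    void add(const std::vector<float>& subframe) {
        assert(subframe.size() == sum.size());
        for (std::size_t i = 0; i < sum.size(); ++i) sum[i] += subframe[i];
        ++frames;
    }
};

// Host stand-in for the compositing kernel: add up all per-stream sums and
// divide by the total number of sub-frames accumulated across all streams.
std::vector<float> composite(const std::vector<StreamAccum>& accums) {
    if (accums.empty()) return {};
    std::vector<float> out(accums[0].sum.size(), 0.0f);
    int total = 0;
    for (const StreamAccum& a : accums) {
        total += a.frames;
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += a.sum[i];
    }
    if (total == 0) return out;  // nothing rendered yet
    for (float& v : out) v /= static_cast<float>(total);
    return out;
}
```

On the device the per-pixel loop would become one thread per pixel, and the compositing step would have to be ordered after the launches of every contributing stream.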
The OptiX SDK 7 example optixRaycasting generates two pipelines, two streams, and two device-side pipeline parameter blocks and launches them both, but that example passes on concurrency because there is a cudaMalloc in between the launches which incurs a device synchronization. :-(
If you’re intending to parallelize such things I would recommend writing an arena allocator to avoid many calls to cudaMalloc.
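A minimal sketch of what such an arena allocator could look like, assuming a simple bump-pointer scheme. The Arena class and its members are hypothetical names. The code only does address arithmetic, so it runs on the host as is; in a renderer the base address would come from a single cudaMalloc or cuMemAlloc done once at startup, and is assumed to be aligned at least as strictly as the largest alignment requested.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical bump ("arena") allocator. It hands out sub-ranges of one big
// block, so no further device allocations happen between launches.
class Arena {
public:
    // base: start address of the big block (assumed suitably aligned).
    // capacity: size of the block in bytes.
    Arena(std::uintptr_t base, std::size_t capacity)
        : base_(base), capacity_(capacity), offset_(0) {}

    // Returns an address aligned to `alignment` (must be a power of two),
    // or 0 if the remaining space is too small.
    std::uintptr_t allocate(std::size_t bytes, std::size_t alignment = 256) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || bytes > capacity_ - aligned) return 0;
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

    // Releases everything at once, e.g. when the scene is rebuilt.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    std::uintptr_t base_;
    std::size_t capacity_;
    std::size_t offset_;
};
```

A bump allocator cannot free individual blocks, which is usually acceptable for per-scene or per-frame data that is thrown away together. Anything that has to be freed piecemeal would need a free list on top.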
Building acceleration structures from multiple host threads in parallel will work in OptiX 7.
Updating anything in the active scene graph will require the proper synchronization against the running launches or must all be asynchronous in the same stream.
Such an update would change the instance acceleration structures, and these are inherently connected to the shader binding table layout, which you cannot change in the middle of a launch.
The optixLaunch() takes a CUDA stream argument and is therefore asynchronous.
If you have different tasks to do in parallel on the CPU, that would just happen automatically.
If you have different tasks to do on the same device, you would need a separate non-blocking CUDA stream.
I have not tried running OptiX on the default stream at all. I create my own streams with CU_STREAM_NON_BLOCKING and those aren’t synchronized against other streams.
Reading back data to the host after the OptiX launches involves a cuMemcpy* or cuMemcpy*Async. (I prefer the CUDA Driver API due to the more explicit multi-GPU context management.)
The synchronous copy would wait for the context to finish pending operations, while the asynchronous one would need to be waited for explicitly with a cuStreamSynchronize(stream) before accessing the copied data on the host.
Some other calls are synchronous as well, for example when doing OpenGL interop with cuGraphicsMapResources() and cuGraphicsUnmapResources().
So to really have completely asynchronous launches you would need to copy the per-frame data, like the iteration index, with an asynchronous memcpy into the context global variables block whose name you specify via OptixPipelineCompileOptions.pipelineLaunchParamsVariableName. The sources of those copies would need to be parallelized as well, because you never know when the cuMemcpyHtoDAsync will actually be executed.
If you have any concurrent work to do beside that, use different CUDA kernels on a different non-blocking stream.
That means it’s possible to push many launches for different frames or tiles into a single CUDA stream at once, which will drastically reduce the interactivity.
In my progressive renderers I normally show an update every second as long as nothing changes, to save bandwidth. When trying that fully asynchronous launch method, my renderer had already launched so many frames into the stream that a synchronize to display the currently rendered result waited for multiple seconds (at >50 fps). I implemented a benchmark mode which does that for final frame rendering, where I only want the finished frame stored to disk. That gives me the maximum ray tracing performance to compare against interactive display.