Quickest way to get byte array from output buffer?

I want to return my rendered image as a byte array from a function, so I was looking into the code used in the OptiX SDK examples to save images. This left me a little confused about the point where the data is transferred from GPU device memory to “normal” host memory. The workflow I have seen there is:

sutil::CUDAOutputBuffer is used to create a sutil::ImageBuffer, which is passed to sutil::saveImage(), which calls stbi_write_png(), which in turn calls stbi_write_png_to_mem().

So my question is: Am I correct that the actual data is only available in host memory at the end (with stbi_write_png_to_mem()) and is only on the device before? The sutil::ImageBuffer and sutil::saveImage functions seem like “wrapper” functions to provide the possibility of handling multiple data types, so I could possibly skip them when only using a single data type.

Let’s take a step back from the OptiX SDK example code helper classes which are there for convenience only and are not required in your own applications.

The CUDA device memory is effectively a 64-bit pointer to device accessible memory, usually residing in VRAM.
In the end that is simply a CUdeviceptr type, a 64-bit unsigned value.
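For example (just a sketch using the CUDA runtime API; the CUDA_CHECK macro is the error-check helper the SDK examples use):

    // Allocate device memory; the returned pointer is just a 64-bit value
    // which can be stored and passed around as a CUdeviceptr.
    void* d_buffer = nullptr;
    CUDA_CHECK( cudaMalloc( &d_buffer, 16 * 1024 * 1024 ) ); // 16 MB of device memory
    CUdeviceptr d_ptr = reinterpret_cast<CUdeviceptr>( d_buffer );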

You copy memory from host-to-device, device-to-host, or device-to-device with the resp. CUDA memcpy commands which slightly differ between the CUDA runtime and driver APIs.
There are also asynchronous versions of those which will take a stream argument and run in order with other commands inside the stream.

Please search for cudaMemcpy inside the OptiX SDK examples and look at their cudaMemcpyKind argument which controls the direction in which the copy happens.
Documented inside the CUDA Runtime Memory Management functions.

It’s often used to read back the compacted size of an acceleration structure inside the OptiX SDK:

    size_t compacted_gas_size;
    CUDA_CHECK( cudaMemcpy( &compacted_gas_size, (void*)emitProperty.result, sizeof( size_t ), cudaMemcpyDeviceToHost ) );

Now if you check with that information how the CUDAOutputBuffer class is reading the data back inside the
PIXEL_FORMAT* CUDAOutputBuffer<PIXEL_FORMAT>::getHostPointer() function, you’ll see how that is using cudaMemcpy() in some cases.
The crucial part in that function is that it’s using a map() function which works differently depending on where the device accessible memory has been allocated originally.

Inside the SDK examples the default case is using OpenGL interoperability, where the device buffer is allocated as a Pixel Buffer Object (PBO) on the OpenGL side and then accessed as a graphics resource in CUDA.
That is completely unnecessary when all you need is storing CUDA device memory to a file on disk!
Means you do not want that code path when the main goal is rendering into CUDA device memory with OptiX and saving the resulting data from the native CUdeviceptr back to the host memory. Instead you want the code paths which only deal with the m_device_pixels.
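A rough sketch of that plain CUDA path, without any sutil or OpenGL involvement (variable names are placeholders, assuming the raygen program writes uchar4 pixels):

    // Device buffer the raygen program writes into (passed via the launch parameters).
    uchar4* d_pixels = nullptr;
    CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &d_pixels ), width * height * sizeof( uchar4 ) ) );

    // ... optixLaunch() renders into d_pixels ...

    // Copy the finished image back to the host as a plain byte array.
    std::vector<unsigned char> host_pixels( width * height * sizeof( uchar4 ) );
    CUDA_CHECK( cudaMemcpy( host_pixels.data(), d_pixels, host_pixels.size(), cudaMemcpyDeviceToHost ) );

    // host_pixels now contains the RGBA8 data and can be returned
    // or handed to stbi_write_png() directly.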

If you do not need any interactive display capability in your OptiX program, all that OpenGL code can go away. That would also be helpful when running the application on servers which aren’t implementing any graphics features, like when running on a GPU in Tesla Compute Cluster mode under Windows.

Thanks for the clarification and for pointing me specifically to the cudaMemcpy commands! I understand the different directions it can be used in now. So the sutil::CUDAOutputBuffer class basically implements variants of the functionality for handling different use cases (like OpenGL interoperability), selected by setting the CUDAOutputBufferType enum. I still have some questions though.

There are also asynchronous versions of those which will take a stream argument and run in order with other commands inside the stream.

I get that cudaMemcpyAsync() is the async version of cudaMemcpy(), but the optixLaunch() function also takes a stream argument. If I understand you correctly, a stream isn’t strictly necessary? Is it possible to launch without the stream argument?

The crucial part in that function is that it’s using a map() function which works differently depending on where the device accessible memory has been allocated originally.

In the code for the sutil::CUDAOutputBuffer the map() function is used to provide the source for the memcpy operation in the getHostPointer() function. In all the non-stream examples it just uses the m_device_pixels. What is the functionality in the ZERO_COPY case? It seems that there is memory allocated on the host for m_host_zcopy_pixels in the sutil::CUDAOutputBuffer::resize function that is then mapped with cudaHostGetDevicePointer().

I get that cudaMemcpyAsync() is the async version of cudaMemcpy(), but the optixLaunch() function also takes a stream argument. If I understand you correctly, a stream isn’t strictly necessary?

The cudaMemcpy calls without a stream argument are synchronous, meaning they automatically wait for all work inside the context to have finished before doing the copy, which is often not what you want.
On the other hand, when using the asynchronous memcpy commands, you need to make sure the source and destination data pointers contain the correct values when the call is executed, which is not when you call it on the host side but potentially much later when the GPU is busy. So some care needs to be taken to call the required explicit stream synchronization calls to make sure the asynchronous operations have finished.
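A minimal sketch of the difference (buffer and stream names are placeholders):

    // Synchronous copy: waits for the preceding work, then copies; host data is valid on return.
    CUDA_CHECK( cudaMemcpy( host_pixels.data(), d_pixels, num_bytes, cudaMemcpyDeviceToHost ) );

    // Asynchronous copy: only enqueued into the stream here and executed later on the GPU.
    // Both pointers must still be valid then (and the host buffer should ideally be pinned).
    CUDA_CHECK( cudaMemcpyAsync( host_pixels.data(), d_pixels, num_bytes, cudaMemcpyDeviceToHost, stream ) );
    CUDA_CHECK( cudaStreamSynchronize( stream ) ); // host data is only valid after this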

Is it possible to launch without the stream argument?

Nope. https://raytracing-docs.nvidia.com/optix7/api/group__optix__host__api__launches.html#ga089e2a00833cb952276c5d6e09b692da

All OptiX SDK entry point functions which take a stream argument are asynchronous. That includes the optixLaunch as well as optixAccelBuild calls for example.

There exists a default CUDA stream 0 (zero) which has specialized synchronization behavior, which can also affect other streams depending on how the CUDA context has been created. You usually want the scheduler to be non-blocking for fully asynchronous operations.
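A hedged sketch of that (error-check macros as in the SDK examples; variable names are placeholders):

    // Create a stream which does not implicitly synchronize with the legacy default stream 0.
    cudaStream_t stream = nullptr;
    CUDA_CHECK( cudaStreamCreateWithFlags( &stream, cudaStreamNonBlocking ) );

    // optixLaunch() is asynchronous; it only enqueues the launch into the stream.
    OPTIX_CHECK( optixLaunch( pipeline, stream, d_params, sizeof( Params ), &sbt, width, height, 1 ) );

    // Wait until all work in the stream, including the launch, has finished.
    CUDA_CHECK( cudaStreamSynchronize( stream ) );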
Again, please read the resp. CUDA Programming Manual for more details.

What is the functionality in the ZERO_COPY case? It seems that there is memory allocated on the host for m_host_zcopy_pixels in the sutil::CUDAOutputBuffer::resize function that is then mapped with cudaHostGetDevicePointer().

“zero copy” or “page-locked” or “pinned memory” is a host memory allocation which is GPU device accessible.
Means the GPU can directly access some amount of host RAM via the PCI-E bus, and that in turn means it’s a lot slower than accesses to the VRAM on the board (like 12 GB/s vs. 670 GB/s). This is only needed for special cases if you know what you’re doing and should generally be avoided for output buffers which should better live in VRAM.
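For completeness, a sketch of what such a pinned, mapped allocation looks like (similar in spirit to the ZERO_COPY path, but the names here are just placeholders):

    // Allocate page-locked host memory which is mapped into the device address space.
    uchar4* h_pixels = nullptr;
    CUDA_CHECK( cudaHostAlloc( reinterpret_cast<void**>( &h_pixels ),
                               width * height * sizeof( uchar4 ), cudaHostAllocMapped ) );

    // Get the device pointer aliasing that host allocation.
    // The GPU then reads and writes this memory over the PCI-E bus.
    uchar4* d_pixels = nullptr;
    CUDA_CHECK( cudaHostGetDevicePointer( reinterpret_cast<void**>( &d_pixels ), h_pixels, 0 ) );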

Just for reference, my OptiX 7 examples on github are only using the OptiX 7 SDK headers and none of the sutil helpers and do the CUDA resource management explicitly. It might make sense to look at them for a second opinion on an OptiX application framework. I don’t use the CUDA default stream 0 in my applications.
https://github.com/NVIDIA/OptiX_Apps


Thanks a lot for the detailed explanation! It seems I have to get more familiar with CUDA in general. Thanks for pointing out your advanced samples with an implementation “from scratch”. Since we are building a POC and I have to jump between different domains, time is always a big factor, so I was really happy about the sutil functionality (especially building acceleration structures from an input glTF), since it allowed me to get things up and running quickly (at the cost of understanding things in detail). I replaced our initial handwritten Python ray tracer with the OptiX framework for obvious performance reasons. So now I find myself in a spot where I see the need to dive deeper into CUDA to do things “right” at the cost of spending a larger time budget. If you have any advice on how and where best to engage with the complexity OptiX and CUDA bring to the table, I’d be glad to hear it. Thanks again for the amazing help you have provided already!

If you have any advice on how and where best to engage with the complexity OptiX and CUDA bring to the table, I’d be glad to hear it.

Well, that’s a wide topic.

My approach to learning new things is first to understand what is possible while having a specific use case in the back of my mind.

That’s where Programming Guides and API references come into play.
The online OptiX Programming Guide and API Reference have a nice search function in the top right.
The search results on the online CUDA Programming Manual site are sometimes a little broad, so I sometimes search inside the PDF version instead.

After finding out what is possible and what not, the next step is to figure out how the necessary things work.
The easiest way to learn that is from existing code and tutorials inside the SDK and whatever you can lay your hands on.
The issue with this is finding the good ones and not learning from bad example code, so the more programming experience you have, the better.

My most frequently used tool is Find In Files in source code editors (MSVS, VS Code, Notepad++, etc.).
VS Code can open whole folders and quickly search through all files conveniently. I sometimes have multiple of them open with different folders just for the searching functionality. 4K and multi-monitor setups for the win.
When you have a MSVS project running, all the tools which jump to function and structure definitions can help learning about how things work, since the headers defining these usually contain the documentation.

Means if you have, for example, questions about how to copy memory, just search for words related to memory, copy, memcpy, etc. in all related example sources you have. Once you’ve found the function doing that, look it up inside the include headers or Programming Manual again. Rinse, repeat.

When using OptiX, there is not really a need to dive too deep into the CUDA kernel programming because the OptiX single-ray programming model makes it simple to concentrate on what should happen per ray.
Some of the things CUDA device programming offers are not even available because the scheduling in OptiX forbids them, like warp-wide synchronizations, shared memory, etc., though you can use everything in native CUDA kernels running on the input or output data outside the optixLaunch, as the optixRaycasting example shows.
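For example, a trivial post-process on the launch output could look like this (a sketch, not taken from the SDK; kernel and variable names are made up):

    // Native CUDA kernel running over the OptiX output buffer after the launch.
    __global__ void invertPixels( uchar4* pixels, unsigned int count )
    {
        const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
        if( i < count )
        {
            const uchar4 c = pixels[i];
            pixels[i] = make_uchar4( 255 - c.x, 255 - c.y, 255 - c.z, c.w );
        }
    }

    // Host side: enqueue the kernel on the same stream, after the optixLaunch().
    const unsigned int count = width * height;
    invertPixels<<<( count + 255 ) / 256, 256, 0, stream>>>( d_pixels, count );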

You have to make a choice about the CUDA API you’re using on the host. The CUDA Runtime API is more high level and easier to use. There might also be some CUDA libraries which only work together with the runtime API. (Not my expertise.)
The CUDA Driver API is more low level, so it’s slightly harder to use, esp. when launching native CUDA kernels, but it offers better control over the CUDA contexts, which is why I use it, especially for multi-GPU use cases.
(I’m showing the CUDA API differences between runtime and driver API in one introduction example.)
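To illustrate the difference for a single allocation (just a sketch, error handling omitted):

    // CUDA Runtime API: the context is created implicitly on first use.
    void* d_runtime = nullptr;
    cudaMalloc( &d_runtime, num_bytes );

    // CUDA Driver API: explicit initialization, device selection, and context creation.
    cuInit( 0 );
    CUdevice device = 0;
    cuDeviceGet( &device, 0 );
    CUcontext context = nullptr;
    cuCtxCreate( &context, 0, device );
    CUdeviceptr d_driver = 0;
    cuMemAlloc( &d_driver, num_bytes );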

Then there are also books.
For CUDA programming just search the web for “CUDA Books” and you’ll get the standard ones (e.g. CUDA by Example, Programming Massively Parallel Processors, etc.).

For ray tracing beginners, have a look at the books from Peter Shirley. Online editions here: https://raytracing.github.io/.

Then there are the Raytracing Gems Books with a wide variety of topics.

