OptixHello: how can I access the values of device data?

Hi,
I am trying to put the basic concepts of optixHello together, and I would like to ask how I can get the actual data from the device in order to apply some math calculations to it.

Let's say that in optixHello we are painting the window one solid color:

    rg_sbt.data = { 0.111f, 0.222f, 0.333f };
    
    CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void*>(raygen_record),
        &rg_sbt,
        raygen_record_size,
        cudaMemcpyHostToDevice
    ));

Is this the part where device memory is filled with the triplet 0.111f, 0.222f, 0.333f for each pixel?
Is there sample code that would copy these triplets from the device into a local variable?

Thanks!

Please read these posts:
https://forums.developer.nvidia.com/t/going-through-optix7course-and-am-confused-about-launchparams-and-how-to-get-depth-buffer/201439/2
https://forums.developer.nvidia.com/t/optix-launch-parameters-best-practices/231443/2

If you search the OptiX SDK *.cpp source files for cudaMemcpyDeviceToHost you’ll find such cases.
The optixConsole application is one of the simplest examples which does not use any interop or graphics display, and it does this call to copy the output buffer from device to host:

CUDA_CHECK( cudaMemcpy( output_buffer.data(), state.params.frame_buffer, width * height * sizeof( uchar4 ), cudaMemcpyDeviceToHost ) );

The optixHello example is using the CUDAOutputBuffer helper class and that copy from device to host happens inside its getHostPointer() function which handles the necessary steps depending on how that buffer had been allocated.

Mind the actual data type: it's not floating point in these examples but uchar4.
For performance reasons it's not recommended to use 3-component vector data types like float3 for output buffers.
float4 is faster to read and write because there are vectorized .v4 (and .v2) load/store instructions for 4- and 2-component vectors, but none for 3-component ones. The same applies to the other basic types.

Thanks for your reply. I put together your suggestions and I think I am getting closer to the main concepts that I didn't have a clear understanding of. So now I have followed the OptiX process:

  • Created module
  • Created program groups
  • Linked pipeline
  • Set up shader binding table
  • Launched:

    CUstream stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    Params params;
    params.image = output_buffer.map();
    params.image_width = sz.cx;

    CUdeviceptr d_param;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_param), sizeof(Params)));
    CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void*>(d_param),
        &params, sizeof(params),
        cudaMemcpyHostToDevice
    ));
    OPTIX_CHECK(optixLaunch(pipeline, stream, d_param, sizeof(Params), &sbt, sz.cx, sz.cy, /*depth=*/1));
    CUDA_SYNC_CHECK();

Then I created an output buffer:
sutil::CUDAOutputBuffer<float4> output_buffer(sutil::CUDAOutputBufferType::CUDA_DEVICE, sz.cx, sz.cy);

Now I need somehow to copy the output buffer to a host buffer:

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);

I suppose that uchar4 is the fastest type for this cudaMemcpy, but since I need the buffer to be in float format, how do I declare the local float buffer before I cudaMemcpy into it? Sample code would be really helpful!
Thanks!

You would need to set up your launch parameters before the launch; in particular, the output buffer device pointer needs to be set in them before the launch, or you will get illegal access errors.

I suppose that uchar4 is the fastest way to do this cudaMemcpy,

memcpy doesn’t care what type your buffer is. It takes the size in bytes.

how do I declare the local float buffer before I cudaMemcpy to it?

I repeat: The optixHello example is using the CUDAOutputBuffer helper class and that copy from device to host happens inside its getHostPointer() function which handles the necessary steps depending on how that buffer had been allocated.

If you look at the CUDAOutputBuffer getHostPointer() function code, you’ll see how the class is maintaining a host side allocation in its m_host_pixels member variable already (See std::vector<PIXEL_FORMAT> m_host_pixels;). No need to allocate your own host side data.

And again, the optixConsole application shows how to do that without using the CUDAOutputBuffer helper class. It’s simply using a std::vector (named output_buffer) for that.
Search for this code inside the optixConsole application:

    unsigned int width  = state.params.width;
    unsigned int height = state.params.height;

    std::vector<uchar4> output_buffer( width * height );
    CUDA_CHECK( cudaMemcpy( output_buffer.data(), state.params.frame_buffer, width * height * sizeof( uchar4 ), cudaMemcpyDeviceToHost ) );

with the same size as the device buffer:

    state.params.width  = 48u * 2u;
    state.params.height = 32u * 2u;
    ...
    CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &state.params.frame_buffer ),
                            state.params.width * state.params.height * sizeof( uchar4 ) ) );

You would just need to change the type to float4 in your case.

how can I get the actual data from the device in order to apply some math calculations.

Once you get everything working, depending on what calculations you need to do, it might be a lot faster to do these calculations with native CUDA kernels on the device data.
An example which uses CUDA kernels to generate rays and calculate shading on the intersection data can be found inside the optixRaycasting example which implements a wavefront rendering approach.