OptixHello: how can I access the values of device data?

Hi,
I am trying to put the basic concepts of optixHello together, and I would like to ask how I can get the actual data from the device in order to apply some math calculations to it.

Let's say that in optixHello we are painting the window one solid color:

    rg_sbt.data = { 0.111f, 0.222f, 0.333f };
    
    CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void*>(raygen_record),
        &rg_sbt,
        raygen_record_size,
        cudaMemcpyHostToDevice
    ));

Is this the part where device memory is filled with the triplet 0.111f, 0.222f, 0.333f for each pixel?
Is there sample code that would copy these triplets from the device into a local variable?

Thanks!

Please read these posts:
https://forums.developer.nvidia.com/t/going-through-optix7course-and-am-confused-about-launchparams-and-how-to-get-depth-buffer/201439/2
https://forums.developer.nvidia.com/t/optix-launch-parameters-best-practices/231443/2

If you search the OptiX SDK *.cpp source files for cudaMemcpyDeviceToHost you’ll find such cases.
The optixConsole application is one of the simplest examples which does not use any interop or graphics display, and it does this call to copy the output buffer from device to host:

CUDA_CHECK( cudaMemcpy( output_buffer.data(), state.params.frame_buffer, width * height * sizeof( uchar4 ), cudaMemcpyDeviceToHost ) );

The optixHello example is using the CUDAOutputBuffer helper class and that copy from device to host happens inside its getHostPointer() function which handles the necessary steps depending on how that buffer had been allocated.

Mind the actual data type: it's not floating point in these examples but uchar4.
For performance reasons it's not recommended to use 3-component vector data types like float3 for output buffers.
float4 is faster to read and write because there are vectorized .v4 (and .v2) load/store instructions for 4- and 2-component vectors, but none for 3-component ones. The same applies to the other basic types.

Thanks for your reply. I put together your suggestions and I think I am getting closer to the main concepts that I didn't have a clear understanding of. So now I have followed the OptiX process:

  • Created module
  • Created program groups
  • Linked pipeline
  • Set up shader binding table
  • Launched:

    CUstream stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    Params params;
    params.image = output_buffer.map();
    params.image_width = sz.cx;

    CUdeviceptr d_param;
    CUDA_CHECK(cudaMalloc(reinterpret_cast<void**>(&d_param), sizeof(Params)));
    CUDA_CHECK(cudaMemcpy(
        reinterpret_cast<void*>(d_param),
        &params, sizeof(params),
        cudaMemcpyHostToDevice
    ));
    OPTIX_CHECK(optixLaunch(pipeline, stream, d_param, sizeof(Params), &sbt, sz.cx, sz.cy, /*depth=*/1));
    CUDA_SYNC_CHECK();

Then I created an output buffer:
sutil::CUDAOutputBuffer<float4> output_buffer(sutil::CUDAOutputBufferType::CUDA_DEVICE, sz.cx, sz.cy);

Now I need somehow to copy the output buffer to a host buffer:

cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);

I suppose that uchar4 is the fastest type for this cudaMemcpy, but since I need the buffer to be in float format, how do I declare the local float buffer before I cudaMemcpy into it? Sample code would be really helpful!
Thanks!

You would need to set up your launch parameters before the launch; in particular, the output buffer device pointer needs to be set in them before the launch, or you will get illegal access errors.

I suppose that uchar4 is the fastest way to do this cudaMemcpy,

memcpy doesn’t care what type your buffer is. It takes the size in bytes.

how do I declare the local float buffer before I cudaMemcpy to it?

I repeat: The optixHello example is using the CUDAOutputBuffer helper class and that copy from device to host happens inside its getHostPointer() function which handles the necessary steps depending on how that buffer had been allocated.

If you look at the CUDAOutputBuffer getHostPointer() function code, you’ll see how the class is maintaining a host side allocation in its m_host_pixels member variable already (See std::vector<PIXEL_FORMAT> m_host_pixels;). No need to allocate your own host side data.

And again, the optixConsole application shows how to do that without using the CUDAOutputBuffer helper class. It’s simply using a std::vector (named output_buffer) for that.
Search for this code inside the optixConsole application:

    unsigned int width  = state.params.width;
    unsigned int height = state.params.height;

    std::vector<uchar4> output_buffer( width * height );
    CUDA_CHECK( cudaMemcpy( output_buffer.data(), state.params.frame_buffer, width * height * sizeof( uchar4 ), cudaMemcpyDeviceToHost ) );

with the same size as the device buffer:

    state.params.width  = 48u * 2u;
    state.params.height = 32u * 2u;
    ...
    CUDA_CHECK( cudaMalloc( reinterpret_cast<void**>( &state.params.frame_buffer ),
                            state.params.width * state.params.height * sizeof( uchar4 ) ) );

You would just need to change the type to float4 in your case.

how can I get the actual data from the device in order to apply some math calculations.

Once you get everything working, depending on what calculations you need to do, it might be a lot faster to do these calculations with native CUDA kernels on the device data.
An example which uses CUDA kernels to generate rays and calculate shading on the intersection data can be found inside the optixRaycasting example which implements a wavefront rendering approach.