Rendering for multiple predefined camera locations

Hello,

I have created a renderer that renders an image for given camera parameters. Suppose I have a huge number of predefined camera positions, e.g., a million. Is there a better way to render for each camera position than doing it in an iterative for loop?

The code I have now is basically:

for (auto& camera : cameras) {
    renderer->resize(fbSize);
    renderer->setCamera(camera);
    renderer->render();
    std::vector<uint32_t> pixels(fbSize.x * fbSize.y);
    renderer->downloadPixels(pixels.data());
}

But I suspect the data I/O and the large number of optixLaunch calls could be a bottleneck that slows down the process. Or is it not?

Thanks

It depends on how big your frame buffers are and what you’re doing with the data after the downloadPixels() call in that loop.

1.) First you would need to determine what the bottleneck actually is.
Are you limited by how long the ray tracing takes, or by the device-to-host transfer of the pixels and whatever you do with them afterwards inside that loop? Nsight Systems can determine that.
If you’re unable to do anything asynchronously with the pixels data while the next image gets rendered on the GPU, you’re going to be limited by device-to-host data transfers.

2.) If the fbSize isn’t changing per camera, you can move these lines before the loop:

std::vector<uint32_t> pixels(fbSize.x*fbSize.y);
renderer->resize(fbSize);

3.) I assume renderer->setCamera(camera) updates the single camera’s information inside the launch parameters.
You could instead upload the information of that whole camera array into a CUDA device buffer and put that device pointer and number of cameras into the launch parameters outside the loop, and change your render() call to take a camera index as argument.
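In code, that could look roughly like this. The struct layout, the field names, and d_params (the device-side copy of the launch parameters) are assumptions for this sketch, not your renderer’s actual API:

// Hypothetical launch parameter layout.
struct LaunchParams
{
    Camera*   cameras;      // device pointer to all predefined cameras
    int       numCameras;
    int       cameraIndex;  // which camera the current launch uses
    uint32_t* colorBuffer;
};

// Once, before the loop: upload the whole camera array.
Camera* d_cameras = nullptr;
cudaMalloc(reinterpret_cast<void**>(&d_cameras), cameras.size() * sizeof(Camera));
cudaMemcpy(d_cameras, cameras.data(), cameras.size() * sizeof(Camera), cudaMemcpyHostToDevice);

// Per camera: only the small cameraIndex field changes on the device.
// Inside one stream, this copy and the following launch execute in order.
cudaMemcpyAsync(&d_params->cameraIndex, &i, sizeof(int), cudaMemcpyHostToDevice, stream);
optixLaunch(pipeline, stream, reinterpret_cast<CUdeviceptr>(d_params), sizeof(LaunchParams), &sbt, fbSize.x, fbSize.y, 1);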

4.) If the fbSize is comparatively small, like 256x256 or even smaller, and your ray tracing algorithm is rather fast, high-end GPUs might not be fully saturated by that launch dimension.
If you placed many of these cameras as tiles into a single bigger framebuffer, e.g. with a 256x256 image per camera, you could render 100 cameras at once into a 2560x2560 buffer in a single optixLaunch instead.
That would later require some more logic to split these tiles again into individual images on the host, but that could happen on the CPU on the copied pixels while the GPU already renders the next cameras.
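In device code, the tile decomposition could look roughly like this (params, tileWidth, tileHeight, tilesPerRow, and the Camera type are made-up names for this sketch):

extern "C" __constant__ LaunchParams params; // assumed global, as in the OptiX SDK samples

extern "C" __global__ void __raygen__tiled()
{
    const uint3 idx = optixGetLaunchIndex();
    const uint3 dim = optixGetLaunchDimensions();

    // Which tile (i.e. which camera) this launch index belongs to.
    const unsigned int tileX       = idx.x / params.tileWidth;
    const unsigned int tileY       = idx.y / params.tileHeight;
    const unsigned int cameraIndex = tileY * params.tilesPerRow + tileX;

    // Pixel coordinate inside this camera's own image.
    const unsigned int px = idx.x % params.tileWidth;
    const unsigned int py = idx.y % params.tileHeight;

    const Camera camera = params.cameras[cameraIndex];
    // ... generate the primary ray for (px, py) from camera as usual and write
    // the result into the big buffer at params.colorBuffer[idx.y * dim.x + idx.x].
}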

5.) Or, similar to the tiles approach but with more asynchronous operations: after allocating room for multiple camera output images on the device and on the host, you could render multiple cameras by issuing (asynchronous) optixLaunch calls and asynchronous memory copies into the same CUDA stream.
For that, each optixLaunch would need to know which camera it renders and how many optixLaunch calls are done per render() call, to be able to calculate its output buffer pointer.
I think when inserting CUDA events after the asynchronous copies, the CPU could also be triggered to start working on already finished pixel data.
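A rough host-side sketch of that, assuming one stream, a device buffer d_output with one image slot per camera of a batch, pinned host buffers h_pixels[i] (allocated with cudaMallocHost()), and pre-created events copyDone[i]; all of these names are made up:

const size_t pixelCount = fbSize.x * fbSize.y;
const size_t imageBytes = pixelCount * sizeof(uint32_t);

for (int i = 0; i < batchSize; ++i)
{
    // Async update of the camera index inside the launch parameters (see point 3).
    const int cameraIndex = firstCamera + i;
    cudaMemcpyAsync(&d_params->cameraIndex, &cameraIndex, sizeof(int), cudaMemcpyHostToDevice, stream);
    // The raygen program derives its output slot from the camera index,
    // so launch i writes into d_output + i * pixelCount.
    optixLaunch(pipeline, stream, reinterpret_cast<CUdeviceptr>(d_params), sizeof(LaunchParams), &sbt, fbSize.x, fbSize.y, 1);
    cudaMemcpyAsync(h_pixels[i], d_output + i * pixelCount, imageBytes, cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(copyDone[i], stream); // signals when image i has arrived on the host
}

// The CPU can consume finished images in order while the GPU keeps working.
for (int i = 0; i < batchSize; ++i)
{
    cudaEventSynchronize(copyDone[i]);
    processImage(h_pixels[i]); // hypothetical CPU-side work on the pixels
}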

So it really depends on what workload you’re talking about.
If this is doing something like rendering 4K images and saving them to disk, then you’re data-transfer limited anyway, and that could maybe be alleviated a little by double buffering the rendering and the pixel copies.


Thank you!

These are really nice suggestions! I will try to implement them.

The image size is indeed very small (even smaller than 256x256), so I think your answers will help me improve the speed.

I am also interested in option 5, the variant of the tiles approach with asynchronous operations. But I am not very familiar with stream operations in CUDA or OptiX. Do you happen to know of any examples that use similar techniques, like the OptiX SDK samples, that I could learn from?

Best

All OptiX API calls that take a CUDA stream are asynchronous, most notably optixAccelBuild and, for your case, optixLaunch.

I’m not using CUDA events in my examples, but I’m issuing many optixLaunch calls into one stream for a progressive path tracer in this example:
https://github.com/NVIDIA/OptiX_Apps/blob/master/apps/GLTF_renderer/Application.cpp#L2004
That renders into only one output buffer with accumulation, so I only need to update the sub-frame iteration index inside the launch parameters (which have been uploaded completely before that loop).
That upload of the iteration index happens asynchronously.
I just need to wait for all optixLaunch calls to have finished with the cudaDeviceSynchronize() at the end and can then copy the single output buffer.
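Stripped down to the relevant calls, that loop follows roughly this pattern (simplified names; see the linked Application.cpp for the real code):

for (unsigned int iteration = 0; iteration < totalIterations; ++iteration)
{
    // Asynchronously update just the sub-frame index inside the device launch parameters.
    cudaMemcpyAsync(&d_params->iteration, &iteration, sizeof(unsigned int), cudaMemcpyHostToDevice, stream);
    optixLaunch(pipeline, stream, reinterpret_cast<CUdeviceptr>(d_params), sizeof(LaunchParams), &sbt, width, height, 1);
}
cudaDeviceSynchronize(); // wait for all launches, then copy the accumulated result once
cudaMemcpy(h_pixels, d_outputBuffer, width * height * sizeof(uint32_t), cudaMemcpyDeviceToHost);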

Similar to that cudaMemcpyAsync() for the launch parameter update from host to device, you could also use cudaMemcpyAsync() for the device-to-host copies of each of your camera images. That requires different destination buffers on the host!
But if you want to start with work on the CPU once the asynchronous copy of the image has finished, you would need to signal that to the CPU with a CUDA event:
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EVENT.html
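A minimal usage sketch with made-up variable names: record an event on the stream right after the asynchronous copy, then either block on it or poll it from the CPU:

cudaEvent_t copyDone = nullptr;
cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming); // timing info not needed here

cudaMemcpyAsync(h_pixels, d_output, imageBytes, cudaMemcpyDeviceToHost, stream);
cudaEventRecord(copyDone, stream);

// Non-blocking: poll the event while doing other CPU work.
// (cudaEventSynchronize(copyDone) would be the blocking alternative.)
while (cudaEventQuery(copyDone) == cudaErrorNotReady)
{
    // do something else on the CPU
}
// cudaEventQuery() returned cudaSuccess: h_pixels is now safe to read.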

Here’s an old blog post about synchronizations and events and how to overlap data transfers with kernels.
https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/

Please have a look through the CUDA samples here: https://github.com/NVIDIA/cuda-samples
It looks like the simpleMultiCopy sample uses events for benchmarking asynchronous copies.

Also, when each optixLaunch dimension is really small, running the launches asynchronously inside a single stream would not help with saturating the GPU. There is a minimum launch dimension below which not all compute resources can be used to 100 percent. Rendering many camera images into a bigger buffer with a single optixLaunch would be the better approach then.
When using multiple streams, don’t overdo it. More than 8 CUDA streams won’t help with scaling.


Thank you for your answers! They are very helpful!