I’m a newbie with OptiX and ray tracing. I want to rasterize 310^8 triangles via ray tracing with OptiX 7.7. The image is about 410^10 pixels, and the time seems unacceptable: it takes almost 10 minutes.
Then I tried only 10 triangles with the same image size, and the time didn’t change. I used Nsight Systems to analyze the code; optixLaunch and cudaDeviceSynchronize took 99% of the time. I’m confused: the data transfer only takes a few seconds, so why does the launch take so long, and how can I optimize it?
Any help would be appreciated, and sorry for my poor English.
My environment: CUDA 12.2 / Ubuntu 22.04 / OptiX 7.7.0 / CUDA driver 535.113.01
Could you please clarify the two values 310^8 and 410^10?
Do you really mean 310 to the power of 8 and 410 to the power of 10?
Those are astronomical numbers and I cannot even imagine how to store that amount of data.
I mean, even if the data is empty, optixLaunch takes about 13060 ms with a large launch size of 1024^3 (width * height * depth). Is it possible to optimize that? I might have to relaunch 30 times.
Ok, so it’s 3 * 10^8 and 4 * 10^10, which are still huge.
1.)
There is an OptiX launch size limitation of 2^30.
4 * 10^10 == 40,000,000,000
2^30 == 1,073,741,824
so you would need at least 38 launches to render that image size, if the GPU could handle that amount of data.
With a launch dimension of 1024x1024 you would need 38,147 launches.
2.)
3 * 10^8 = 300,000,000
Are these 300 MTriangles individual triangles, or is there any geometry instancing happening?
I’m assuming these are individual triangles.
The limit of primitives per geometry acceleration structure (GAS) in OptiX is 2^29 == 536,870,912 so technically it’s possible to fit these into a single GAS, but the amount of temporary memory required to build that would most likely exceed the capacity of your RTX 3060.
This means that to make this work, the triangles would need to be split into multiple GAS placed under a top-level instance acceleration structure (IAS) using OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING, which is fully hardware accelerated on RTX boards.
You should also apply acceleration structure compaction to that data after the initial build of each GAS.
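As a rough sketch, the compaction flow in OptiX 7.x looks like this (not runnable as-is: the triangle build input, buffer allocations, and error checking are omitted, and names like d_temp, d_output, and d_compactedSize are placeholders):

```cpp
// Build with compaction allowed and ask OptiX to emit the compacted size.
OptixAccelBuildOptions buildOptions = {};
buildOptions.buildFlags = OPTIX_BUILD_FLAG_ALLOW_COMPACTION;
buildOptions.operation  = OPTIX_BUILD_OPERATION_BUILD;

OptixAccelEmitDesc emitDesc = {};
emitDesc.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
emitDesc.result = d_compactedSize; // CUdeviceptr to 8 bytes of device memory

OptixTraversableHandle gas = 0;
optixAccelBuild( context, stream, &buildOptions, &buildInput, 1,
                 d_temp, tempSize, d_output, outputSize,
                 &gas, &emitDesc, 1 );

// Read back the compacted size, then compact into a smaller buffer.
size_t compactedSize = 0;
cudaMemcpy( &compactedSize, (void*)d_compactedSize, sizeof(size_t), cudaMemcpyDeviceToHost );
if( compactedSize < outputSize )
{
    CUdeviceptr d_compacted = /* cudaMalloc( compactedSize ) */ 0;
    optixAccelCompact( context, stream, gas, d_compacted, compactedSize, &gas );
    // The original d_output buffer can be freed afterwards.
}
```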
I cannot say whether 13060 ms per launch with a 2^30 launch dimension is reasonable without knowing what happens inside your ray tracing implementation.
Your RTX 3060 is an entry-level board, and with that timing a 1024x1024 image would take about 12.75 ms, which may or may not be reasonable. It depends entirely on what you’ve implemented inside your ray tracing pipeline and on the memory accesses and resulting data sizes you need to transfer. I can’t answer that with the given information.
Thank you for your calculation.
Yes, I use an IAS with several GAS. The 13.060 s result came from the example optixTriangle inside the OptiX 7.7.0 SDK; I only changed the width (1024x32) and height (1024x32),
and added this benchmark code around the optixLaunch call:
// ===== Benchmark a single launch.
CUDA_SYNC_CHECK();
Timer launchTimer;
OPTIX_CHECK( optixLaunch( pipeline, stream, d_param, sizeof( Params ), &sbt, width, height, /*depth=*/1 ) );
CUDA_SYNC_CHECK();
// ===== Benchmark a single launch
std::cout << "optixLaunch+sync = " << launchTimer << " seconds\n";
and removed the display and file save code of the resulting image after that.
Then I called that executable multiple times with the two command lines:
optixTriangle.exe --dim=1024x1024
optixTriangle.exe --dim=32768x32768
and the best results were 0.000388 seconds and 0.0365543 seconds, respectively, on an RTX 6000 Ada.
That means 13 seconds on an RTX 3060 for a single launch of the optixTriangle example at 32768x32768 resolution seems unreasonable. That’s about 357 times slower than what I measured.
When benchmarking with CPU timers like the code above, did you add a CUDA synchronization command before starting your timer and after the optixLaunch? (See the CUDA_SYNC_CHECK() before the Timer.)
All OptiX API calls taking a CUDA stream argument, like optixAccelBuild and optixLaunch, launch CUDA kernels that run asynchronously to the CPU.
The OptiX SDK examples enable debug options inside the OptiX device code translation and the OptixModuleCompileOptions when building Debug targets. That makes the kernels ridiculously slow. (My own OptiX examples don’t do that.)
Running optixTriangle.exe --dim=32768x32768 in Debug takes 2 seconds instead of the 0.0365543 seconds measured before.
Also, once you arrive at more reasonable numbers, a lot of time will be spent reading the device data back to the host. Benchmark that separately.
Make sure that the GPU is installed inside a PCIe slot with 16 lanes (electrical) and uses them all.
If your system is a dual-CPU system, make sure the process is running on the CPU to which the GPU is attached, otherwise there can be unnecessary inter-CPU transfers (e.g. QPI).