How long does OptiX take to rasterize a large number of triangles?

Hello,

I’m a newbie to OptiX and ray tracing. I want to rasterize 310^8 triangles through ray tracing with OptiX 7.7. The image is about 410^10 pixels, and the time seems unacceptable: it takes almost 10 minutes.
Then I tried only 10 triangles with the same image size, and the time didn’t change. I used Nsight Systems to analyze the code; optixLaunch and the CUDA synchronization call took 99% of the time. I am confused: the data transfer only takes a few seconds, so why does it take so long, and how can I optimize it?

Any help would be appreciated, and sorry for my poor English.

My environment: CUDA 12.2 / Ubuntu 22.04 / OptiX 7.7.0 / CUDA driver 535.113.01

In that system configuration, what is your GPU?

Could you please clarify the two values 310^8 and 410^10?

Do you really mean 310 to the power of 8 and 410 to the power of 10?
Those are astronomical numbers and I cannot even imagine how to store that amount of data.

Either would exceed the limits on the number of triangles and the optixLaunch dimensions, so what are you doing to even measure that?
https://raytracing-docs.nvidia.com/optix8/guide/index.html#limits#limits

I see it now. It’s 3 * 10^8 and 4 * 10^10?

Yes, the multiplication symbol seems to have been swallowed by the forum.
The GPU is an NVIDIA GeForce RTX 3060. The image is so large that I split it into 1024x1024 tiles.

I mean, even if the scene data is empty, optixLaunch takes about 13,060 ms with a large launch size of 1024^3 (width x height x depth). Is it possible to optimize that? I might have to relaunch 30 times.

OK, so it’s 3 * 10^8 and 4 * 10^10, which are still huge numbers.

1.)
There is an OptiX launch size limitation of 2^30.
4 * 10^10 == 40,000,000,000
2^30 == 1,073,741,824
so you would need at least 38 launches to render that image size if the GPU could handle the amount of data.

With a launch dimension of 1024x1024 you would need 38,147 launches.
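For illustration, a minimal tiling loop could look like the sketch below. The names imageWidth, imageHeight, params, and the tileOriginX/tileOriginY launch parameter fields are hypothetical placeholders, not from the SDK example:

// Hypothetical sketch: render a huge image in 1024x1024 tiles.
// The tile offset is passed to the ray generation program via the launch parameters.
const unsigned int tileSize = 1024;
const unsigned int tilesX   = ( imageWidth  + tileSize - 1 ) / tileSize;
const unsigned int tilesY   = ( imageHeight + tileSize - 1 ) / tileSize;

for( unsigned int ty = 0; ty < tilesY; ++ty )
{
    for( unsigned int tx = 0; tx < tilesX; ++tx )
    {
        params.tileOriginX = tx * tileSize; // read inside the ray generation program
        params.tileOriginY = ty * tileSize;
        CUDA_CHECK( cudaMemcpyAsync( reinterpret_cast<void*>( d_param ), &params, sizeof( Params ), cudaMemcpyHostToDevice, stream ) );
        OPTIX_CHECK( optixLaunch( pipeline, stream, d_param, sizeof( Params ), &sbt, tileSize, tileSize, /*depth=*/1 ) );
    }
}
CUDA_SYNC_CHECK(); // wait for all tiles to finish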

2.)
3 * 10^8 = 300,000,000
Are these 300 MTriangles individual triangles, or is there any geometry instancing happening?
I’m assuming these are individual triangles.

The limit of primitives per geometry acceleration structure (GAS) in OptiX is 2^29 == 536,870,912 so technically it’s possible to fit these into a single GAS, but the amount of temporary memory required to build that would most likely exceed the capacity of your RTX 3060.
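If it helps, you can query the build memory requirements up front before committing to a single huge GAS. A sketch using the standard OptiX host API, where accelOptions and buildInput stand in for your actual build setup:

// Sketch: query the temporary and output memory a GAS build would need,
// to decide how the triangles must be partitioned into multiple GAS.
OptixAccelBufferSizes bufferSizes = {};
OPTIX_CHECK( optixAccelComputeMemoryUsage( context, &accelOptions, &buildInput, 1, &bufferSizes ) );
// bufferSizes.tempSizeInBytes   : scratch memory needed during the build
// bufferSizes.outputSizeInBytes : memory for the resulting (uncompacted) GAS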

That means, to make this work, the triangles would need to be split into multiple GAS and put under a top-level instance acceleration structure (IAS) using OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING, which is fully hardware accelerated on RTX boards.
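For reference, that flag goes into the pipeline compile options; a minimal sketch:

// Sketch: restrict the traversable graph to a single IAS over GAS,
// which is fully hardware accelerated on RTX boards.
OptixPipelineCompileOptions pipelineCompileOptions = {};
pipelineCompileOptions.traversableGraphFlags = OPTIX_TRAVERSABLE_GRAPH_FLAG_ALLOW_SINGLE_LEVEL_INSTANCING;
// ... fill in the remaining fields (numPayloadValues, pipelineLaunchParamsVariableName, etc.) as usual.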

You should also use acceleration structure compaction on that data after the initial build of each GAS.
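Compaction follows the usual three-step pattern: build with OPTIX_BUILD_FLAG_ALLOW_COMPACTION, emit the compacted size, then copy into a smaller buffer with optixAccelCompact. A sketch with hypothetical device buffers d_temp, d_output, d_compactedSize, and d_compacted:

// Sketch: build a GAS with compaction enabled, then shrink it.
accelOptions.buildFlags |= OPTIX_BUILD_FLAG_ALLOW_COMPACTION;

OptixAccelEmitDesc emitDesc = {};
emitDesc.type   = OPTIX_PROPERTY_TYPE_COMPACTED_SIZE;
emitDesc.result = d_compactedSize; // device pointer to a size_t

OPTIX_CHECK( optixAccelBuild( context, stream, &accelOptions, &buildInput, 1,
                              d_temp, bufferSizes.tempSizeInBytes,
                              d_output, bufferSizes.outputSizeInBytes,
                              &gasHandle, &emitDesc, 1 ) );
CUDA_SYNC_CHECK(); // make sure the build (and the emitted size) is done

size_t compactedSize = 0;
CUDA_CHECK( cudaMemcpy( &compactedSize, reinterpret_cast<void*>( d_compactedSize ), sizeof( size_t ), cudaMemcpyDeviceToHost ) );

if( compactedSize < bufferSizes.outputSizeInBytes )
{
    // Allocate d_compacted with compactedSize bytes, then:
    OPTIX_CHECK( optixAccelCompact( context, stream, gasHandle, d_compacted, compactedSize, &gasHandle ) );
    // d_output can be freed after the compaction has finished.
}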

I cannot say if 13060 ms per launch with a 2^30 launch dimension is reasonable without knowing what happens inside your ray tracing implementation.

Your RTX 3060 is an entry-level board, and at that rate a 1024x1024 tile would take about 12.75 ms (a 2^30 launch contains 1024 such tiles, and 13,060 ms / 1024 ≈ 12.75 ms), which may or may not be reasonable. It completely depends on what you’ve implemented inside your ray tracing pipeline, what the memory accesses are, and what resulting data sizes you need to transfer. I can’t answer that with the given information.

Thank you for your calculation.
Yes, I use an IAS with several GAS. The 13.060 s result I got was from the optixTriangle example inside the OptiX 7.7.0 SDK; I only changed the width (1024 x 32 = 32768) and height (1024 x 32 = 32768).

I added this Timer structure to optixTriangle:

#include <chrono>   // std::chrono::high_resolution_clock
#include <ostream>  // std::ostream

// Simple CPU timer: starts on construction, reports elapsed wall-clock seconds.
struct Timer
{
    Timer() { m_start = m_clock.now(); }

    double elapsed() const
    {
        std::chrono::duration<double> e = m_clock.now() - m_start;
        return e.count();
    }

    friend std::ostream& operator<<( std::ostream& out, const Timer& timer ) { return out << timer.elapsed(); }

    std::chrono::high_resolution_clock             m_clock;
    std::chrono::high_resolution_clock::time_point m_start;
};

and added this benchmark code around the optixLaunch call:


            // ===== Benchmark a single launch.
            CUDA_SYNC_CHECK(); // Make sure all previously issued asynchronous work has finished.
            Timer launchTimer;

            OPTIX_CHECK( optixLaunch( pipeline, stream, d_param, sizeof( Params ), &sbt, width, height, /*depth=*/1 ) );
            CUDA_SYNC_CHECK(); // Wait for the launch to complete before reading the timer.

            std::cout << "optixLaunch+sync = " << launchTimer << " seconds\n";

and removed the display and file-save code for the resulting image after that.

Then I called that executable multiple times with the two command lines:
optixTriangle.exe --dim=1024x1024
optixTriangle.exe --dim=32768x32768

and the best results were 0.000388 seconds and 0.0365543 seconds, respectively, on an RTX 6000 Ada.

That means 13 seconds on an RTX 3060 for a single launch of the optixTriangle example at 32768x32768 resolution seems unreasonable. That’s about 357 times slower than what I measured.

When benchmarking with CPU timers like the code above, did you add a CUDA synchronization command before starting your timer and after the optixLaunch? (See the CUDA_SYNC_CHECK() before the Timer.)
All OptiX API calls taking a CUDA stream argument, like optixAccelBuild and optixLaunch, launch CUDA kernels which run asynchronously to the CPU.
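An alternative to a CPU timer plus synchronization is timing with CUDA events recorded on the same stream; a minimal sketch:

// Sketch: time an asynchronous launch with CUDA events instead of a CPU clock.
cudaEvent_t start, stop;
CUDA_CHECK( cudaEventCreate( &start ) );
CUDA_CHECK( cudaEventCreate( &stop ) );

CUDA_CHECK( cudaEventRecord( start, stream ) );
OPTIX_CHECK( optixLaunch( pipeline, stream, d_param, sizeof( Params ), &sbt, width, height, /*depth=*/1 ) );
CUDA_CHECK( cudaEventRecord( stop, stream ) );
CUDA_CHECK( cudaEventSynchronize( stop ) ); // wait only for the stop event

float milliseconds = 0.0f;
CUDA_CHECK( cudaEventElapsedTime( &milliseconds, start, stop ) );
std::cout << "optixLaunch = " << milliseconds << " ms\n";

CUDA_CHECK( cudaEventDestroy( start ) );
CUDA_CHECK( cudaEventDestroy( stop ) );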

I tried the same code as yours, but the result is still unsatisfactory.
Perhaps the problem lies in the hardware configuration.

Thank you again for your answer.
Have a good weekend!

Did you benchmark in Release mode?

The OptiX SDK examples enable debug options inside the OptiX device code translation and the OptixModuleCompileOptions when building Debug targets. That makes the kernels ridiculously slow. (My own OptiX examples don’t do that.)
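For comparison, this is roughly how such a per-target switch of the module compile options looks; a sketch, the exact preprocessor conditions in the SDK examples may differ:

OptixModuleCompileOptions moduleCompileOptions = {};
#if !defined( NDEBUG ) // Debug target: no optimization, full debug info => very slow kernels.
moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_0;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_FULL;
#else // Release target: full optimization, no debug info.
moduleCompileOptions.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
moduleCompileOptions.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;
#endif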

Running optixTriangle.exe --dim=32768x32768 in Debug takes 2 seconds instead of the 0.0365543 seconds measured before.

Also, once you arrive at more reasonable numbers, a lot of time will be spent reading the device data back to the host. Benchmark that individually (see the sketch below).
Make sure that the GPU is installed inside a PCIe slot with 16 lanes (electrical) and uses them all.
If your system is a dual-CPU system, make sure the process is running on the CPU to which the GPU is attached, otherwise there can be unnecessary inter-CPU transfers (e.g. QPI).
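For the readback, something like this isolates the transfer cost, reusing the Timer above (h_image, d_image, and imageSizeInBytes are placeholders):

// Sketch: benchmark the device-to-host copy of the result image separately.
CUDA_SYNC_CHECK(); // make sure rendering has finished
Timer copyTimer;
CUDA_CHECK( cudaMemcpy( h_image, reinterpret_cast<void*>( d_image ), imageSizeInBytes, cudaMemcpyDeviceToHost ) );
std::cout << "readback = " << copyTimer << " seconds\n";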

Hi droettger,

I never noticed this issue before; I will try it.

I tried Release mode, and the time is reduced to 10 s now (size: 4 * 10^10 pixels).

Thanks for the suggestion!

OK, so 60x faster, from 10 minutes to 10 seconds, when measuring the correct thing. Nice!
