Reasons for long first run of simple kernel


I’m running multiple kernels step by step in my program. The first run through this chain takes about 30 seconds. Running it again takes only about 1 second. This behavior is reproducible.
The kernel that consumes the most time in the first run is very simple:

RT_PROGRAM void gather()
{
    float4& data = photons_per_pixel[launch_index.y * launch_dim.x + launch_index.x];
    simulation_buffer[launch_index] = make_float3(data.x / color_factor, data.y / color_factor, data.z / color_factor);
}

The chain contains other kernels that run with the same dimensions and buffer sizes but do far more computation in less time.
So why would two lines of code take so much time?


The very first launch needs to build the programs and acceleration structures among other initialization tasks.
You can time that behavior by making these calls exactly once after you have created your OptiX scene:

m_context->validate();
m_context->compile();
m_context->launch(0, 0, 0); // Dummy launch to build everything.

Put timers around the code and you’ll see how long each step takes.
The next m_context->launch(0, width, height), which does the real work, should not repeat that first initialization unless you marked something dirty, changed the scene, added variables, etc.
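
A minimal sketch of such timers, assuming a C++11 compiler. The helper names time_ms and report are hypothetical and not part of OptiX. The commented usage assumes an existing optix::Context called m_context, as in the calls above.

```cpp
#include <chrono>
#include <functional>
#include <iostream>
#include <string>

// Runs `step` once and returns the elapsed wall-clock time in milliseconds.
double time_ms(const std::function<void()>& step)
{
    const auto start = std::chrono::steady_clock::now();
    step();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Times one named step, prints the result, and returns it.
double report(const std::string& name, const std::function<void()>& step)
{
    const double ms = time_ms(step);
    std::cout << name << ": " << ms << " ms\n";
    return ms;
}

// Intended use (assumes m_context, width and height exist in your application):
//   report("validate",     [&] { m_context->validate(); });
//   report("compile",      [&] { m_context->compile(); });
//   report("dummy launch", [&] { m_context->launch(0, 0, 0); });
//   report("real launch",  [&] { m_context->launch(0, width, height); });
```

If the dummy launch accounts for nearly all of the time and the real launch is fast, the cost is in initialization rather than in the kernel itself.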

Thank you for this advice. I’ll test it tomorrow.

Today I was able to speed up my software by about 10x by removing some unnecessary re-creation of acceleration objects.
Sadly, the suggestion above had no effect. The launch with “0,0,0” takes about 30 seconds, just like the first run of my chain did before.
Any other suggestions?

It’s hard to say whether anything more can be sped up without additional information.
Please describe the complexity of your scene and CUDA programs.
How long did the two steps validate and compile take?

If the majority of the startup time is spent in the first launch(), you would need to investigate how it changes when modifying these parameters:

  • Number of OptiX scene nodes.
  • Number of triangles or other primitives.
  • Acceleration structure builder type.

If it’s limited by the acceleration structure building time and you use the same scene multiple times, you could cache the acceleration structure on disk after the first build and load it when running the application the next time.
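
A minimal sketch of the disk side of such a cache, in plain C++. The helper names save_blob and load_blob are hypothetical. How you obtain and restore the bytes depends on your OptiX version: older releases exposed getDataSize(), getData() and setData() on the acceleration object, but treat that as an assumption and check the documentation for the version you use.

```cpp
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Writes the serialized acceleration data to `path`. Returns false on I/O failure.
bool save_blob(const std::string& path, const std::vector<char>& data)
{
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return false;
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}

// Reads the whole file at `path` into `data`. Returns false if it cannot be opened.
bool load_blob(const std::string& path, std::vector<char>& data)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    data.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}

// Intended use (assumption: `accel` is an optix::Acceleration whose version
// still supports serialization):
//   After the first build:
//     std::vector<char> bytes(accel->getDataSize());
//     accel->getData(bytes.data());
//     save_blob("scene.accelcache", bytes);
//   On a later start, before the first launch:
//     std::vector<char> bytes;
//     if (load_blob("scene.accelcache", bytes))
//         accel->setData(bytes.data(), bytes.size());
```

The cached file is only valid for the exact geometry and builder settings it was created from, so rebuild and overwrite it whenever the scene changes.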