High Stall MIO Throttle

I’m analyzing my program in Nsight Compute and my warps are experiencing stalls due to MIO Throttle and Long Scoreboard. In the context of an OptiX program, could MIO Throttle be warps waiting for a response from the RT cores? What are some ways to go about decreasing these stall times?

That happens when you’re overloading the memory IO FIFO.
https://docs.nvidia.com/drive/archive/drive_os_5.1.12.0L/nsight-graphics/activities/index.html#shaderprofiler_stallreasons
https://stackoverflow.com/questions/66233462/when-does-mio-throttle-stall-happen

Data dependency for RT cores should appear in the short scoreboard.

Since you’re comparing two different Nsight Compute runs in that image, you seem to have changed something. Did the earlier version (the green bar) work better, or did it just do less work?

Check the memory bandwidth overview as well to see if there have been changes to the memory accesses between these two runs.

Why do you think it’s waiting on the RT cores?
You should look at the source code view (CUDA and SASS) to see where you’re spending the time inside your kernel. If it’s waiting on the RT cores there should be some stalls after the optixTrace calls.

Without knowing what the program does (aside from guessing from your other questions on this forum), what the differences between the two compared runs are, or what the system configuration is, there is nothing to work with to tell you what to change to make your MIO stalls disappear.

Do this to optimize ray tracing kernels:

  • Never ever benchmark or profile debug device code or with disabled compiler optimizations. Only profile fully optimized release device code!
  • Look into the Nsight Compute source view (CUDA C++ (or PTX) and SASS) and check which statements are the bottlenecks.
    https://raytracing-docs.nvidia.com/optix8/guide/index.html#program_pipeline_creation#program-input
  • Reduce memory accesses.
  • Make memory accesses faster by using vectorized load and store instructions (prefer 2- and 4-component vectors; 3-component vectors are loaded as three scalars). See the sketch after this list.
  • Make memory accesses faster by accessing cache lines better. (Think of 32 bytes and 64 bytes per load/store granularity.)
  • Reduce live variables around optixTrace and optixDirectCall/optixContinuationCall functions inside your OptiX device code.
  • Reduce the OptiX stack size. (Prefer iterative over recursive algorithms.)
  • And all other things to optimize C++ code.
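
For the vectorized load/store and cache-line points above, here is a minimal, hypothetical CUDA sketch (the struct and kernel names are illustrative, not from your program) showing why a 16-byte element loads with one instruction while a 12-byte float3 element needs three scalar loads:

```
// Hypothetical sketch; names are not from the actual program.
// A 12-byte float3 element compiles to three scalar LDG.32 loads, while a 16-byte
// float4 element compiles to a single LDG.128 and keeps elements aligned so that
// consecutive threads in a warp cover whole cache lines.

struct VertexF3 { float3 position; };   // 12 bytes -> three scalar loads
struct VertexF4 { float4 position; };   // 16 bytes -> one vectorized load

extern "C" __global__ void copyPositions(const VertexF4* __restrict__ in,
                                         float4* __restrict__ out,
                                         unsigned int count)
{
    const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;

    // One 16-byte load and one 16-byte store per thread; thread i touches
    // bytes [16*i, 16*i + 16), so a warp reads four full 128-byte cache lines.
    out[i] = in[i].position;
}
```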

Thanks for the extremely detailed response, let me try to give more details to the best of my ability.

Look into the Nsight Compute source view (CUDA C++ (or PTX) and SASS) and check which statements are the bottlenecks.

So regarding MIO, the source-level view of my device code did not show a single line as a source of MIO stalls; however, I found multiple lines that are sources of Long Scoreboard stalls, which I plan to address.

Reduce memory accesses.

Our optimizations actually significantly reduced the number of memory accesses, as shown by the instruction stats below. (Blue bar is optimized, green bar is original.)

Make memory accesses faster by using vectorized load and store instructions (prefer 2- and 4-component vectors, 3-component vectors are loaded as three scalars).
Make memory accesses faster by accessing cache lines better. (Think of 32 bytes and 64 bytes per load/store granularity.)

Yep will definitely look into doing this, thanks!

Reduce live variables around optixTrace and optixDirectCall/optixContinuationCall functions inside your OptiX device code.

Would looking at the register usage in source view around these calls be a good place to start?

Overall we have noticed that our program actually runs 2x-3x faster when comparing the program represented by the blue bar to the one represented by the green bar. This could mainly be due to the 70% reduction in instructions executed because of our optimizations. However, the large increase in warp stalling is concerning and we are having trouble identifying its source.

Here is the memory workload analysis also showing the large drop in memory requests.

Excellent improvement!

The memory bandwidth looks fine. No host memory accesses, very few device memory accesses, lots of L1 and L2 cache hits.

All the other instruction reductions are awesome. (I assume the IADD reduction comes from a lot less pointer chasing inside the code?)

I can’t say what else could be improved to reduce MIO stalls. Maybe you’re at the peak of what your GPU can handle, which I can’t tell because you still haven’t provided information about your system configuration.

The overall memory throughput of 2.54 GB/sec doesn’t sound high enough to saturate any current GPU.

Your average of 27.6 active threads per warp (of 32 at maximum) in the original report also looks OK.
You wouldn’t reach that value with a path tracer, for example, where the ray divergence kills the occupancy. That’s why Shader Execution Reordering is a thing.

Maybe the memory access pattern isn’t good enough, like gathering data from too many different addresses instead of hitting cache lines nicely.
Memory layout changes to Structure of Arrays (SoA) can help there, but only if the kernels walk along these arrays instead of jumping around.
But it also makes sense to keep things interleaved in arrays of small structures if all data of the structure is needed anyway (e.g. vertex attributes). Still, read vectorized data types in order of increasing memory addresses.
Make sure structure fields are ordered by their CUDA alignment requirements from big to small so the compiler doesn’t insert unnecessary padding between the fields.
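
As an illustration of that last point, a small sketch (the field names are made up) of how ordering by alignment removes compiler-inserted padding:

```
// Illustrative field names. CUDA alignments: float4 = 16, float2 = 8, float/int = 4.

struct Padded            // 48 bytes: padding inserted to satisfy float4 alignment.
{
    float  u;            //  4 bytes, then 12 bytes of padding
    float4 normal;       // 16 bytes at offset 16
    float2 uv;           //  8 bytes
    int    materialId;   //  4 bytes, then 4 bytes of tail padding (struct align = 16)
};

struct Packed            // 32 bytes: fields ordered big to small, no padding at all.
{
    float4 normal;       // 16 bytes
    float2 uv;           //  8 bytes
    float  u;            //  4 bytes
    int    materialId;   //  4 bytes
};
```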

Live variables are simply the variables whose data you need both before and after an OptiX device call like optixTrace or a callable. Sometimes it’s faster to recalculate a value after the call instead of keeping it live across it.
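
A hedged raygen sketch of that idea (the launch parameter struct, ray setup, and payload layout are placeholders, not your program's code):

```
#include <optix.h>

// Placeholder launch parameters; only the traversable handle is shown.
struct Params { OptixTraversableHandle handle; };
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__example()
{
    const float3 origin    = { 0.0f, 0.0f, 0.0f };    // placeholder ray
    const float3 direction = { 0.0f, 0.0f, 1.0f };

    // If this were computed here and used after the trace, its registers would be
    // live across optixTrace and would have to be preserved around the call:
    // const float3 invDir = { 1.0f / direction.x, 1.0f / direction.y, 1.0f / direction.z };

    unsigned int p0 = 0u;                             // payload register
    optixTrace(params.handle, origin, direction,
               0.0f, 1e16f, 0.0f,                     // tmin, tmax, ray time
               OptixVisibilityMask(255), OPTIX_RAY_FLAG_NONE,
               0u, 1u, 0u,                            // SBT offset, stride, miss index
               p0);

    // Recomputing after the call is often cheaper than keeping the value live across it.
    const float3 invDir = { 1.0f / direction.x, 1.0f / direction.y, 1.0f / direction.z };
    // ... use invDir and p0 here ...
    (void)invDir;
}
```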

The number of registers used inside the OptiX kernel can be controlled with the OptixModuleCompileOptions maxRegisterCount value, where OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT usually results in 128 registers of the 255 available (one is reserved).

The number of registers affects how many blocks can be launched. There is a graph inside Nsight Compute which shows that relation; it’s a stair-step curve, so the graph shows which values actually make a difference.

So if you see that you’re using a lot fewer registers than 128, like 64 or 32, it makes sense to try setting maxRegisterCount to a smaller value and benchmark that.
On the other hand, if you see register spilling (lots of STL/LDL instructions around your OptiX calls inside the SASS code), which can come from live variables, then it might make sense to raise that maxRegisterCount to 255 and benchmark that. Of course, reducing the live variables should still be tried. There is a cliff at 128, so when going over that, there isn’t much difference between 192 and the maximum of 255.
If the performance doesn’t change and the SASS code doesn’t show reduced memory accesses, then just keep maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT.
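
A minimal host-side sketch of where that knob lives (the surrounding module/pipeline creation is omitted; the value you pass in is just something to benchmark):

```
#include <optix.h>

// Host-side sketch; only the module compile options are shown.
OptixModuleCompileOptions makeModuleCompileOptions(int maxRegisterCount)
{
    OptixModuleCompileOptions options = {};

    options.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;   // never profile debug code
    options.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

    // Default is OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT (0), which usually ends up
    // at 128 registers. Lower it if the kernel uses far fewer registers; raise it
    // towards 255 if the SASS shows spilling (STL/LDL) around the OptiX calls.
    options.maxRegisterCount = maxRegisterCount;

    return options;
}
```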