High Stall MIO Throttle

Excellent improvement!

The memory bandwidth looks fine. No host memory accesses, very few device memory accesses, lots of L1 and L2 cache hits.

All the other instruction reductions are awesome. (I assume the IADD reduction comes from a lot less pointer chasing inside the code?)

I can’t say what else could be improved to reduce the MIO stalls. Maybe you’re already at the peak of what your GPU can handle, but I can’t judge that because you still haven’t provided information about your system configuration.

The overall memory throughput of 2.54 GB/sec doesn’t sound high enough to saturate any current GPU.

Your average of 27.6 active threads per warp (of the maximum 32) in the original report also looks OK.
You wouldn’t reach that value with a path tracer, for example, where ray divergence kills the warp utilization. That’s why Shader Execution Reordering is a thing.

Maybe the memory access pattern isn’t good enough, like gathering data from too many different addresses instead of hitting cache lines nicely.
Changing the memory layout to structure-of-arrays (SoA) can help there, but only if the kernels walk along these arrays instead of jumping around.
On the other hand, it also makes sense to keep things interleaved in arrays of small structures if all data of the structure is needed anyway (e.g. vertex attributes). Still, read vectorized data types in order of increasing memory addresses.
Make sure structure fields are ordered by their CUDA alignment requirements from big to small, so the compiler doesn’t insert unnecessary padding between the fields.

Live variables are simply the variables whose data you need both before and after an OptiX device call like optixTrace or a callable. Sometimes it’s faster to recalculate a value instead of keeping it live.
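As a sketch (device code, needs the OptiX SDK to compile; `params`, `computeRay`, `computeWeight` and `accumulate` are hypothetical helpers, not from the original post):

```cpp
#include <optix.h>

extern "C" __global__ void __raygen__sketch()
{
    const uint3 launchIndex = optixGetLaunchIndex();

    float3 origin, direction;
    computeRay(launchIndex, origin, direction); // hypothetical helper

    // 'pixelWeight' is used after the trace. If it is held across
    // optixTrace, it's a live variable: it costs register pressure or
    // spills to local memory for the duration of the call.
    const float pixelWeight = computeWeight(launchIndex); // hypothetical

    unsigned int p0 = 0; // payload register
    optixTrace(params.handle, origin, direction,
               0.0f, 1e16f, 0.0f, OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE, 0, 1, 0, p0);

    // Alternative: recompute the value here instead of keeping it live,
    // if computeWeight() is cheap relative to a spill/reload:
    // const float pixelWeight = computeWeight(launchIndex);

    accumulate(launchIndex, pixelWeight * __uint_as_float(p0)); // hypothetical
}
```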

The number of registers used inside the OptiX kernel can be controlled with the OptixModuleCompileOptions maxRegisterCount value. The OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT usually results in 128 of the 255 available registers (one is reserved).

The number of registers affects how many blocks can run concurrently. There is a graph inside Nsight Compute which shows that relation; it looks like a stair step, and it tells you which values make sense to try.

So if you see that you’re using a lot fewer registers than 128, like 64 or 32, it makes sense to try setting maxRegisterCount to a smaller value and benchmark that.
On the other hand, if you see register spilling (lots of STL/LDL instructions around your OptiX calls inside the SASS code), which can come from live variables, then it might make sense to raise maxRegisterCount to 255 and benchmark that. Of course, reducing the live variables should still be tried first. There is a cliff at 128, so once you go over that, there isn’t much difference between 192 and the maximum of 255.
If the performance doesn’t change and the SASS code doesn’t show reduced memory accesses, then just keep maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT.
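The host-side setting is just one field; a minimal sketch of the benchmarking variants described above (assumes an existing OptiX module creation path, e.g. optixModuleCreate, or optixModuleCreateFromPTX in older SDKs):

```cpp
#include <optix.h>

OptixModuleCompileOptions mco = {};
mco.optLevel   = OPTIX_COMPILE_OPTIMIZATION_LEVEL_3;
mco.debugLevel = OPTIX_COMPILE_DEBUG_LEVEL_NONE;

// Default behavior (usually 128 registers):
mco.maxRegisterCount = OPTIX_COMPILE_DEFAULT_MAX_REGISTER_COUNT;

// Candidate to benchmark if Nsight Compute shows low register use:
// mco.maxRegisterCount = 64;

// Candidate to benchmark if SASS shows STL/LDL spilling around OptiX calls:
// mco.maxRegisterCount = 255;

// Then pass mco into the module creation call and measure each variant.
```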