Improving GPU Performance by Reducing Instruction Cache Misses

Originally published at: https://developer.nvidia.com/blog/improving-gpu-performance-by-reducing-instruction-cache-misses/

Instruction cache misses can cause performance degradation for kernels that have a large instruction footprint, which is often caused by substantial loop unrolling.

Thanks for this. If it’s not too much trouble, it would be interesting to have Figures 1 and 2 updated with the results from the optimal loop unroll solution.

Thanks for the feedback. Below are the updated figures for the optimal unrolling factors. In Figure 1 you can see that the critical metric “Stall No Instruction” has dropped significantly and is now virtually the same for all workload sizes. Figure 2 shows that icc misses have fallen to virtually zero. Another benefit of optimal unrolling is that the total number of icc requests has gone down, which means that more instructions are fetched from the level-0 instruction cache instead of being requested from icc.
[image: updated Figures 1 and 2 for the optimal unrolling factors]
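
For anyone who wants to experiment with unrolling factors on their own code, here is a minimal sketch. It is not the kernel from the article. It assumes a hypothetical kernel `repeat_fma` whose unroll factor is a template parameter `UNROLL` passed to `#pragma unroll`, so several variants can be built from the same source and profiled side by side. The names `repeat_fma` and `run_variant`, the data values, and the launch configuration are all made up for illustration.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread performs `iters` fused multiply-adds.
// UNROLL only controls how many copies of the loop body the compiler is
// asked to emit; it changes the instruction footprint, not the result.
template <int UNROLL>
__global__ void repeat_fma(const float* x, float* y, float a, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float xi  = x[i];
    float acc = y[i];

    // An unroll factor of 1 tells the compiler not to unroll this loop.
    #pragma unroll (UNROLL)
    for (int k = 0; k < iters; ++k) {
        acc = fmaf(a, xi, acc);
    }
    y[i] = acc;
}

// Hypothetical host helper: runs one variant on x = 1, y = 0, a = 1
// and returns the first output element (error checking omitted for brevity).
template <int UNROLL>
float run_variant(int n, int iters)
{
    assert(n > 0 && iters >= 0);

    std::vector<float> hx(n, 1.0f), hy(n, 0.0f);
    float* dx = nullptr;
    float* dy = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks  = (n + threads - 1) / threads;
    repeat_fma<UNROLL><<<blocks, threads>>>(dx, dy, 1.0f, n, iters);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(hy.data(), dy, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return hy[0];
}
```

Calling `run_variant<1>(256, 64)` and `run_variant<8>(256, 64)` should return the same value, since the unroll factor affects only the size of the generated code. Which factor works best depends on the kernel and the GPU, so it has to be found by profiling each variant and comparing metrics such as “Stall No Instruction”.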

Thanks Rob, that’s a great illustration of the solution, and thanks for taking the trouble to respond.

If this is not already in progress, I’m sure others would benefit from seeing these charts added to the original blog article.