How can I tell whether my kernel will thrash the instruction cache?

In my CUDA programming experience, I have learned to track register usage, manage L1 memory (that may be 75% of my job, in a sense), and tune block granularity. But I have not given much thought to the instruction cache, which someone mentioned in one of my other recent posts. I have a pretty big kernel (various incarnations of it likely run 1000+ lines), and I am looking to add more code via #include "new_code.cui" (I just call CUDA include files .cui, not for love of edible Peruvian rodents).

Can anyone suggest a means to assess, during compilation, whether a kernel threatens to overwhelm (and, I would imagine, thrash) the instruction cache? What sorts of performance degradation can I expect if my kernel starts doing that?


By definition, cache thrashing is an interaction between running code and hardware. I don’t know how one could diagnose that by static analysis, whether it be thrashing of data cache or instruction cache.

Check the CUDA profiler documentation to see whether it has instruction cache metrics available, or inquire about relevant metrics in the profiler sub-forum.

From the Kepler architecture time frame I recall a case of capacity misses caused by the body of an innermost loop spanning more memory than the instruction cache could hold. If memory serves, the performance penalty was about 3%.

I would be surprised to find a case of significant performance impact from instruction cache misses in a CUDA application, but I do not see how the possibility can be excluded, especially in programs with a lot of control transfers.

My (limited) experience on Pascal has shown a low single-digit percentage impact, which shows up in the profiler's instruction fetch latency data.

An outdated (in the GPU-architecture sense) observation from Scott Gray:

“Our unroll factor is the number of lines we read at a time from A and B, store/read from shared, and compute. It will be constrained in several ways. We want to be able to hide the latencies of our texture loads by as much computational work as possible. However we don’t want the size of the loop to exceed the size of the instruction cache. Doing so would add additional instruction fetch latencies that we’d then need to hide. On Maxwell I’ve measured this cache to be 8KB. So this means we don’t want a loop size of more than 1024 8 byte instructions, where every 4th instruction is a control code. So 768 is the limit of useful instructions. Then there are also instruction alignment considerations so you’ll want to be safely under that value as well. In short, using a loop unroll factor of 8 gives us 8 x 64 = 512 ffma instructions plus the additional memory and integer arithmetic instructions needed for the loop (about 40). This puts us well under 768. Eight lines per loop also aligns well with the dimensions of our texture memory loads. Finally, 512 FFMAs should be enough to mostly hide the 200+ clock texture load latency.”
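The budget arithmetic in that quote is easy to reproduce. A quick sketch using the numbers Gray cites (8 KB instruction cache, 8-byte instruction slots, every 4th slot a control code):

```python
# Numbers from Scott Gray's Maxwell analysis quoted above
icache_bytes = 8 * 1024            # measured instruction cache size
slot_bytes = 8                     # one instruction (or control code) per 8 bytes
slots = icache_bytes // slot_bytes # 1024 slots fit in the cache
useful = slots - slots // 4        # every 4th slot is a control code -> 768

unroll = 8
ffma = unroll * 64                 # 512 FFMA instructions in the unrolled loop body
loop_total = ffma + 40             # plus ~40 memory/integer instructions for the loop

print(slots, useful, loop_total)   # 1024 768 552 -- comfortably under the limit
```

The ~216-slot margin is what leaves room for the alignment considerations he mentions.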

More modern GPUs have significantly larger instruction caches - see table 3.1 here.

Well, instruction cache sizes certainly had to increase in newer GPU architectures, given that each instruction takes up 16 bytes these days: 8 bytes for the instruction proper plus 8 bytes for the associated “op-steering” control block.
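One rough compile-time sanity check (a footprint estimate, not thrashing detection): dump the kernel's SASS with `cuobjdump -sass` and count the instruction lines, then compare the total at 16 bytes per instruction against the instruction cache size. A minimal sketch of the counting step in Python, run here on a hypothetical three-line SASS excerpt rather than real compiler output:

```python
import re

# Abbreviated sample in the format cuobjdump -sass emits; each instruction
# line starts with a hex byte offset in /*....*/ comment markers.
sass = """
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/                   S2R R0, SR_CTAID.X ;
        /*0020*/                   FFMA R4, R2, R3, R4 ;
"""

insts = re.findall(r"/\*[0-9a-f]{4}\*/", sass)
footprint = len(insts) * 16     # 16 bytes per instruction on Volta and newer
print(len(insts), footprint)    # 3 instructions, 48 bytes
```

This only tells you how large the straight-line footprint is; whether that footprint actually causes misses depends on the dynamic execution pattern.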

Looping is not the worst case scenario for instruction cache misses. I am hypothesizing a worst case scenario would involve kernels with multiple called subroutines that conflict in the cache based on their addresses. I don’t know anything about the associativity of the instruction caches on GPUs, but it is reasonable to assume it is low, and these caches may even be direct mapped. Typical kernels do not require high associativity, and there is hardware expenditure for set associative caches, with those transistors probably better spent to increase performance through other means.
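That conflict scenario is easy to model. Here is a toy sketch of a direct-mapped instruction cache (all sizes assumed, nothing GPU-specific): two subroutines whose code lies exactly one cache size apart map to the same cache lines, so alternating calls evict each other on every pass:

```python
# Toy direct-mapped instruction cache model (hypothetical sizes)
LINE = 128                 # bytes per cache line (assumed)
LINES = 64                 # 8 KB cache / 128 B line -> 64 lines
cache = [None] * LINES     # stores the tag currently held in each line
misses = 0

def fetch(addr):
    """Simulate fetching the line containing addr; count misses."""
    global misses
    idx = (addr // LINE) % LINES     # direct-mapped: index from address
    tag = addr // (LINE * LINES)     # remaining high bits
    if cache[idx] != tag:
        misses += 1
        cache[idx] = tag

# Subroutine A at offset 0x0000, subroutine B exactly 8 KB away at 0x2000:
# both bodies map to line indices 0..3 but carry different tags.
for _ in range(100):                        # caller alternates A and B
    for a in range(0, 512, LINE):
        fetch(0x0000 + a)                   # A's 512-byte body
    for a in range(0, 512, LINE):
        fetch(0x2000 + a)                   # B's 512-byte body

print(misses)   # 800: every single fetch misses -- pathological thrashing
```

With 2-way or higher associativity the two bodies could coexist; in the direct-mapped case every instruction fetch in this call pattern misses, which is the kind of address-dependent pathology I had in mind.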

Thanks as always, Norbert! This makes me feel a lot better. I don’t have loops nearly that big, and while I do have multiple code pathways separated by a switch, the various pathways that could run simultaneously are already in the kernel. Further code will be added after successive __syncthreads(); calls, so even if the instruction cache is a constraint, the new features will not compete with the old ones for cache space. So many things to think about, but it’s good to know that this is not really one of them.