How can I tell whether my kernel will thrash the instruction cache?

In my CUDA programming experience, I have learned to track register usage, manage L1 memory (that may be 75% of my job, in a sense), and tune block granularity. But I have not given much thought to the instruction cache, which someone mentioned in one of my other recent posts. I have a pretty big kernel (various incarnations of it likely run 1000+ lines), and I am looking to add more code via #include "new_code.cui" (I just call CUDA include files .cui, not for love of edible Peruvian rodents).

Can anyone suggest a means to assess, during compilation, whether a kernel threatens to overwhelm (and, I would imagine, thrash) the instruction cache? What sorts of performance degradation can I expect if my kernel starts doing that?


By definition, cache thrashing is an interaction between running code and hardware. I don’t know how one could diagnose that by static analysis, whether it be thrashing of data cache or instruction cache.

Check the CUDA profiler documentation to see whether it has instruction cache metrics available, or inquire about relevant metrics in the profiler sub-forum.

From the Kepler architecture time frame I recall a case of capacity misses caused by the body of an innermost loop spanning more memory than the instruction cache could hold. If memory serves, the performance penalty was about 3%.

I would be surprised to find a case of significant performance impact from instruction cache misses in a CUDA application, but I do not see how the possibility can be excluded, especially in programs with a lot of control transfers.

My (limited) experience on Pascal has shown a low single-digit percentage impact, which shows up in the profiler's instruction fetch latency data.

An outdated (in the GPU-architecture sense) observation from Scott Gray:

“Our unroll factor is the number of lines we read at a time from A and B, store/read from shared, and compute. It will be constrained in several ways. We want to be able to hide the latencies of our texture loads by as much computational work as possible. However we don’t want the size of the loop to exceed the size of the instruction cache. Doing so would add additional instruction fetch latencies that we’d then need to hide. On Maxwell I’ve measured this cache to be 8KB. So this means we don’t want a loop size of more than 1024 8 byte instructions, where every 4th instruction is a control code. So 768 is the limit of useful instructions. Then there are also instruction alignment considerations so you’ll want to be safely under that value as well. In short, using a loop unroll factor of 8 gives us 8 x 64 = 512 ffma instructions plus the additional memory and integer arithmetic instructions needed for the loop (about 40). This puts us well under 768. Eight lines per loop also aligns well with the dimensions of our texture memory loads. Finally, 512 FFMAs should be enough to mostly hide the 200+ clock texture load latency.”
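The budget arithmetic in that quote is easy to reproduce. A quick sketch using the numbers Gray cites (8 KB instruction cache, 8-byte instruction slots, every 4th slot a control code):

```python
# Numbers from Scott Gray's Maxwell analysis quoted above
icache_bytes = 8 * 1024            # measured instruction cache size
slot_bytes = 8                     # one instruction (or control code) per 8 bytes
slots = icache_bytes // slot_bytes # 1024 slots fit in the cache
useful = slots - slots // 4        # every 4th slot is a control code -> 768

unroll = 8
ffma = unroll * 64                 # 512 FFMA instructions in the unrolled loop body
loop_total = ffma + 40             # plus ~40 memory/integer instructions for the loop

print(slots, useful, loop_total)   # 1024 768 552 -- comfortably under the limit
```

The ~216-slot margin is what leaves room for the alignment considerations he mentions.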

More modern GPUs have significantly larger instruction caches - see table 3.1 here.

Well, instruction cache sizes certainly had to increase in newer GPU architectures, given that each instruction takes up 16 bytes these days: 8 bytes for the instruction proper plus 8 bytes for the associated “op-steering” control block.
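One rough compile-time sanity check (a footprint estimate, not thrashing detection): dump the kernel's SASS with `cuobjdump -sass` and count the instruction lines, then compare the total at 16 bytes per instruction against the instruction cache size. A minimal sketch of the counting step in Python, run here on a hypothetical three-line SASS excerpt rather than real compiler output:

```python
import re

# Abbreviated sample in the format cuobjdump -sass emits; each instruction
# line starts with a hex byte offset in /*....*/ comment markers.
sass = """
        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/                   S2R R0, SR_CTAID.X ;
        /*0020*/                   FFMA R4, R2, R3, R4 ;
"""

insts = re.findall(r"/\*[0-9a-f]{4}\*/", sass)
footprint = len(insts) * 16     # 16 bytes per instruction on Volta and newer
print(len(insts), footprint)    # 3 instructions, 48 bytes
```

This only tells you how large the straight-line footprint is; whether that footprint actually causes misses depends on the dynamic execution pattern.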

Looping is not the worst case scenario for instruction cache misses. I am hypothesizing a worst case scenario would involve kernels with multiple called subroutines that conflict in the cache based on their addresses. I don’t know anything about the associativity of the instruction caches on GPUs, but it is reasonable to assume it is low, and these caches may even be direct mapped. Typical kernels do not require high associativity, and there is hardware expenditure for set associative caches, with those transistors probably better spent to increase performance through other means.
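That conflict scenario is easy to model. Here is a toy sketch of a direct-mapped instruction cache (all sizes assumed, nothing GPU-specific): two subroutines whose code lies exactly one cache size apart map to the same cache lines, so alternating calls evict each other on every pass:

```python
# Toy direct-mapped instruction cache model (hypothetical sizes)
LINE = 128                 # bytes per cache line (assumed)
LINES = 64                 # 8 KB cache / 128 B line -> 64 lines
cache = [None] * LINES     # stores the tag currently held in each line
misses = 0

def fetch(addr):
    """Simulate fetching the line containing addr; count misses."""
    global misses
    idx = (addr // LINE) % LINES     # direct-mapped: index from address
    tag = addr // (LINE * LINES)     # remaining high bits
    if cache[idx] != tag:
        misses += 1
        cache[idx] = tag

# Subroutine A at offset 0x0000, subroutine B exactly 8 KB away at 0x2000:
# both bodies map to line indices 0..3 but carry different tags.
for _ in range(100):                        # caller alternates A and B
    for a in range(0, 512, LINE):
        fetch(0x0000 + a)                   # A's 512-byte body
    for a in range(0, 512, LINE):
        fetch(0x2000 + a)                   # B's 512-byte body

print(misses)   # 800: every single fetch misses -- pathological thrashing
```

With 2-way or higher associativity the two bodies could coexist; in the direct-mapped case every instruction fetch in this call pattern misses, which is the kind of address-dependent pathology I had in mind.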

Thanks as always, Norbert! This makes me feel a lot better. I don’t have loops nearly that big, and while I do have multiple code pathways separated by a switch, the various pathways that could run simultaneously are already in the kernel. Further code will be added after successive __syncthreads(); calls, so even if the instruction cache is a constraint, the new features will not compete with the old ones for cache space. So many things to think about, but it’s good to know that this is not really one of them.