Question about the iCache

Many recent works generate a single mega kernel. This leads to a large code size and iCache misses (in part because of the many branches). How much iCache does each SM have on an A100? How can we observe the iCache hit ratio? And is there any way to mitigate the overhead of iCache misses, such as storing some instructions in the dCache and loading them asynchronously (assuming we can't shorten the code)?

It is reported as a stall reason in Nsight Compute, IIRC.

Try to avoid a mega kernel by combining more flexible code in a loop with control data that you load over the data path. You essentially emulate the large kernel, while the emulator itself stays small.

If that is not possible, try to process as much data as possible with the same or similar branches, to improve the ratio of data processed per instruction loaded.
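As a rough sketch of what that can look like (the kernel name and operation encoding here are invented for this example): decide the branch once and then run a grid-stride loop, so the instructions fetched for the loop body are amortized over many elements instead of being re-fetched per element.

```cuda
// Minimal sketch (hypothetical names): one branch decision covers a whole
// grid-stride loop, so the loop-body instructions are reused across many
// elements rather than re-decided per element.
__global__ void batched_op(int op, const float* a, const float* b,
                           float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (op == 0) {
        for (; i < n; i += stride) out[i] = a[i] + b[i];  // all "add" work
    } else {
        for (; i < n; i += stride) out[i] = a[i] * b[i];  // all "mul" work
    }
}
```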

If that is not possible either, you just have to accept that the approach is not very performant.

I'm sorry, I don't completely understand this. Could you please give me an example?

You define the base operations or building blocks your kernel performs, e.g. addition and multiplication (yours may have different ones). As data, you operate on addressable memory: shared memory and/or local/global memory with the L2 cache, and/or a small subset of registers that you select between.

Then you have one large for loop, where each iteration executes one virtualized instruction. You keep a virtual instruction counter in a variable and load the next instruction from data memory: the operator and the indices of the source and target operands. Inside the loop the source operands are loaded, a switch-case selects the operation, and the result is stored to the target.

The building blocks can be simple functions (addition, multiplication, conditional jump) or complex operations (FFT, sort). You want few building blocks that are repeated again and again inside your previously large kernel, so that the emulator is very small.
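Here is a minimal CUDA sketch of such an emulator, assuming a toy instruction set (add, multiply, conditional jump, halt) and a tiny per-thread register file; all names and the instruction encoding are invented for illustration:

```cuda
enum VOp { V_ADD, V_MUL, V_JMPNZ, V_HALT };

struct VInstr {
    VOp op;    // which building block to execute
    int dst;   // target register index
    int src0;  // first source register index
    int src1;  // second source register index, or jump target for V_JMPNZ
};

__global__ void emulator(const VInstr* program, const float* in,
                         float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float reg[8] = {0};   // small addressable "register file"
    reg[0] = in[tid];     // r0 = this thread's input

    int vpc = 0;          // virtual instruction counter
    for (;;) {            // the one large loop; its machine code stays small
        VInstr ins = program[vpc++];   // fetch the next virtual instruction
                                       // over the data path (L2/dCache)
        switch (ins.op) {              // dispatch to a building block
        case V_ADD:   reg[ins.dst] = reg[ins.src0] + reg[ins.src1]; break;
        case V_MUL:   reg[ins.dst] = reg[ins.src0] * reg[ins.src1]; break;
        case V_JMPNZ: if (reg[ins.src0] != 0.0f) vpc = ins.src1;    break;
        case V_HALT:  out[tid] = reg[ins.dst];                      return;
        }
    }
}
```

The point is that the "program" lives in data memory and is fetched over the data path, while the emulator's own machine code is small enough to stay resident in the instruction cache.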

If done well, the overhead is still huge, but less than for a mega kernel.

One variant is that a warp cooperates on a single virtual instruction, either a more complex one or a SIMD one. This speeds up memory accesses and reduces warp divergence.
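A sketch of that warp-cooperative variant, reusing the hypothetical VInstr/VOp types from the previous sketch (and assuming blocks are launched as whole warps): lane 0 fetches each instruction and broadcasts it with __shfl_sync, and every lane then handles one element of a 32-wide virtual register, so all lanes take the same switch branch.

```cuda
__global__ void warp_emulator(const VInstr* program,
                              float* regs /* nregs x 32 floats per warp */)
{
    const int lane = threadIdx.x & 31;

    int vpc = 0;
    for (;;) {
        // Lane 0 fetches; the fields are broadcast to the rest of the warp,
        // so only one instruction fetch from memory is needed per step.
        VInstr ins = {V_HALT, 0, 0, 0};
        if (lane == 0) ins = program[vpc];
        ins.op   = (VOp)__shfl_sync(0xffffffff, (int)ins.op, 0);
        ins.dst  = __shfl_sync(0xffffffff, ins.dst, 0);
        ins.src0 = __shfl_sync(0xffffffff, ins.src0, 0);
        ins.src1 = __shfl_sync(0xffffffff, ins.src1, 0);
        vpc++;

        // Each lane processes one element of the 32-wide virtual registers,
        // so all 32 lanes take the same case of the switch (no divergence).
        float a = regs[ins.src0 * 32 + lane];
        float b = regs[ins.src1 * 32 + lane];
        switch (ins.op) {
        case V_ADD:  regs[ins.dst * 32 + lane] = a + b; break;
        case V_MUL:  regs[ins.dst * 32 + lane] = a * b; break;
        case V_HALT: return;
        default:     break;   // V_JMPNZ etc. omitted in this sketch
        }
    }
}
```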

The L0 instruction cache is 32KiB per processing block per SM, according to page 32 of this GTC presentation, with a further 128KiB shared with the constant cache across the SM.

If you’ve not already seen it, this blog post may be helpful.

As outlined in the above blog, excessive misses show up as “Stall No Instruction” entries.