Question about the iCache

Many recent works generate a single mega kernel. This leads to a large code size and iCache misses (in part because of the many branches). How much iCache does each SM have on an A100? How can we observe the iCache hit ratio? And is there any way to mitigate the overhead of iCache misses, such as storing some instructions in the dCache and loading them asynchronously (assuming we can't shorten the code)?

It is reported as a stall reason in Nsight Compute, IIRC.

Try to avoid a mega kernel by combining more flexible code in a loop with control data that you load over the data path. You essentially emulate the large kernel, while the emulator itself stays small.

If that is not possible, try to process as much data as possible with the same or similar branches, to improve the ratio of data processed per instruction loaded.
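As a rough sketch of what that can look like (the kernel name and operation encoding here are invented for this example): decide the branch once and then run a grid-stride loop, so the instructions fetched for the loop body are amortized over many elements instead of being re-fetched per element.

```cuda
// Minimal sketch (hypothetical names): one branch decision covers a whole
// grid-stride loop, so the loop-body instructions are reused across many
// elements rather than re-decided per element.
__global__ void batched_op(int op, const float* a, const float* b,
                           float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    if (op == 0) {
        for (; i < n; i += stride) out[i] = a[i] + b[i];  // all "add" work
    } else {
        for (; i < n; i += stride) out[i] = a[i] * b[i];  // all "mul" work
    }
}
```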

If that is not possible either, you just have to accept that the approach is not very performant.

I'm sorry, I don't completely understand this. Could you please give me an example?

You define the base operations or building blocks your kernel performs, e.g. addition and multiplication (yours may have different ones). As data, you operate on addressable memory: shared memory and/or local/global memory with the L2 cache, and/or a small subset of registers that you select between.

Then you have one large for loop, where each iteration executes one virtualized instruction. You keep a virtual instruction counter in a variable and load the next instruction from data memory: the operator and the indices of the source and target operands. Inside the loop the source operands are loaded, a switch-case selects the operation, and the result is stored to the target.

The building blocks can be simple functions (addition, multiplication, conditional jump) or complex operations (FFT, sort). You want few building blocks that are repeated again and again inside your previously large kernel, so that the emulator is very small.
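Here is a minimal CUDA sketch of such an emulator, assuming a toy instruction set (add, multiply, conditional jump, halt) and a tiny per-thread register file; all names and the instruction encoding are invented for illustration:

```cuda
enum VOp { V_ADD, V_MUL, V_JMPNZ, V_HALT };

struct VInstr {
    VOp op;    // which building block to execute
    int dst;   // target register index
    int src0;  // first source register index
    int src1;  // second source register index, or jump target for V_JMPNZ
};

__global__ void emulator(const VInstr* program, const float* in,
                         float* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    float reg[8] = {0};   // small addressable "register file"
    reg[0] = in[tid];     // r0 = this thread's input

    int vpc = 0;          // virtual instruction counter
    for (;;) {            // the one large loop; its machine code stays small
        VInstr ins = program[vpc++];   // fetch the next virtual instruction
                                       // over the data path (L2/dCache)
        switch (ins.op) {              // dispatch to a building block
        case V_ADD:   reg[ins.dst] = reg[ins.src0] + reg[ins.src1]; break;
        case V_MUL:   reg[ins.dst] = reg[ins.src0] * reg[ins.src1]; break;
        case V_JMPNZ: if (reg[ins.src0] != 0.0f) vpc = ins.src1;    break;
        case V_HALT:  out[tid] = reg[ins.dst];                      return;
        }
    }
}
```

The point is that the "program" lives in data memory and is fetched over the data path, while the emulator's own machine code is small enough to stay resident in the instruction cache.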

If done well, the overhead is still huge, but less than for a mega kernel.

One variant is that a warp cooperates on a single virtual instruction, either a more complex one or a SIMD one. This speeds up memory accesses and reduces warp divergence.
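A sketch of that warp-cooperative variant, reusing the hypothetical VInstr/VOp types from the previous sketch (and assuming blocks are launched as whole warps): lane 0 fetches each instruction and broadcasts it with __shfl_sync, and every lane then handles one element of a 32-wide virtual register, so all lanes take the same switch branch.

```cuda
__global__ void warp_emulator(const VInstr* program,
                              float* regs /* nregs x 32 floats per warp */)
{
    const int lane = threadIdx.x & 31;

    int vpc = 0;
    for (;;) {
        // Lane 0 fetches; the fields are broadcast to the rest of the warp,
        // so only one instruction fetch from memory is needed per step.
        VInstr ins = {V_HALT, 0, 0, 0};
        if (lane == 0) ins = program[vpc];
        ins.op   = (VOp)__shfl_sync(0xffffffff, (int)ins.op, 0);
        ins.dst  = __shfl_sync(0xffffffff, ins.dst, 0);
        ins.src0 = __shfl_sync(0xffffffff, ins.src0, 0);
        ins.src1 = __shfl_sync(0xffffffff, ins.src1, 0);
        vpc++;

        // Each lane processes one element of the 32-wide virtual registers,
        // so all 32 lanes take the same case of the switch (no divergence).
        float a = regs[ins.src0 * 32 + lane];
        float b = regs[ins.src1 * 32 + lane];
        switch (ins.op) {
        case V_ADD:  regs[ins.dst * 32 + lane] = a + b; break;
        case V_MUL:  regs[ins.dst * 32 + lane] = a * b; break;
        case V_HALT: return;
        default:     break;   // V_JMPNZ etc. omitted in this sketch
        }
    }
}
```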

The L0 instruction cache is 32KiB per processing block per SM, according to page 32 of this GTC presentation, with a further 128KiB shared with the constant cache across the SM.

If you’ve not already seen it, this blog post may be helpful.

As outlined in the above blog, excessive misses show up as “Stall No Instruction” entries.