code instruction cache?

wlangdon · July 30, 2015, 1:01pm

A quick hunt did not reveal anything recent or definitive on the GPU code cache.
For example do we know how big it is?
For a kernel source of a few hundred lines, is it all going to fit into the i-cache?
I guess system library routines must also fit into the i-cache?
Is there just one or does each SMX have its own?
Am I right in assuming the cache for programs is totally separate from the various
data caches? Or are there opportunities for trading code and data caches against
each other?

Thank you

Bill

Robert_Crovella · July 30, 2015, 2:59pm

There is an instruction cache. Each SM has one. The details of it are unpublished, AFAIK, which is why you’re having trouble locating the details. The instruction cache is depicted as a separate entity that is a per-SM resource, for example on p.8 of the Fermi whitepaper:

http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf

but there is essentially no mention of it elsewhere in the document.

Clochette · July 30, 2015, 3:06pm

This might help you or it might not.

http://www.stuffedcow.net/research/cudabmk
http://www.eecg.toronto.edu/~myrto/gpuarch-ispass2010.pdf

Dude1205 · July 30, 2015, 3:12pm

Does anyone know the effect of running different kernels concurrently on the instruction cache?
Can two blocks from different kernels be scheduled on the same SM at the same time?

Thanks.

little_jimmy · July 30, 2015, 3:55pm

there is a per sm limit on the number of concurrent blocks; there is a per device limit on the number of concurrent kernels; both are stated in the programming guide

if i am not mistaken, (some of) the kernel code is also loaded into global memory
so the instruction cache may likely be just sufficient to cache instruction loading from global memory, given the max number of blocks per sm

allanmac · July 30, 2015, 3:55pm

This gem of a paper discusses the effect of uber-kernels on the instruction cache (Figure 9):

Singe: Leveraging Warp Specialization for High Performance on GPUs

Note that the paper might be conflating max # of resident blocks per SM with i-cache limitations.

NVIDIA knows for sure. :)

njuffa · July 30, 2015, 4:21pm

The size of the per-SM instruction cache can be determined through a microbenchmark that uses a loop of increasing size: there is a small but measurable drop in execution speed once the loop body exceeds the Icache size. I performed such an experiment in the past, and from my recollection the Icache size was 4 KB, but I don’t recall what part I measured on (most likely a K20) and the size may easily differ between different architectures.

The GPU instructions in general are 8 bytes long, and for a Maxwell (sm_5x) architecture one can easily see from a disassembled binary that there are an additional 8 bytes of control information added for every three actual instructions. Each group of three instructions therefore occupies 32 bytes, so a 4 KB Icache would hold 128 such groups, or 384 instructions, for an sm_5x part. In light of aggressive inlining by the compiler, loop bodies for various real-life scenarios can exceed this size. In my (pre-Maxwell) experience the performance penalty on a loop that exceeds the Icache was never larger than about 3%.

So I personally do not worry about Icache misses. As with other stall events on the GPU a large number of threads running with zero-overhead context switch are generally able to cover the latency well.

It is unclear what kind of trade-offs between Icache and Dcache you are thinking of. Something like switch statements versus function pointers? Recomputation versus lookup tables? There are other mechanisms that impact those decisions that are likely higher impact, such as branch divergence and serialization.

Robert_Crovella · July 30, 2015, 4:22pm

Yes, blocks from different kernels can be resident on the same SM at the same time. If you want a reference for this affirmation, take a look at slide 19 in this presentation:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

“Warps can come from different threadblocks and different concurent kernels”

Dude1205 · July 30, 2015, 4:31pm

@allanmac This is a very interesting paper indeed, thanks.

njuffa · July 30, 2015, 5:14pm

@allanmac: I think it is highly unlikely that Sean Treichler would conflate different architecture mechanisms in a GPU :-) https://research.nvidia.com/users/sean-treichler

allanmac · July 30, 2015, 5:21pm

Ha!

Then the authors should be less coy about what’s actually happening under the hood!

:)

scottgray · July 31, 2015, 12:25am

I’ll back up njuffa here. I also did as he did but with my assembler controlling the exact number and type of instructions (I also now use this code to probe the hardware on a number of implementation details). I measured it on Maxwell (sm_50, I haven’t measured sm_52 yet) as being 8K. That means you can have a stall-less loop of 768 instructions with 256 control codes in between. Though if you want to avoid instruction cache stalls your loop will likely have to be smaller because the start of the loop is probably not going to be aligned with the starting address of the cache.

Typically it’s not something you need to be worried about unless you’re working at very low occupancy (which I happen to do a lot).
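
For anyone who wants to check the arithmetic quoted in this thread, here is a minimal C++ sketch. It assumes the sm_5x layout described above (8-byte instructions, one 8-byte control word per group of three) and ignores loop alignment entirely. The names `IcacheCapacity` and `icache_capacity` are made up for illustration and are not part of any NVIDIA tool or API.

```cpp
#include <cstddef>

// Assumed sm_5x encoding, as described in the thread: every instruction is
// 8 bytes, and one 8-byte control word accompanies each group of three.
constexpr std::size_t kInstrBytes = 8;
constexpr std::size_t kInstrsPerGroup = 3;
constexpr std::size_t kGroupBytes = (kInstrsPerGroup + 1) * kInstrBytes;  // 32

// Hypothetical helper type: how much code fits in a cache of a given size.
struct IcacheCapacity {
    std::size_t instructions;   // real instructions that fit
    std::size_t control_words;  // control words stored alongside them
};

// Number of whole 32-byte groups that fit, converted to instruction and
// control-word counts. Partial groups are dropped by the integer division.
constexpr IcacheCapacity icache_capacity(std::size_t cache_bytes) {
    return IcacheCapacity{(cache_bytes / kGroupBytes) * kInstrsPerGroup,
                          cache_bytes / kGroupBytes};
}

// The two sizes reported in this thread: 4 KB (recollection, likely K20-era)
// and 8 KB (measured on sm_50).
static_assert(icache_capacity(4096).instructions == 384, "4 KB case");
static_assert(icache_capacity(8192).instructions == 768, "8 KB case");
static_assert(icache_capacity(8192).control_words == 256, "8 KB case");
```

These are upper bounds under the stated assumptions; a loop that does not start at a cache-aligned address would fit fewer instructions without stalling.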

njuffa · July 31, 2015, 12:31am

Disassembly of code compiled for sm_5x suggests that the CUDA compiler may make an effort to align loops (by inserting a bunch of NOPs) on that architecture, although I have not experimented extensively enough to say how reliably it does so or under what conditions. When in doubt, cuobjdump is your friend. Good to hear the size of the instruction cache was bumped to 8K.
