Any information on GPU on-die memory architecture?

I’ll be teaching a class on parallel computing this fall, and wanted to include GPGPUs in it. So I’m trying to learn the subject myself :-).

I’ve found no information at all about how GPUs connect their memory to their cores. So for, e.g., Pascal, you have 8 memory controllers; each controls .5MB of L2 cache, and each pair controls one HBM2 stack.

Questions: what does it mean for each memory controller to also control .5MB of L2? Do the 8 pieces of L2 each own a non-overlapping part of the GPU physical-address space? May two pieces of L2 each own the same line at any given time? Is there an equivalent of MESI somewhere, or is this all non-coherent?

In addition to the 8 memory controllers, there are 60 streaming multiprocessors (SMs).

Questions: how are the memory controllers, caches and SMs interconnected? Is there a ring-based structure? A subset of a crossbar? Something else? Link-based interconnect? Is this information purposely not disclosed?



First, let me say that AFAIK, none of this is published at the level of detail you are asking for. So it’s possible I may make some errors.

In general, GPU DRAM is partitioned, and each memory controller handles one DRAM partition. A translation maps the GPU logical linear memory space onto physical locations, defining where each byte comes from (i.e. which partition, which segment within that partition, and where in that segment).

As far as I know, this translation scheme is not published, and in some cases it changes from one GPU architecture to the next. With sufficient reverse engineering it might be discoverable, but as far as I know that has become harder with Fermi and later GPUs.

A given byte in the GPU logical (programmer’s perspective) linear address space therefore maps to a single partition. Each portion of L2 cache in each memory controller is only responsible for handling the L2 requests for data associated with that partition. Therefore, it is not possible for a data item to belong to a line in more than one L2 “piece”.
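To make the "one address, one partition" point concrete, here is a minimal sketch of a made-up interleaving scheme. To be clear, this is not the real mapping (which, as noted, is unpublished): the 128-byte granularity, the simple modulo interleave, and the names `partition_of` and `offset_in_partition` are all assumptions chosen purely for illustration. The only figure taken from this thread is the count of 8 memory controllers.

```python
# Toy model only: the real GPU address-to-partition mapping is unpublished.
NUM_PARTITIONS = 8   # memory controllers / L2 pieces (Pascal figure from the question)
LINE_BYTES = 128     # ASSUMED interleave granularity, picked for illustration


def partition_of(addr: int) -> int:
    """Return the single partition (and hence L2 piece) that owns this address."""
    return (addr // LINE_BYTES) % NUM_PARTITIONS


def offset_in_partition(addr: int) -> int:
    """Return the byte offset of this address inside its owning partition."""
    line = addr // LINE_BYTES
    return (line // NUM_PARTITIONS) * LINE_BYTES + addr % LINE_BYTES


if __name__ == "__main__":
    # Consecutive 128-byte chunks rotate across the 8 partitions.
    for addr in (0, 128, 1024, 1152):
        print(addr, partition_of(addr), offset_in_partition(addr))
```

Because `partition_of` is a pure function of the address, every byte has exactly one owning partition, so in this toy model no line can ever be held by two L2 pieces at once. A real GPU presumably uses something less regular than a plain modulo, but the ownership argument is the same.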

So the answer to your second question (do the 8 pieces of L2 each own a non-overlapping part of the address space?) is “Yes”, and the answer to the third (may two pieces of L2 own the same line at the same time?) is “No”. Although an L2 “line” at this level does fit the definition of “line”, it may not match what you have in mind for how that line maps back to the logical linear address space.

For all of your questions, the information is not made available in anything like a hardware description that I have ever come across. It’s obvious to me you’ve already read the Pascal whitepaper, and I don’t know of a more comprehensive description than that. Occasionally you will find NVIDIA people more knowledgeable than I sharing some details informally, and you may come across reverse-engineering technical papers that shed some light on it.

Thanks for the info. I kind of suspected that the information was probably not released, but now I know for sure :-)


Hi Joel,

I know this is a late reply, but you might find all the information you are seeking via this paper ( (Open source link: )

Approximately half of the paper deals with reverse engineering NVIDIA GPUs, taking the Pascal and Volta architectures as examples (specifically the GTX 1080, GTX 1070 and Tesla V100). It goes specifically into the mapping between L2 cache and DRAM, and reveals various other details found via reverse engineering. Look at the sections “Reverse-Engineering of DRAM Bank Addressing”, “Reverse-Engineering of L2 Cache set Addressing” and “GPU Memory Hierarchy”.
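For readers new to this kind of reverse engineering, the general idea of probing a hidden address mapping can be shown with a small simulation. This is not the paper's code or method, just a generic toy: `HIDDEN_MASKS` is an invented XOR-style hash standing in for the unknown hardware function, and `same_partition` stands in for a hardware probe. On a real GPU that yes/no answer would have to be inferred indirectly (for example from access timing), which is not modeled here.

```python
# Toy simulation: nothing here reflects a real GPU's mapping.
# HIDDEN_MASKS is a MADE-UP XOR hash; each mask yields one bit of the partition index.
HIDDEN_MASKS = [0b0010_0100_0000, 0b0100_1000_0000, 0b1001_0000_0000]


def parity(x: int) -> int:
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def hidden_partition(addr: int) -> int:
    """The 'hardware' function we pretend not to know."""
    return sum(parity(addr & mask) << i for i, mask in enumerate(HIDDEN_MASKS))


def same_partition(a: int, b: int) -> bool:
    """Hypothetical probe: do two addresses land in the same partition?"""
    return hidden_partition(a) == hidden_partition(b)


def bits_affecting_partition(num_bits: int = 12) -> list:
    """Flip one address bit at a time and record which flips change the partition."""
    return [bit for bit in range(num_bits) if not same_partition(0, 1 << bit)]


if __name__ == "__main__":
    print(bits_affecting_partition())
```

The single-bit trick works here only because the toy hash is linear (XOR of address bits); it finds which bits matter, not how they are combined. Recovering the full function takes more probes, such as flipping pairs of bits and checking which pairs cancel out.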

If you are further interested, the code used for reverse engineering is available at

Hope this information will be helpful for you. (Hopefully you are still teaching the course on parallel computing.)

If you have any questions, you can contact me (my email address is in the paper). I welcome all feedback.

Thanks, Saksham. I’ll read your paper & let you know if I have any questions.