Any information on GPU on-die memory architecture?

JoelGrodstein · July 23, 2017, 4:52pm

I’ll be teaching a class on parallel computing this fall, and wanted to include GPGPUs in it. So I’m trying to learn the subject myself :-).

I’ve found no information at all about how GPUs connect their memory to their cores. So for, e.g., Pascal, you have 8 memory controllers; each controls .5MB of L2 cache, and each pair controls one HBM2 stack.

Questions: what does it mean for each memory controller to also control .5MB of L2? Do the 8 pieces of L2 each own a non-overlapping part of the GPU physical-address space? May two pieces of L2 each own the same line at any given time? Is there an equivalent of MESI somewhere, or is this all non-coherent?

In addition to the 8 memory controllers, there are 60 streaming multiprocessors (SMs).

Questions: how are the memory controllers, caches and SMs interconnected? Is there a ring-based structure? A subset of a crossbar? Something else? Link-based interconnect? Is this information purposely not disclosed?

Thanks,

/Joel

Robert_Crovella · July 23, 2017, 7:48pm

First, let me say that AFAIK, none of this is published to the level of detail that you are asking. So it’s possible I may make some errors.

In general, in GPUs, DRAM is partitioned. Each memory controller handles a DRAM partition. There is a translation from GPU logical linear memory space to a map that defines where each byte comes from (i.e. which partition, and from which segment and where in that segment, in that partition.)

As far as I know, this translation scheme is not published, and in some cases it changes from one GPU architecture to the next. With sufficient reverse engineering it might be discoverable, but as far as I know, discovering it has become harder on Fermi and later GPUs.

A given byte in the GPU logical (programmer’s perspective) linear address space therefore maps to a single partition. Each portion of L2 cache in each memory controller is only responsible for handling the L2 requests for data associated with that partition. Therefore, it is not possible for a data item to belong to a line in more than one L2 “piece”.
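
To make the partitioning idea concrete, here is a minimal sketch in Python. It is not the real mapping, which is unpublished: the 256-byte interleave, the plain modulo rotation, and the names `partition_of`, `offset_in_partition`, `decode` and `encode` are all assumptions made up for illustration, and real hardware may well use a more elaborate (for example hashed) scheme. The only property it is meant to show is that each address decodes to exactly one partition, so only one L2 piece can ever hold it.

```python
# Hypothetical parameters, chosen only for illustration.
# The real address-to-partition mapping is not published.
NUM_PARTITIONS = 8       # one per memory controller, as in the Pascal example
INTERLEAVE_BYTES = 256   # assumed size of the chunk rotated across partitions


def partition_of(addr: int) -> int:
    """Return the single partition (and so the single L2 piece) owning addr."""
    return (addr // INTERLEAVE_BYTES) % NUM_PARTITIONS


def offset_in_partition(addr: int) -> int:
    """Return where addr lands inside its owning partition."""
    chunk = addr // INTERLEAVE_BYTES
    return (chunk // NUM_PARTITIONS) * INTERLEAVE_BYTES + addr % INTERLEAVE_BYTES


def decode(addr: int) -> tuple[int, int]:
    """Linear address -> (partition, offset within that partition)."""
    return partition_of(addr), offset_in_partition(addr)


def encode(partition: int, offset: int) -> int:
    """Inverse of decode: (partition, offset) -> linear address."""
    chunk_in_partition = offset // INTERLEAVE_BYTES
    chunk = chunk_in_partition * NUM_PARTITIONS + partition
    return chunk * INTERLEAVE_BYTES + offset % INTERLEAVE_BYTES
```

Because `decode` and `encode` are inverses in this toy model, the mapping is one-to-one: no byte can belong to two partitions, which is why no coherence protocol between the L2 pieces is needed under this assumption.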

So the answer to your second question (do the 8 pieces of L2 each own a non-overlapping part of the address space?) is “Yes”, and the answer to the third (may two pieces of L2 own the same line at the same time?) is “No”. Although an L2 “line” at this level does fit the definition of “line”, it may not match what you have in your head for what that line represents in terms of the mapping from logical linear space.

For all of your questions, the information is not made available in anything like a hardware description that I have ever come across. It’s obvious to me you’ve already read the Pascal whitepaper, and I don’t know of a more comprehensive description than that. Occasionally you will find NVIDIA people more knowledgeable than I sharing some details informally, and you may come across reverse-engineering technical papers that shed some light on it.

JoelGrodstein · July 24, 2017, 1:09pm

Thanks for the info. I kind of suspected that the information was probably not released, but now I know for sure :-)

/Joel

sakjain92 · August 27, 2019, 5:05am

Hi Joel,

I know this is a late reply, but you might find the information you are seeking in this paper: “Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs” (IEEE Xplore; open-access copy: http://www.andrew.cmu.edu/user/sakshamj/papers/FGPU_RTAS_2019_Fractional_GPUs_Software_based_Compute_and_Memory_Bandwidth_Reservation_for_GPUs.pdf ).

Approximately half of the paper deals with how to reverse engineer NVIDIA GPUs, taking the Pascal and Volta architectures as examples (specifically the GTX 1080, GTX 1070 and Tesla V100). It goes into the mapping between L2 cache and DRAM in particular, and reveals various other details via reverse engineering. Look at the sections “Reverse-Engineering of DRAM Bank Addressing”, “Reverse-Engineering of L2 Cache set Addressing” and “GPU Memory Hierarchy”.

If you are further interested, the code used for the reverse engineering is available on GitHub at sakjain92/Fractional-GPUs, which splits a single NVIDIA GPU into multiple partitions with complete compute and memory isolation (with respect to performance) between the partitions.

I hope this information is helpful (and hopefully you are still teaching the course on parallel computing).

If you have any questions, you can contact me (my email address is in the paper). I welcome all feedback.

joel.grodstein · August 28, 2019, 8:06pm

Thanks, Saksham. I’ll read your paper & let you know if I have any questions.

/Joel
