RTX 3090 & A100 memory frequency

With `nvidia-smi -q` I am getting information about GPU frequencies. For the RTX 3090 it is:

    Graphics                          : 1695 MHz
    SM                                : 1695 MHz
    Memory                            : 9751 MHz
    Video                             : 1485 MHz

It seems that the Memory frequency is a crazy number. Is it correct (9751 MHz)?

Yet the spec sheet lists it as around 19.5 GHz:

    Memory Size                       : 24 GB GDDR6X
    Memory Clock Effective            : 19500 MHz

For the A100 I am seeing:

    Graphics                          : 765 MHz
    SM                                : 765 MHz
    Memory                            : 1215 MHz
    Video                             : 705 MHz

It seems slow!

For GDDR memory, generally the “pumped” nature (relative to DDR) is included in that “clock”. For HBM2 memory, the bus behavior is fundamentally different. The bandwidth of HBM2 memory is something like clock x 2 x 1024 bits/stack x number of stacks. The bandwidth of GDDR memory is something like clock x 2 x bus width. So the calculations are similar (bits per stack x number of stacks is effectively bus width), and the GDDR “clock” is adjusted by the bits per transfer (“pumped”) to allow for a similar calculation.

The RTX 3090 (not Ti) is reported to have ~936 GB/s of total bandwidth and a 384-bit-wide bus (384 “lanes”), with 19.5 Gb/s per bus lane. This 19.5 Gb/s number is exactly twice the 9751 MHz reported memory “clock”. So the 9751 MHz number is what the clock would be if it were dealing with DDR (“double-pumped”) memory of that width to deliver that bandwidth. However, the actual memory clock is reported (e.g. by tools like GPU-Z) as 1219 MHz, so the actual delivery of bits per lane is 8x what you would expect for DDR, and that 8x factor is the multiplier that takes 1219 MHz to the 9751 MHz you are seeing.

384 bits x 2 x 9751 MHz / 8 bits/byte = ~936 GB/s
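The GDDR6X arithmetic above can be sketched as follows (values from `nvidia-smi` and the spec sheet quoted earlier; variable names are my own):

```python
bus_width_bits = 384          # bus width ("lanes")
reported_clock_mhz = 9751     # DDR-equivalent "clock" reported by nvidia-smi
actual_clock_mhz = 1219       # actual memory clock
bits_per_clock_per_lane = 16  # effective bits per clock per lane for GDDR6X

# Per-lane data rate, computed two equivalent ways:
gbps_per_lane = 2 * reported_clock_mhz / 1000                   # ~19.5 Gb/s
gbps_per_lane_alt = bits_per_clock_per_lane * actual_clock_mhz / 1000

# Total bandwidth: bus width x data rate / 8 bits per byte
bandwidth_gb_s = bus_width_bits * 2 * reported_clock_mhz / 8 / 1000  # ~936 GB/s
```

Both per-lane expressions land on ~19.5 Gb/s, which is the consistency check between the two reported "clocks".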

For HBM2:

5 stacks x 1024 bits/stack x 2 x 1215 MHz / 8 bits/byte = ~1.555 TB/s (this is the 40GB A100 number; the 80GB A100 SKU has higher memory clocks and therefore higher memory bandwidth)
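The same arithmetic for the A100-40GB HBM2 memory system, using the stack count and clock from above (names are my own):

```python
stacks = 5              # HBM2 stacks on A100-40GB
bits_per_stack = 1024   # bus width per stack
clock_mhz = 1215        # memory clock reported by nvidia-smi

# clock x 2 (DDR) x total bus width / 8 bits per byte, scaled to TB/s
bandwidth_tb_s = stacks * bits_per_stack * 2 * clock_mhz / 8 / 1e6  # ~1.555 TB/s
```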

Stated another way, the GDDR memory system appears to transfer many bits per clock per lane (16, in this case) whereas the HBM2 memory appears to transfer only 2 bits per clock per lane. The reason the two bus bandwidths are in the same ballpark, for almost the same clock, is that the HBM2 memory bus is ~13x wider.

Side note: the tradeoffs of the two design styles, wide & slow for HBM and narrow & fast for GDDR, are significantly higher power consumption to drive the memory I/O pins in the case of GDDR (beefy transistors are needed to switch fast, and these also require more die area) versus significantly higher cost for HBM (among other things, every “pin” costs money).