I am reading here about the A100 memory.
With a new partitioned crossbar structure, the A100 L2 cache provides 2.3x the L2 cache read bandwidth of V100.
To optimize capacity utilization, the NVIDIA Ampere architecture provides L2 cache residency controls for you to manage data to keep or evict from the cache. A100 also adds Compute Data Compression to deliver up to an additional 4x improvement in DRAM bandwidth and L2 bandwidth, and up to 2x improvement in L2 capacity.
- Do you have similar items for the RTX 3090, since they are from the same architecture?
- What is the partitioned crossbar structure?
- Are programmers able to control the L2 cache on both?
Not my area of expertise. I would recommend a literature search. From what I understand, a partitioned N×N crossbar comprises (N/k)² k×k crossbars (plus assorted multiplexers, queuing, etc.), and in the ideal case bisection bandwidth increases by a factor of N/k. From vague memory: crossbars are very energy-hungry, and use of a partitioned crossbar may (conjecture!) somewhat alleviate that.
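As a quick arithmetic sketch of the partitioning described above (my own illustration; the values of N and k are arbitrary examples, not A100 specifics):

```python
# Illustration of the partitioned-crossbar arithmetic above.
# N = ports of the monolithic crossbar, k = ports of each sub-crossbar.
# These values are made up for illustration only.
N, k = 32, 8

num_subcrossbars = (N // k) ** 2   # (N/k)^2 smaller k×k crossbars
bisection_speedup = N // k         # ideal-case bisection-bandwidth gain

print(num_subcrossbars)    # 16
print(bisection_speedup)   # 4
```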
With rare exceptions, NVIDIA typically does not publicly describe microarchitectural components of their GPUs in detail. If there is something novel about the particular implementation used in the A100, it might be described in a patent or patent application. If you do find a relevant patent, consider carefully whether it would be a good idea to read it.
Check the GA102 whitepaper. For a description of the RTX 3090 cache (or architecture), I would look there first.
The programmer-accessible L2 controls are described in the CUDA programming guide. The L2 persistence mechanism applies to both cc8.0 (A100) and cc8.6 (GA102/RTX 3090).
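For reference, the persistence mechanism is exposed through the stream access-policy-window attribute. A minimal sketch (error checking omitted; `data` and `num_bytes` are placeholder names; a real application should also query `cudaDevAttrMaxPersistingL2CacheSize`):

```cuda
#include <cuda_runtime.h>

void enable_l2_persistence(cudaStream_t stream, void *data, size_t num_bytes)
{
    // Set aside a portion of L2 for persisting accesses (device-wide setting).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, num_bytes);

    // Mark accesses to [data, data + num_bytes) as persisting on this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = num_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // fraction of the window
                                               // treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```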
How does a memory with high frequency and low bandwidth behave compared to a memory with low frequency and high bandwidth?
Here the RTX 3090 memory frequency is 9751 MHz, versus 1215 MHz for the A100. Based on this characteristic alone, how can one analyze memory behavior?
The RTX 3090 has a bus width of 384 bits and a “memory clock” frequency of 9751 MHz. This leads to a calculated (peak theoretical) bandwidth of:
(384 bits / 8 bits per byte) × (9751 × 10^6 per second) × 2 (double data rate) = 9.36 × 10^11 bytes/s = 936 GB/s
which is the generally reported number.
The A100 has a bus width of 5120 bits and a “memory clock” frequency of 1215 MHz. Using the same methodology, the calculated peak theoretical bandwidth is:
(5120 bits / 8 bits per byte) × (1215 × 10^6 per second) × 2 (double data rate) = 1.555 × 10^12 bytes/s
which is the generally reported number (for the 40 GB model).
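The two calculations above can be reproduced with a small helper (my own sketch; the bus widths and clocks are the figures quoted above):

```python
def peak_bandwidth_gbs(bus_width_bits: int, mem_clock_mhz: float) -> float:
    """Peak theoretical bandwidth in GB/s for a double-data-rate memory bus."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_sec = mem_clock_mhz * 1e6 * 2   # ×2 for double data rate
    return bytes_per_transfer * transfers_per_sec / 1e9

print(round(peak_bandwidth_gbs(384, 9751), 1))   # RTX 3090  -> 936.1
print(round(peak_bandwidth_gbs(5120, 1215), 1))  # A100 40GB -> 1555.2
```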
Beyond that I don’t know how to answer the question:
I would say that the RTX 3090 has approximately 2/3 the bandwidth of the A100. To a first-order approximation, bandwidth is bandwidth. As far as I know, NVIDIA does not provide any further information about how to differentiate or compare the “type” of bandwidth on the RTX 3090 with the “type” of bandwidth on the A100, nor do any NVIDIA-provided tools differentiate in this way.
You could also say that the A100 provides ECC protection, whereas the RTX 3090 does not, but that doesn’t really say anything about bandwidth.
I would assume that turning on ECC reduces the bandwidth available to applications by 6.25%, due to the in-band implementation of ECC (as on earlier GPU architectures), but I am not going to search for the relevant text in NVIDIA’s docs.
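To make the 6.25% figure concrete (my own back-of-the-envelope arithmetic, assuming in-band ECC reserves 1/16 of the transferred data, as on earlier GDDR-based GPUs; the 936 GB/s peak from above is used purely as an example number):

```python
ecc_overhead = 1 / 16   # in-band ECC: one sixteenth reserved -> 6.25%
peak_gbs = 936.0        # example peak theoretical bandwidth (GB/s)

usable_gbs = peak_gbs * (1 - ecc_overhead)
print(f"{ecc_overhead:.2%}")   # 6.25%
print(usable_gbs)              # 877.5
```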
Differences between a GDDR6 and an HBM2 memory subsystem should, to first order, not have an impact on CUDA programmers. If someone has information to the contrary, pointers to relevant publications would be most welcome.
Latency of an unloaded memory interface is likely lower with GDDR6 by virtue of construction (a direct link versus use of an interposer). But when maximizing bandwidth usage, latency tends to suffer, so the higher overall bandwidth of an HBM2 memory subsystem will likely reduce the latency impact of fully loading the memory interface compared to GDDR6; it is somewhat of a wash. It should also be noted that GPU designs generally emphasize high throughput rather than low latency: tens of thousands of threads are in flight to cover latencies all across the hardware.
The advantage of HBM2 (besides a reasonable increase in total bandwidth) is mostly that it allows for a significant reduction in the physical size and power consumption of the memory I/O blocks on the GPU, which makes for a higher GFLOPS/W ratio. The disadvantage is mostly cost: the memory chips themselves, the additional pins required, the interposer, and the additional engineering effort.
HBM2 memory does not use in-band methods to implement ECC (to a first-order approximation). Therefore there is no corresponding reduction in bandwidth, as there was with ECC implementations on top of GDDR memory, and the general advice is to leave ECC on all the time for these types of GPUs: there is no downside (to a first-order approximation).
One reference for this is here
HBM2 memories, on the other hand, provide dedicated ECC resources, allowing overhead-free ECC protection.[4]
I used “to a first order approximation” which provides some allowance for the linked note:
[4] As an exception, scattered writes to HBM2 see some overhead from ECC, but much less than the overhead with similar access patterns on ECC-protected GDDR5 memory.
And since ECC cannot be enabled on GeForce GPUs (with few exceptions, and the RTX 3090 is not one of them), I felt it was appropriate to put it that way.
Thanks for the clarification regarding ECC; I did not know that HBM2 provides dedicated resources for this.