I am reading here about the A100 memory.
With a new partitioned crossbar structure, the A100 L2 cache provides 2.3x the L2 cache read bandwidth of V100.
To optimize capacity utilization, the NVIDIA Ampere architecture provides L2 cache residency controls for you to manage data to keep or evict from the cache. A100 also adds Compute Data Compression to deliver up to an additional 4x improvement in DRAM bandwidth and L2 bandwidth, and up to 2x improvement in L2 capacity.
- Do you have similar items for the RTX 3090, since they are from the same architecture?
- What is the partitioned crossbar structure?
- Are programmers able to control the L2 cache on both?
Not my area of expertise. I would recommend a literature search. From what I understand, a partitioned N×N crossbar comprises (N/k)² k×k crossbars (plus assorted multiplexers, queuing, etc.), and in the ideal case bisection bandwidth increases by a factor of N/k. From vague memory: crossbars are very energy-hungry, and use of a partitioned crossbar may (conjecture!) somewhat alleviate that.
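As a quick arithmetic sketch of the partitioning described above (my own illustration; the values of N and k are arbitrary examples, not A100 specifics):

```python
# Illustration of the partitioned-crossbar arithmetic above.
# N = ports of the monolithic crossbar, k = ports of each sub-crossbar.
# These values are made up for illustration only.
N, k = 32, 8

num_subcrossbars = (N // k) ** 2   # (N/k)^2 smaller k×k crossbars
bisection_speedup = N // k         # ideal-case bisection-bandwidth gain

print(num_subcrossbars)    # 16
print(bisection_speedup)   # 4
```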
With rare exceptions, NVIDIA typically does not publicly describe microarchitectural components of their GPUs in detail. If there is something novel about the particular implementation used in the A100, it might be described in a patent or patent application. If you do find a relevant patent, consider carefully whether it would be a good idea to read it.
Check the GA102 whitepaper. For a description of the RTX 3090 cache (or architecture), I would look there first.
The programmer-accessible L2 controls are described in the CUDA programming guide. The L2 persistence mechanism applies to both cc8.0 (A100) and cc8.6 (GA102/RTX 3090).
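For reference, the persistence mechanism is exposed through the stream access-policy-window attribute. A minimal sketch (error checking omitted; `data` and `num_bytes` are placeholder names; a real application should also query `cudaDevAttrMaxPersistingL2CacheSize`):

```cuda
#include <cuda_runtime.h>

void enable_l2_persistence(cudaStream_t stream, void *data, size_t num_bytes)
{
    // Set aside a portion of L2 for persisting accesses (device-wide setting).
    cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, num_bytes);

    // Mark accesses to [data, data + num_bytes) as persisting on this stream.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = data;
    attr.accessPolicyWindow.num_bytes = num_bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // fraction of the window
                                               // treated as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);
}
```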
How does a memory with high frequency and low bandwidth behave compared to a memory with low frequency and high bandwidth?
Here the RTX 3090 memory frequency is 9751 MHz, versus 1215 MHz for the A100. Based on this characteristic alone, how can one analyze memory behavior?
The RTX 3090 has a bus width of 384 bits and a “memory clock” frequency of 9751 MHz. This leads to a calculated (peak theoretical) bandwidth of:
(384 bits / 8 bits per byte) × (9751 × 10^6 per second) × 2 (double data rate) = 9.36 × 10^11 bytes/s = 936 GB/s
which is the generally reported number.
The A100 has a bus width of 5120 bits and a “memory clock” frequency of 1215 MHz. Using the same methodology, the calculated peak theoretical bandwidth is:
(5120 bits / 8 bits per byte) × (1215 × 10^6 per second) × 2 (double data rate) = 1.555 × 10^12 bytes/s
which is the generally reported number (for the 40 GB model).
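The two calculations above can be reproduced with a small helper (my own sketch; the bus widths and clocks are the figures quoted above):

```python
def peak_bandwidth_gbs(bus_width_bits: int, mem_clock_mhz: float) -> float:
    """Peak theoretical bandwidth in GB/s for a double-data-rate memory bus."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_sec = mem_clock_mhz * 1e6 * 2   # ×2 for double data rate
    return bytes_per_transfer * transfers_per_sec / 1e9

print(round(peak_bandwidth_gbs(384, 9751), 1))   # RTX 3090  -> 936.1
print(round(peak_bandwidth_gbs(5120, 1215), 1))  # A100 40GB -> 1555.2
```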
Beyond that I don’t know how to answer the question:
I would say that the RTX 3090 has approximately 2/3 the bandwidth of the A100. To a first-order approximation, bandwidth is bandwidth. As far as I know, NVIDIA does not provide any further information about how to differentiate or compare the “type” of bandwidth on the RTX 3090 with the “type” of bandwidth on the A100, nor do any NVIDIA-provided tools differentiate in this way.
You could also say that the A100 provides ECC protection, whereas the RTX 3090 does not, but that doesn’t really say anything about bandwidth.
I would assume that turning on ECC reduces the bandwidth available to applications by 6.25%, due to the in-band implementation of ECC (as on earlier GPU architectures), but I am not going to search for the relevant text in NVIDIA’s docs.
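To make the 6.25% figure concrete (my own back-of-the-envelope arithmetic, assuming in-band ECC reserves 1/16 of the transferred data, as on earlier GDDR-based GPUs; the 936 GB/s peak from above is used purely as an example number):

```python
ecc_overhead = 1 / 16   # in-band ECC: one sixteenth reserved -> 6.25%
peak_gbs = 936.0        # example peak theoretical bandwidth (GB/s)

usable_gbs = peak_gbs * (1 - ecc_overhead)
print(f"{ecc_overhead:.2%}")   # 6.25%
print(usable_gbs)              # 877.5
```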
Differences between a GDDR6 and an HBM2 memory subsystem should, to first order, not have an impact on CUDA programmers. If someone has information to the contrary, pointers to relevant publications would be most welcome.
Latency of an unloaded memory interface is likely lower with GDDR6 by virtue of construction (a direct link versus use of an interposer). But when maximizing bandwidth usage, latency tends to suffer, so the higher overall bandwidth of an HBM2 memory subsystem will likely reduce the latency impact of fully loading the memory interface compared to GDDR6; it is somewhat of a wash. It should also be noted that GPU designs generally emphasize high throughput rather than low latency: tens of thousands of threads are in flight to cover latencies all across the hardware.
The advantage of HBM2 (besides a reasonable increase in total bandwidth) is mostly that it allows for a significant reduction in the physical size and power consumption of the memory I/O blocks on the GPU, which makes for a higher GFLOPS/W ratio. The disadvantage is mostly cost: the memory chips themselves, the additional pins required, the interposer, and the additional engineering effort.
HBM2 memory does not use in-band methods to implement ECC (to a first-order approximation). Therefore there is no corresponding reduction in bandwidth, as there was with ECC implementations on top of GDDR memory, and the general advice is to leave ECC on all the time for these types of GPUs: there is no downside (to a first-order approximation).
One reference for this is here
HBM2 memories, on the other hand, provide dedicated ECC resources, allowing overhead-free ECC protection.[4]
I used “to a first order approximation” which provides some allowance for the linked note:
[4] As an exception, scattered writes to HBM2 see some overhead from ECC, but much less than the overhead with similar access patterns on ECC-protected GDDR5 memory.
And since ECC cannot be enabled on GeForce GPUs (with few exceptions, and the RTX 3090 is not one of them), I felt it was appropriate to put it that way.
Thanks for the clarification regarding ECC; I did not know that HBM2 provides dedicated resources for this.