GPU bandwidth

Hi,

I would like to know the bandwidth for loading data from GPU DRAM to shared memory, and the bandwidth for loading data from DRAM to registers.

For example, on RTX 3090, which is higher?

Can we profile these two stats via Nsight Compute or some other simple method?

Apologies, the morning coffee hadn’t kicked in and I misread your post, hence deleted.

I wouldn’t expect either one to be higher. In the ordinary case, they both follow roughly the same path:

DRAM->L2->L1->register->shared

The only difference is the last step, which should not have a meaningful effect on observable bandwidth, assuming no other activity on the GPU and sensible coding.
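A minimal sketch of that path (the kernel name, tile size, and data type are illustrative, not from the thread): each thread loads one element from global memory, which lands in a register, and then stores it to shared memory as the final step.

```cuda
// Sketch only: stages a tile per block from global memory (DRAM->L2->L1)
// through a register into shared memory, then uses it.
__global__ void stage_to_shared(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];            // assumes blockDim.x == 256

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = g_in[idx];               // DRAM->L2->L1->register
        tile[threadIdx.x] = v;             // register->shared (the extra step)
    }
    __syncthreads();

    // ... do work that reads tile[] ...
    if (idx < n)
        g_out[idx] = tile[threadIdx.x];
}
```

Measuring either variant (with or without the final shared store) in Nsight Compute should report essentially the same DRAM throughput.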

Nsight Compute has a memory chart as well as a memory report section that you may want to become familiar with.

Of course, if you have significant “other” usage of shared memory in your code, then the transfer of data to shared memory could become a bottleneck. But considering only the register->shared step on the RTX 3090, it should support (with 64- or 128-bit per-thread stores) up to 128 bytes per SM per clock, which exceeds the DRAM bandwidth on the RTX 3090, so the shared store at the end should not be a performance limiter by itself. (&)

Likewise, if you have significant register pressure in your code, then the register footprint needed to move the data to shared memory could be an issue. In that case, you could investigate an async global->shared copy, which is a new feature in Ampere and discussed in many forum posts.

(&) The math:
The RTX 3090 has 82 SMs (you can confirm with deviceQuery) and a base clock of 1395 MHz. The reported DRAM bandwidth is 936 GB/s. The register->shared bandwidth would be 128 bytes x 82 SMs x 1.395 GHz = 14,642 GB/s aggregate, device-wide. This far exceeds the 936 GB/s of DRAM bandwidth, so DRAM bandwidth should be the limiting factor in both cases. Even if we drop the per-thread store quantity to 32 bits, this only halves the register->shared bandwidth (on GA10x), and that bandwidth still far exceeds DRAM bandwidth.

cuda::memcpy_async on GA10x supports global->shared without going through the register file. The blog Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture provides more information.
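A sketch of the async-copy variant using the cooperative groups wrapper (CUDA 11+; kernel name and tile size are illustrative). On GA10x the copy bypasses the register file; on older architectures it falls back to a synchronous copy:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: assumes blockDim.x == 256 and n is a multiple of 256,
// so the per-block copy never runs past the end of g_in.
__global__ void async_stage_to_shared(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    int base = blockIdx.x * blockDim.x;
    // Asynchronous global->shared copy, hardware-accelerated on GA10x.
    cg::memcpy_async(block, tile, g_in + base, sizeof(float) * blockDim.x);
    cg::wait(block);                       // wait until the copy is visible

    int idx = base + threadIdx.x;
    if (idx < n)
        g_out[idx] = tile[threadIdx.x];
}
```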

A well-written benchmark that copies from all SMs and has sufficient threads to hide latency should be able to copy global->RF or global->shared at 80-95% of the L2 or DRAM bandwidth, depending on where the data resides. For example, if the operation prior to the kernel was a cudaMemcpy HtoD of the data you want in shared memory, then the test program may be able to exceed DRAM bandwidth if the data is resident in the L2. When using NCU, the default setting --cache-control all will invalidate the L2 prior to profiling the grid.

On a 3090 the maximum bandwidth will start to fall as the number of active SMs decreases (e.g. if you launch fewer thread blocks than there are SMs). Each SM can receive 32 B/cycle from L2.

I see. Thank you very much.