GPU bandwidth

Hi,

I would like to know the bandwidth for loading data from GPU DRAM to shared memory, and the bandwidth for loading data from DRAM to registers.

For example, on RTX 3090, which is higher?

Can we profile these two stats via Nsight Compute or some other simple method?

Apologies, the morning coffee hadn’t kicked in and I misread your post, hence deleted.

I wouldn’t expect either one to be higher. In the ordinary case, they both follow roughly the same path:

DRAM->L2->L1->register->shared

The only difference is the last step, which should not have a meaningful effect on observable bandwidth, assuming no other activity on the GPU and sensible coding.
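A minimal sketch of that path (the kernel name, tile size, and data type are illustrative, not from the thread): each thread loads one element from global memory, which lands in a register, and then stores it to shared memory as the final step.

```cuda
// Sketch only: stages a tile per block from global memory (DRAM->L2->L1)
// through a register into shared memory, then uses it.
__global__ void stage_to_shared(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];            // assumes blockDim.x == 256

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float v = g_in[idx];               // DRAM->L2->L1->register
        tile[threadIdx.x] = v;             // register->shared (the extra step)
    }
    __syncthreads();

    // ... do work that reads tile[] ...
    if (idx < n)
        g_out[idx] = tile[threadIdx.x];
}
```

Measuring either variant (with or without the final shared store) in Nsight Compute should report essentially the same DRAM throughput.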

Nsight Compute has a memory chart as well as a memory report section that you may want to become familiar with.

Of course, if you have significant “other” usage of shared memory in your code, then the transfer of data to shared memory could become a bottleneck. But considering only the register->shared step on the RTX 3090, it should support (with 64- or 128-bit per-thread stores) up to 128 bytes per SM per clock, which exceeds the DRAM bandwidth on the RTX 3090, so the shared store at the end should not be a performance limiter by itself. (&)

Likewise, if you have significant register pressure in your code, then the register footprint needed to move the data to shared memory could be an issue. In that case, you could investigate an async global->shared copy, which is a new feature in Ampere and discussed in many forum posts.

(&) The math:
The RTX 3090 has 82 SMs (you can confirm with deviceQuery) and a base clock of 1395 MHz. The reported DRAM bandwidth is 936 GB/s. The register->shared bandwidth would be 128 bytes x 82 SMs x 1.395 GHz = 14,642 GB/s aggregate, device-wide. This far exceeds the 936 GB/s of DRAM bandwidth, so DRAM bandwidth should be the limiting factor in both cases. Even if we drop the per-thread store quantity to 32 bits, this only halves the register->shared bandwidth (on GA10x), and that bandwidth still far exceeds DRAM bandwidth.

cuda::memcpy_async on GA10x supports global->shared without going through the register file. The blog Controlling Data Movement to Boost Performance on the NVIDIA Ampere Architecture provides more information.
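A sketch of the async-copy variant using the cooperative groups wrapper (CUDA 11+; kernel name and tile size are illustrative). On GA10x the copy bypasses the register file; on older architectures it falls back to a synchronous copy:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: assumes blockDim.x == 256 and n is a multiple of 256,
// so the per-block copy never runs past the end of g_in.
__global__ void async_stage_to_shared(const float *g_in, float *g_out, int n)
{
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    int base = blockIdx.x * blockDim.x;
    // Asynchronous global->shared copy, hardware-accelerated on GA10x.
    cg::memcpy_async(block, tile, g_in + base, sizeof(float) * blockDim.x);
    cg::wait(block);                       // wait until the copy is visible

    int idx = base + threadIdx.x;
    if (idx < n)
        g_out[idx] = tile[threadIdx.x];
}
```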

A well-written benchmark that copies from all SMs and has sufficient threads to hide latency should be able to copy global->RF or global->shared at 80-95% of the L2 or DRAM bandwidth, depending on where the data resides. For example, if the operation prior to the kernel was a cudaMemcpy HtoD of the data you want in shared memory, then the test program may be able to exceed DRAM bandwidth if the data is resident in the L2. When using NCU, the default setting --cache-control all will invalidate the L2 prior to profiling the grid.

On a 3090 the maximum bandwidth will start to fall as the number of active SMs decreases (e.g. if you launch fewer thread blocks than there are SMs). Each SM can receive 32 B/cycle from L2.

I see. Thank you very much.