What is bandwidth in NV's hardware?

The core issue with multi-stage pipelining is understanding what the bandwidth really is. Suppose there are two “memory read units”, each with a constant read speed. In that model, the total bandwidth depends on whether both units are continuously issuing read requests.

If the two units are instead connected to a single pipeline with a flow-rate limit, then no matter how many requests they issue, the total bandwidth won’t increase beyond the pipeline’s capacity. Common sense suggests it should be the latter, but I’ve come to realize that, at least with DRAM, it seems to be the former (does DRAM have many ports, each with constant bandwidth?).
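To make the distinction concrete with made-up numbers: if each unit can sustain 50 GB/s, the first model gives 100 GB/s when both issue continuously and 50 GB/s when only one does, while the second model caps the total at the shared pipeline’s limit, say 80 GB/s, no matter how many units are issuing.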

For example, when I measure L1 bandwidth with a block that occupies an entire SM, the bandwidth (total data moved / time) increases linearly with the number of threads in the block. This seems to indicate that even with 1024 threads, a simple kernel that reads values from global memory into shared memory does not fully utilize the L1 bandwidth.
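Below is a minimal sketch of the kind of measurement described (not the original benchmark; the kernel name, sizes, and iteration count are my own): a single block is launched so it occupies one SM, each thread copies 32-bit words from global memory into shared memory, and the effective bandwidth is computed as bytes moved divided by elapsed time.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one 32-bit value per iteration from global memory
// into shared memory, then consumes it so the loads are not optimized away.
__global__ void copy_to_shared(const float* __restrict__ src, float* dst, int iters)
{
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        smem[tid] = src[i * blockDim.x + tid];   // one 32-bit global load per thread
        __syncthreads();
        acc += smem[tid];
        __syncthreads();
    }
    if (acc == -1.0f) dst[tid] = acc;            // never true; keeps the compiler honest
}

int main()
{
    const int iters = 1 << 14;
    const int maxThreads = 1024;
    float *src, *dst;
    cudaMalloc(&src, sizeof(float) * iters * maxThreads);
    cudaMalloc(&dst, sizeof(float) * maxThreads);
    cudaMemset(src, 0, sizeof(float) * iters * maxThreads);

    for (int threads = 32; threads <= maxThreads; threads *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        // One block, so the kernel runs on a single SM.
        copy_to_shared<<<1, threads, threads * sizeof(float)>>>(src, dst, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double bytes = double(iters) * threads * sizeof(float);
        printf("%4d threads: %.1f GB/s\n", threads, bytes / ms / 1e6);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

With this kind of kernel, the reported GB/s grows roughly linearly with the thread count, which is the behavior described above.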

Whether you can fill the L1 bandwidth depends on the architecture and the instruction mix. There are ways to increase bandwidth utilization, e.g. use 128-bit vector loads (instead of 32 bits per thread) and additionally use the texture path. Shared memory is shared between the 4 SM partitions; for full L1 bandwidth, do not route the reads through shared memory.
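A hypothetical sketch of those two suggestions, using float4 for 128-bit loads and __ldg() to route the read through the read-only/texture path; the kernel and buffer names are illustrative, not from the original post:

```cpp
#include <cuda_runtime.h>

__global__ void vectorized_read(const float4* __restrict__ src,
                                float* __restrict__ out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        // One 128-bit load through the read-only (texture) data path,
        // instead of four separate 32-bit loads.
        float4 v = __ldg(&src[i]);
        // Consume the value directly; no round trip through shared memory.
        out[i] = v.x + v.y + v.z + v.w;
    }
}
```

A launch such as vectorized_read<<<(n4 + 255) / 256, 256>>>(src, out, n4) then issues one 128-bit load per thread over the float4-typed buffer.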
