What is bandwidth in NV's hardware?

The core issue with multi-stage pipelining is understanding what the bandwidth really is. Suppose there are two “memory read units”, each with a constant read speed. In that model, the total bandwidth depends on whether both units are continuously issuing read requests.

If the two units are instead connected to a single pipeline with a flow-rate limit, then no matter how many requests they issue, the total bandwidth won’t increase beyond the pipeline’s capacity. Common sense suggests it should be the latter, but I’ve come to realize that, at least with DRAM, it seems to be the former (does DRAM have many ports, each with constant bandwidth?).
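To make the distinction concrete with made-up numbers: if each unit can sustain 50 GB/s, the first model gives 100 GB/s when both issue continuously and 50 GB/s when only one does, while the second model caps the total at the shared pipeline’s limit, say 80 GB/s, no matter how many units are issuing.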

For example, when I measure L1 bandwidth with a block that occupies an entire SM, the bandwidth (total data moved / time) increases linearly with the number of threads in the block. This seems to indicate that even with 1024 threads, a simple kernel that reads values from global memory into shared memory does not fully utilize the L1 bandwidth.
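Below is a minimal sketch of the kind of measurement described (not the original benchmark; the kernel name, sizes, and iteration count are my own): a single block is launched so it occupies one SM, each thread copies 32-bit words from global memory into shared memory, and the effective bandwidth is computed as bytes moved divided by elapsed time.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Each thread copies one 32-bit value per iteration from global memory
// into shared memory, then consumes it so the loads are not optimized away.
__global__ void copy_to_shared(const float* __restrict__ src, float* dst, int iters)
{
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    float acc = 0.0f;
    for (int i = 0; i < iters; ++i) {
        smem[tid] = src[i * blockDim.x + tid];   // one 32-bit global load per thread
        __syncthreads();
        acc += smem[tid];
        __syncthreads();
    }
    if (acc == -1.0f) dst[tid] = acc;            // never true; keeps the compiler honest
}

int main()
{
    const int iters = 1 << 14;
    const int maxThreads = 1024;
    float *src, *dst;
    cudaMalloc(&src, sizeof(float) * iters * maxThreads);
    cudaMalloc(&dst, sizeof(float) * maxThreads);
    cudaMemset(src, 0, sizeof(float) * iters * maxThreads);

    for (int threads = 32; threads <= maxThreads; threads *= 2) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        // One block, so the kernel runs on a single SM.
        copy_to_shared<<<1, threads, threads * sizeof(float)>>>(src, dst, iters);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double bytes = double(iters) * threads * sizeof(float);
        printf("%4d threads: %.1f GB/s\n", threads, bytes / ms / 1e6);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

With this kind of kernel, the reported GB/s grows roughly linearly with the thread count, which is the behavior described above.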

Whether you can fill the L1 bandwidth depends on the architecture and the instruction mix. There are ways to increase bandwidth utilization, e.g. use 128-bit vector loads (instead of 32 bits per thread) and additionally use the texture path. Shared memory is shared between the 4 SM partitions; for full L1 bandwidth, do not route the reads through shared memory.
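A hypothetical sketch of those two suggestions, using float4 for 128-bit loads and __ldg() to route the read through the read-only/texture path; the kernel and buffer names are illustrative, not from the original post:

```cpp
#include <cuda_runtime.h>

__global__ void vectorized_read(const float4* __restrict__ src,
                                float* __restrict__ out, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        // One 128-bit load through the read-only (texture) data path,
        // instead of four separate 32-bit loads.
        float4 v = __ldg(&src[i]);
        // Consume the value directly; no round trip through shared memory.
        out[i] = v.x + v.y + v.z + v.w;
    }
}
```

A launch such as vectorized_read<<<(n4 + 255) / 256, 256>>>(src, out, n4) then issues one 128-bit load per thread over the float4-typed buffer.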
