Memory bandwidth as a function of SM count

Hi, I’m not familiar with the underlying design of the memory access hardware in the A100.

I’m curious whether the theoretical global memory bandwidth (i.e., 1555 GB/s) can be achieved using only one SM.

Also, what memory bandwidth does each SM get when multiple SMs are involved in a kernel execution? Do the SMs access memory without contending with each other, or do they share the same bus for memory requests (meaning that one SM’s bandwidth can be taken by other SMs)?

No, a single SM cannot reach the full theoretical bandwidth. To take full advantage of the available memory bandwidth, you need a kernel that uses multiple SMs. I don’t have further information that would be useful here. You may wish to study the A100 whitepaper, although I’m not suggesting that all of your questions are answered there.

If you wrote a copy kernel with a grid-stride loop, you could probably microbenchmark this yourself to get some data: launch it with a varying number of blocks and see how the achieved bandwidth changes.
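A minimal sketch of such a microbenchmark might look like the following. It is an illustration under assumptions, not a definitive measurement tool. The names `copy_kernel` and `CHECK`, the 256 MiB buffer size, the 1024-thread block size, and the doubling sweep over block counts are all my own choices. It also assumes that when the number of blocks does not exceed the number of SMs, the block scheduler places each block on a different SM. That is typical behavior but not something CUDA guarantees, so treat the block count as only an approximation of "SMs in use".

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Grid-stride copy: the whole grid walks the array together, so any number
// of blocks (even one) copies all n elements.
__global__ void copy_kernel(const float4* __restrict__ in,
                            float4* __restrict__ out, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        out[i] = in[i];
    }
}

int main()
{
    const size_t n = (size_t)1 << 24;          // 16M float4 elements
    const size_t bytes = n * sizeof(float4);   // 256 MiB, well beyond L2 size
    const int threads = 1024;                  // assumed block size

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    const int num_sms = prop.multiProcessorCount;

    float4 *in = nullptr, *out = nullptr;
    CHECK(cudaMalloc(&in, bytes));
    CHECK(cudaMalloc(&out, bytes));
    CHECK(cudaMemset(in, 0, bytes));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // Sweep the block count 1, 2, 4, ... and finish at one block per SM.
    int blocks = 1;
    while (true) {
        copy_kernel<<<blocks, threads>>>(in, out, n);   // warm-up run
        CHECK(cudaGetLastError());

        CHECK(cudaEventRecord(start));
        copy_kernel<<<blocks, threads>>>(in, out, n);   // timed run
        CHECK(cudaEventRecord(stop));
        CHECK(cudaEventSynchronize(stop));

        float ms = 0.0f;
        CHECK(cudaEventElapsedTime(&ms, start, stop));

        // Each element is read once and written once, hence 2 * bytes.
        double gbps = 2.0 * (double)bytes / (ms * 1e-3) / 1e9;
        printf("blocks = %3d  time = %8.3f ms  bandwidth = %8.1f GB/s\n",
               blocks, ms, gbps);

        if (blocks == num_sms) break;
        blocks = (blocks * 2 < num_sms) ? blocks * 2 : num_sms;
    }

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(in));
    CHECK(cudaFree(out));
    return 0;
}
```

Compiling with something like `nvcc -O3 -arch=sm_80 copy_bw.cu` (the file name is arbitrary) and running it prints one line per block count. If the bandwidth rises with the number of blocks and then flattens below the theoretical peak, that suggests a per-SM limit at low block counts and a shared memory-system limit at high ones. A single timed run is noisy, so averaging several runs would give steadier numbers.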