Shared memory - full duplex support?

I have been reading about cuda::memcpy_async, available on Ampere and newer architectures, and about the CUTLASS pipelining strategy (CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog). The software-pipelining figure there seems to depict the store of one tile (say Block 1) from global memory to shared memory happening in parallel with the load of another tile (say Block 2) from shared memory into registers.

I have a query in this regard. Do the "store shared" and "load shared" operations actually happen in parallel (truly simultaneous, not just interleaved or concurrent), i.e. does shared memory support a full-duplex mode where a write and a read are serviced in the same cycle?

I am currently trying to write an optimized GPU matMul kernel. If shared memory does not support full duplex, I may have to restructure my code.
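For context, here is roughly the double-buffered pattern I mean, sketched with the cuda::pipeline API (the tile size, names, and the dummy per-tile "compute" are my placeholders, not taken from the blog):

```cuda
#include <cuda/pipeline>
#include <cooperative_groups.h>

namespace cg = cooperative_groups;

constexpr int TILE = 128;   // illustrative tile size

__global__ void pipelined_sum(const float* __restrict__ gA,
                              float* __restrict__ out, int ntiles)
{
    // Two shared-memory stages: while stage `nxt` is being filled
    // (global -> shared, the "store shared" side), stage `cur` is being
    // consumed (shared -> registers, the "load shared" side).
    __shared__ float buf[2][TILE];
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;

    auto block = cg::this_thread_block();
    auto pipe  = cuda::make_pipeline(block, &state);

    // Prime stage 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, buf[0], gA, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    float acc = 0.f;
    for (int k = 0; k < ntiles; ++k) {
        int cur = k & 1, nxt = 1 - cur;
        if (k + 1 < ntiles) {                 // issue next stage's fill
            pipe.producer_acquire();
            cuda::memcpy_async(block, buf[nxt], gA + (k + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }
        pipe.consumer_wait();                 // stage `cur` is ready
        acc += buf[cur][threadIdx.x % TILE];  // dummy compute ("load shared")
        block.sync();
        pipe.consumer_release();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = acc;
}
```

My question is about what happens when the async fill's shared-memory write and my shared-memory read would land in the same clock cycle.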

Shared memory bandwidth is 32 bits per bank per clock, per SM. It does not matter whether the access is a read or a write; you cannot exceed that bandwidth even if you are doing both reads and writes. Across all the SMSPs on an SM, shared memory can service at most one LDS or one STS instruction per clock cycle; you cannot do two per clock (one LDS and one STS). One might question that bandwidth statement given the age of that blog article, but there is no indication it has changed, at least through Volta.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.