I have been reading about cuda::memcpy_async, available on Ampere and newer architectures, and about the CUTLASS pipelining strategy (CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog). The software-pipelining diagram there seems to depict the store of one tile (say Block 1) from global memory to shared memory happening in parallel with the load of another tile (say Block 2) from shared memory into registers.
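For concreteness, here is a minimal sketch of the pattern I mean, modeled on the multi-stage pipeline example in the CUDA Programming Guide. The kernel name, the TILE size, and the trivial sum standing in for real compute are all placeholders, not my actual matMul code:

```cpp
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int TILE = 128;  // placeholder tile size, not from any real kernel

// Double-buffered sketch: while the async copy of the next tile
// ("store shared") is in flight, threads read the current tile from
// shared memory ("load shared").
__global__ void pipelined_sum(const float* __restrict__ in,
                              float* __restrict__ out,
                              int num_tiles)
{
    __shared__ float smem[2][TILE];  // two buffers: one filling, one being read

    auto block = cg::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto pipe = cuda::make_pipeline(block, &state);

    float acc = 0.f;

    // Prime the pipeline: start the asynchronous copy of tile 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[0], in, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        // Kick off the copy of the next tile before consuming the current
        // one, so the copy can overlap the shared-memory reads below.
        if (t + 1 < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[(t + 1) & 1], in + (t + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();                       // wait for tile t to land
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            acc += smem[t & 1][i];                  // "load shared" of tile t
        pipe.consumer_release();                    // free the buffer for reuse
    }

    atomicAdd(out, acc);  // out must be zero-initialized by the caller
}
```

In this sketch, the cuda::memcpy_async issued for tile t+1 is in flight while the loop body reads tile t out of shared memory, which is exactly the overlap I am asking about.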
I have a query in this regard. Do the "store shared" and "load shared" operations happen truly in parallel (simultaneously, not merely interleaved or concurrent)? In other words, does shared memory support a full-duplex mode where a write and a read proceed at the same time?
I am currently trying to write an optimized GPU matMul kernel. If shared memory does not support full duplex, I might have to change my code.