I have been reading about cuda::memcpy_async, available on Ampere and newer architectures, and about the CUTLASS pipelining strategy (CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog). The software-pipelining diagram there seems to depict the store of one tile (say Block 1) from global memory to shared memory happening in parallel with the load of another tile (say Block 2) from shared memory into registers.
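For concreteness, here is a minimal sketch of the pattern I mean, modeled on the multi-stage pipeline example in the CUDA Programming Guide. The kernel name, the TILE size, and the trivial sum standing in for real compute are all placeholders, not my actual matMul code:

```cpp
#include <cooperative_groups.h>
#include <cuda/pipeline>

namespace cg = cooperative_groups;

constexpr int TILE = 128;  // placeholder tile size, not from any real kernel

// Double-buffered sketch: while the async copy of the next tile
// ("store shared") is in flight, threads read the current tile from
// shared memory ("load shared").
__global__ void pipelined_sum(const float* __restrict__ in,
                              float* __restrict__ out,
                              int num_tiles)
{
    __shared__ float smem[2][TILE];  // two buffers: one filling, one being read

    auto block = cg::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto pipe = cuda::make_pipeline(block, &state);

    float acc = 0.f;

    // Prime the pipeline: start the asynchronous copy of tile 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[0], in, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int t = 0; t < num_tiles; ++t) {
        // Kick off the copy of the next tile before consuming the current
        // one, so the copy can overlap the shared-memory reads below.
        if (t + 1 < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[(t + 1) & 1], in + (t + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();                       // wait for tile t to land
        for (int i = threadIdx.x; i < TILE; i += blockDim.x)
            acc += smem[t & 1][i];                  // "load shared" of tile t
        pipe.consumer_release();                    // free the buffer for reuse
    }

    atomicAdd(out, acc);  // out must be zero-initialized by the caller
}
```

In this sketch, the cuda::memcpy_async issued for tile t+1 is in flight while the loop body reads tile t out of shared memory, which is exactly the overlap I am asking about.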
I have a query in this regard. Do the "store shared" and "load shared" operations happen truly in parallel (simultaneously, not merely interleaved or concurrent)? In other words, does shared memory support a full-duplex mode where a write and a read proceed at the same time?
I am currently trying to write an optimized GPU matMul kernel. If shared memory does not support full duplex, I might have to change my code.