Global-to-shared memory transfer overlap with computation

I have written codes for tiled matrix multiplication using shared memory.

One is written naively, without asynchronous memory transfers: bring a tile of the matrix from global to shared memory, wait until the transfer is complete, then compute on it. Bring the next tile and repeat the process.
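
For context, here is a minimal sketch of that synchronous version, assuming square N×N matrices with N a multiple of the tile width; the tile size and kernel name are placeholders of mine, not necessarily those used in the attached code:

```cpp
#define TILE 32  // assumed tile width, not necessarily the one in the attached code

// Synchronous tiled matmul: load a tile, barrier, compute, barrier, repeat.
__global__ void matmul_tiled_sync(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Blocking loads: each thread copies one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait for the whole tile to arrive

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // tile fully consumed before it is overwritten
    }
    C[row * N + col] = acc;
}
```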

The other code uses the cuda::pipeline API, which overlaps the memory transfer for the Nth stage with the computation of the (N-1)th stage using cuda::memcpy_async.
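
For reference, a minimal sketch of that pattern, under the same assumptions as above (square matrices, per-element copies); the stage count, tile width, and kernel name are placeholders of mine, and the attached code may structure the loops differently:

```cpp
#include <cuda/pipeline>

#define TILE   32   // assumed tile width
#define STAGES 2    // pipeline depth (as in matmul_2s); 3/4/5 in the other variants

// Pipelined tiled matmul: the tile for stage N is fetched with cuda::memcpy_async
// while the tile for stage N-1 is being multiplied.
__global__ void matmul_tiled_async(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[STAGES][TILE][TILE];
    __shared__ float Bs[STAGES][TILE][TILE];

    // Per-thread pipeline; the __syncthreads() calls below make each stage
    // visible to the whole block before it is consumed or reused.
    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    int num_tiles = N / TILE;   // assumes N is a multiple of TILE

    for (int fetch = 0, compute = 0; compute < num_tiles; ++compute) {
        // Keep up to STAGES tile loads in flight ahead of the compute stage.
        for (; fetch < num_tiles && fetch < compute + STAGES; ++fetch) {
            int s = fetch % STAGES;
            pipe.producer_acquire();
            cuda::memcpy_async(&As[s][threadIdx.y][threadIdx.x],
                               &A[row * N + fetch * TILE + threadIdx.x],
                               sizeof(float), pipe);
            cuda::memcpy_async(&Bs[s][threadIdx.y][threadIdx.x],
                               &B[(fetch * TILE + threadIdx.y) * N + col],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }

        int s = compute % STAGES;
        pipe.consumer_wait();   // this thread's copies for stage s have landed
        __syncthreads();        // ...and so have every other thread's

        for (int k = 0; k < TILE; ++k)
            acc += As[s][threadIdx.y][k] * Bs[s][k][threadIdx.x];

        __syncthreads();        // whole block done with stage s before it is refilled
        pipe.consumer_release();
    }
    C[row * N + col] = acc;
}
```

The inner fetch loop keeps up to STAGES tile loads in flight, so while the tile for iteration `compute` is being multiplied, the copies for the next iterations can already be moving from global to shared memory.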

The second code, which is expected to perform better, does so on my machine (RTX 3050) but performs worse on an A100 machine.

GPU-PipelineBoost.zip (2.6 MB)

Steps to follow after downloading the .zip file

  1. Unzip the file
  2. Run results.sh from the root directory
  3. Runtime results will be in the output/figs directory

The shell script averages the runtime of each code over 6 runs, and a final plot compares the runtimes.

The codes being run are matmul_2s, matmul_3s, matmul_4s, matmul_5s, no_Warp, and matmul:

  1. The first four (which differ in the number of pipeline stages) are coded using the cuda::pipeline API (asynchronous memory transfers)

  2. no_Warp does matrix multiplication using shared memory but without asynchronous memory transfers

  3. matmul does matrix multiplication by accessing global memory directly (a minimal sketch of this baseline follows after this list)
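
A minimal sketch of that direct-global-memory baseline, again with placeholder names of my own rather than the ones in the attached code:

```cpp
// Direct-global-memory matmul: every operand is read straight from global
// memory, with no shared-memory tiling.
__global__ void matmul_global(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```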

The question is: why are the pipeline codes not performant on the A100?

NOTE: if the script doesn't run properly, check the path to your nvcc compiler and specify the -arch flag according to the compute capability of your GPU (e.g. -arch=sm_86 for the RTX 3050, -arch=sm_80 for the A100).