Global-to-shared memory transfer overlap with computation

I have written codes for tiled matrix multiplication using shared memory.

One is written naively, without asynchronous memory transfers: bring a tile of the matrix from global to shared memory, wait until the transfer is complete, then compute on it. Bring the next tile and repeat the process.
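
For context, here is a minimal sketch of that synchronous version, assuming square N×N matrices with N a multiple of the tile width; the tile size and kernel name are placeholders of mine, not necessarily those used in the attached code:

```cpp
#define TILE 32  // assumed tile width, not necessarily the one in the attached code

// Synchronous tiled matmul: load a tile, barrier, compute, barrier, repeat.
__global__ void matmul_tiled_sync(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Blocking loads: each thread copies one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                       // wait for the whole tile to arrive

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // tile fully consumed before it is overwritten
    }
    C[row * N + col] = acc;
}
```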

The other code uses the cuda::pipeline API, which overlaps the memory transfer for the Nth stage with the computation of the (N-1)th stage using cuda::memcpy_async.
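
For reference, a minimal sketch of that pattern, under the same assumptions as above (square matrices, per-element copies); the stage count, tile width, and kernel name are placeholders of mine, and the attached code may structure the loops differently:

```cpp
#include <cuda/pipeline>

#define TILE   32   // assumed tile width
#define STAGES 2    // pipeline depth (as in matmul_2s); 3/4/5 in the other variants

// Pipelined tiled matmul: the tile for stage N is fetched with cuda::memcpy_async
// while the tile for stage N-1 is being multiplied.
__global__ void matmul_tiled_async(const float *A, const float *B, float *C, int N)
{
    __shared__ float As[STAGES][TILE][TILE];
    __shared__ float Bs[STAGES][TILE][TILE];

    // Per-thread pipeline; the __syncthreads() calls below make each stage
    // visible to the whole block before it is consumed or reused.
    cuda::pipeline<cuda::thread_scope_thread> pipe = cuda::make_pipeline();

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    int num_tiles = N / TILE;   // assumes N is a multiple of TILE

    for (int fetch = 0, compute = 0; compute < num_tiles; ++compute) {
        // Keep up to STAGES tile loads in flight ahead of the compute stage.
        for (; fetch < num_tiles && fetch < compute + STAGES; ++fetch) {
            int s = fetch % STAGES;
            pipe.producer_acquire();
            cuda::memcpy_async(&As[s][threadIdx.y][threadIdx.x],
                               &A[row * N + fetch * TILE + threadIdx.x],
                               sizeof(float), pipe);
            cuda::memcpy_async(&Bs[s][threadIdx.y][threadIdx.x],
                               &B[(fetch * TILE + threadIdx.y) * N + col],
                               sizeof(float), pipe);
            pipe.producer_commit();
        }

        int s = compute % STAGES;
        pipe.consumer_wait();   // this thread's copies for stage s have landed
        __syncthreads();        // ...and so have every other thread's

        for (int k = 0; k < TILE; ++k)
            acc += As[s][threadIdx.y][k] * Bs[s][k][threadIdx.x];

        __syncthreads();        // whole block done with stage s before it is refilled
        pipe.consumer_release();
    }
    C[row * N + col] = acc;
}
```

The inner fetch loop keeps up to STAGES tile loads in flight, so while the tile for iteration `compute` is being multiplied, the copies for the next iterations can already be moving from global to shared memory.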

The second code, which is expected to perform better, does so on my machine (RTX 3050) but performs worse on an A100 machine.

GPU-PipelineBoost.zip (2.6 MB)

Steps to follow after downloading the .zip file

  1. Unzip the file
  2. Run results.sh from the root directory
  3. Runtime results will be in the output/figs directory

The shell script averages the runtime of each code over 6 runs, and a final plot compares the runtimes.

The codes being run are matmul_2s, matmul_3s, matmul_4s, matmul_5s, no_Warp, and matmul:

  1. The first four (which differ in the number of pipeline stages) are coded using the cuda::pipeline API (asynchronous memory transfers)

  2. no_Warp does matrix multiplication using shared memory but without asynchronous memory transfers

  3. matmul does matrix multiplication by accessing global memory directly (a minimal sketch of this baseline follows after this list)
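
A minimal sketch of that direct-global-memory baseline, again with placeholder names of my own rather than the ones in the attached code:

```cpp
// Direct-global-memory matmul: every operand is read straight from global
// memory, with no shared-memory tiling.
__global__ void matmul_global(const float *A, const float *B, float *C, int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < N; ++k)
            acc += A[row * N + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}
```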

The question is: why are the pipeline codes not performant on the A100?

NOTE: if the script doesn't run properly, check the path to your nvcc compiler and specify the -arch flag according to the compute capability of your GPU (e.g. -arch=sm_86 for the RTX 3050, -arch=sm_80 for the A100).