The performance results of the CUDA sample globalToShmemAsyncCopy are puzzling

I have tested this sample on a 2070S and a 3090, and on both devices the Naive kernel is the fastest. Does this sample only demonstrate how to use the API, rather than its performance benefit?
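For reference, the API the sample exercises is the asynchronous global-to-shared copy (cooperative_groups::memcpy_async, which compiles to cp.async on compute capability 8.0+ and falls back to a staged copy on older GPUs). Below is a minimal sketch of that copy pattern, not the sample's actual kernel; TILE and tile_load_async are assumed names used only for illustration.

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Sketch only: copy one TILE x TILE tile from global to shared memory
// asynchronously, then wait before computing on it.
template <int TILE>
__global__ void tile_load_async(const float* __restrict__ gmem, int ld)
{
    __shared__ float smem[TILE][TILE];
    cg::thread_block tb = cg::this_thread_block();

    // Top-left element of the tile this block is responsible for
    // (assumes a 2D grid with one block per tile).
    const float* src = gmem + blockIdx.y * TILE * ld + blockIdx.x * TILE;

    for (int row = 0; row < TILE; ++row) {
        // Collective call: on SM 8.x this lowers to cp.async and bypasses
        // registers; on older architectures it degrades to a normal copy.
        cg::memcpy_async(tb, &smem[row][0], src + row * ld, sizeof(float) * TILE);
    }
    cg::wait(tb);  // block until all outstanding async copies have completed

    // ... compute on smem ...
}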

Here is the log from the 3090:

----------------------------- SM 86 -----------------------------
Running kernel = 0 - AsyncCopyMultiStageLargeChunk
Computing result using CUDA Kernel…
done
Performance= 2715.96 GFlop/s, Time= 50.604 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 1 - AsyncCopyLargeChunk
Computing result using CUDA Kernel…
done
Performance= 2721.41 GFlop/s, Time= 50.503 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 2 - AsyncCopyLargeChunkAWBarrier
Computing result using CUDA Kernel…
done
Performance= 1641.61 GFlop/s, Time= 83.722 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 3 - AsyncCopyMultiStageSharedState
Computing result using CUDA Kernel…
done
Performance= 1815.21 GFlop/s, Time= 75.715 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 4 - AsyncCopyMultiStage
Computing result using CUDA Kernel…
done
Performance= 2678.29 GFlop/s, Time= 51.316 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 5 - AsyncCopySingleStage
Computing result using CUDA Kernel…
done
Performance= 2688.78 GFlop/s, Time= 51.116 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 6 - Naive
Computing result using CUDA Kernel…
done
Performance= 3129.04 GFlop/s, Time= 43.924 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

Running kernel = 7 - NaiveLargeChunk
Computing result using CUDA Kernel…
done
Performance= 3050.63 GFlop/s, Time= 45.053 msec, Size= 137438953472 Ops, WorkgroupSize= 256 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.