About async loading

Curefab · March 13, 2025, 11:30am

The main limits are the speed of L2 providing data and the speed of shared memory storing the data.

Which resource do you want to free up by using async?

Why are you not already getting full speed when loading from L2 and storing to shared memory? L2 should be the slower of the two.

If the limitation is loading the data fast enough to feed the Tensor Cores, then:

- the Tensor Core speed is a hard limit (unless you can improve the math and use fewer multiplications)
- the data load from L2 or global memory has to be efficient (loads should come in packages of at least 32 bytes); if data is needed repeatedly, keep it in L1, shared memory, or registers
- you have to make sure that shared memory is not a bottleneck
- other arithmetic and data-handling instructions can be used in moderation
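
To make the points above concrete, here is a minimal sketch of an async global-to-shared copy using the pipeline primitives from `<cuda_pipeline.h>`. It is not taken from your code: the kernel `tile_sum`, the helper `run_tile_sum`, and the sizes `kThreads` and `kTileFloats` are names and values I made up for illustration. Each thread copies one 16-byte `float4`, so a warp requests 512 contiguous bytes, comfortably above the 32-byte minimum. On devices older than compute capability 8.0 the copy should fall back to a synchronous one.

```cuda
#include <cuda_pipeline.h>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

// Hypothetical sizes chosen for this sketch only.
constexpr int kThreads = 128;              // threads per block
constexpr int kTileFloats = kThreads * 4;  // one float4 (16 bytes) per thread

// Each block stages one tile in shared memory via an async copy, then sums it.
__global__ void tile_sum(const float4* __restrict__ in, float* __restrict__ out) {
    __shared__ float4 tile[kThreads];

    const int t = threadIdx.x;
    const size_t g = static_cast<size_t>(blockIdx.x) * kThreads + t;

    // 16-byte async copy from global to shared memory, bypassing registers.
    // Consecutive threads read consecutive float4s, so the access is coalesced.
    __pipeline_memcpy_async(&tile[t], &in[g], sizeof(float4));
    __pipeline_commit();

    // Independent work could be placed here while the copy is in flight.

    __pipeline_wait_prior(0);  // wait for this thread's own copies
    __syncthreads();           // make every thread's copy visible to the block

    // Serial sum by thread 0, for brevity only; a real kernel would
    // consume the tile in parallel (for example with MMA instructions).
    if (t == 0) {
        float s = 0.0f;
        for (int i = 0; i < kThreads; ++i) {
            s += tile[i].x + tile[i].y + tile[i].z + tile[i].w;
        }
        out[blockIdx.x] = s;
    }
}

// Host helper: fills `blocks` tiles with ones, runs the kernel, and returns
// the sum of all per-block results. Error checking is omitted for brevity.
float run_tile_sum(int blocks) {
    assert(blocks > 0);
    const size_t n = static_cast<size_t>(blocks) * kTileFloats;
    std::vector<float> h_in(n, 1.0f);
    std::vector<float> h_out(blocks, 0.0f);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));  // cudaMalloc memory is aligned well beyond 16 bytes
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    tile_sum<<<blocks, kThreads>>>(reinterpret_cast<const float4*>(d_in), d_out);

    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float v : h_out) total += v;
    return total;
}
```

With the all-ones input, `run_tile_sum(4)` should return 2048 (4 blocks of 512 floats each). Whether the async version actually helps depends on which resource is limiting: it mainly saves registers and load instructions, and it does not make L2 or shared memory themselves any faster.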
