About async loading

The main limits are how fast L2 can supply the data and how fast shared memory can store it.

Which resource do you want to free up by using async?

Why aren’t you already getting full speed when loading from L2 and storing to shared memory? L2 should be the slower of the two.

If the limitation is loading the data fast enough to feed the Tensor Cores, then

  • the Tensor Core speed is a hard limit (unless you can improve the math and use fewer multiplications)
  • the data load from L2 or global memory has to be efficient (and each load should cover at least 32 bytes); if data is needed repeatedly, keep it in L1, shared memory, or registers
  • you have to make sure that shared memory is not a bottleneck
  • other arithmetic and data handling can be used, but only in moderation
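
To make the points above concrete, here is a minimal sketch of double-buffered async loading from global memory (via L2) into shared memory, using `cooperative_groups::memcpy_async`. It is an illustration under assumptions, not your kernel: `tile_sum`, `run_tile_sum`, and the tile size `TILE` are hypothetical names and values, and a plain sum stands in for the Tensor Core work. The hardware-accelerated async copy needs compute capability 8.0 or newer; on older GPUs the same call should fall back to an ordinary copy.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

namespace cg = cooperative_groups;

constexpr int TILE = 256;  // floats per tile, also the block size (assumed value)

// Each block sums its own slice of `in`. Tiles are staged in shared memory
// with two buffers: tile t+1 is being copied while tile t is consumed.
__global__ void tile_sum(const float* in, float* out, int tiles_per_block)
{
    __shared__ __align__(16) float buf[2][TILE];
    cg::thread_block block = cg::this_thread_block();
    const float* src = in + (size_t)blockIdx.x * tiles_per_block * TILE;

    float acc = 0.0f;
    // Start the copy of the first tile (the size argument is in bytes).
    cg::memcpy_async(block, buf[0], src, sizeof(float) * TILE);

    for (int t = 0; t < tiles_per_block; ++t) {
        int cur = t & 1;
        if (t + 1 < tiles_per_block) {
            // Prefetch the next tile into the other buffer ...
            cg::memcpy_async(block, buf[cur ^ 1], src + (t + 1) * TILE,
                             sizeof(float) * TILE);
            // ... and wait only for the current tile; the newest copy stays in flight.
            cg::wait_prior<1>(block);
        } else {
            cg::wait(block);  // last tile: wait for everything
        }
        acc += buf[cur][threadIdx.x];  // stand-in for the real compute
        block.sync();  // all reads of buf[cur] finish before it is overwritten
    }
    atomicAdd(&out[blockIdx.x], acc);
}

// Hypothetical host helper: fills the input with 1.0f and returns one sum per block.
std::vector<float> run_tile_sum(int blocks, int tiles_per_block)
{
    size_t n = (size_t)blocks * tiles_per_block * TILE;
    std::vector<float> h_in(n, 1.0f), h_out(blocks, 0.0f);
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, blocks * sizeof(float));
    tile_sum<<<blocks, TILE>>>(d_in, d_out, tiles_per_block);
    assert(cudaGetLastError() == cudaSuccess);  // launch was accepted
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `run_tile_sum(8, 4)` from `main` should give eight sums of 1024 each (4 tiles of 256 ones). The async copy bypasses registers on the way to shared memory, so the question of which resource you want to free up still applies: if L2 bandwidth or shared memory bandwidth is already the limit, this pattern hides latency but does not raise the ceiling.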