About async loading

I have already loaded some data into shared memory, but because of bank conflicts I need to change its layout in shared memory in some iterations. Would using async loading be faster than reordering the data directly? (A direct load has to go through registers, but if the async copy could reuse the data already sitting in shared memory, it might cost less time.) I can't guarantee that this data is still in the L1 cache, but I'm sure it is in shared memory.

Do you want - for your layout - to reorder data between different read transactions, which would need the SM? E.g. recombine bytes with permute instructions? Or use large transactions per thread (16B) and then break up the data?

I need to swap data across 16 consecutive banks to reduce bank conflicts. A 64B chunk may be used as the first row of matrix A in one turn and as the third row of matrix A in the next turn. Although I can swizzle it when storing, it will still cause bank conflicts if I can't also swap it within the 16 banks.
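
Just to be concrete about what I mean by swizzling when storing - a generic sketch, not my actual kernel (each 64B chunk is treated as 16 x 4B elements):

```cpp
// Generic store-time XOR swizzle: XOR-ing the column index with the row index
// rotates which of the 16 banks each element lands in from one row to the next.
__device__ __forceinline__ int swizzled_col(int row, int col)
{
    return col ^ (row & 15);   // permute within a group of 16 banks
}

// store:  smem[row][swizzled_col(row, col)] = val;
// load :  val = smem[row][swizzled_col(row, col)];
```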

Generally async loading frees up threads for more compute.

But you lose flexibility for storing in shared memory.

More or less, the element each thread asynchronously reads goes directly into shared memory.

Without async you could, e.g., do 4 reads of 8 bytes each for every thread, rearrange the data in each thread, and write it to shared memory.
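
A minimal sketch of that pattern (tile size, indexing, and block size are made up):

```cpp
// Non-async path: stage the data in registers, reorder it there,
// then store it to shared memory in whatever layout you want.
// Assumes a 256-thread block; 'stride' is the distance between the 4 reads.
__global__ void load_via_registers(const double* __restrict__ gmem, int stride)
{
    __shared__ double smem[4 * 256];          // hypothetical tile

    double r[4];
    #pragma unroll
    for (int i = 0; i < 4; ++i)               // 4 reads of 8 bytes each per thread
        r[i] = gmem[threadIdx.x + i * stride];

    // Rearrange freely while the data is still in registers
    // (e.g. swap rows, apply a swizzle), then store.
    #pragma unroll
    for (int i = 0; i < 4; ++i)
        smem[i * 256 + threadIdx.x] = r[i];   // any layout you like

    __syncthreads();
    // ... consume smem ...
}
```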

Yes, I know I can rearrange them directly through registers. But I just wonder if I can save some time by using async copies, since I need to use the tensor cores heavily. If the data already stored in shared memory could be read by the async copy at the same speed as data sitting in the L1 cache, I could save that time entirely.

The main limits are the speed of L2 providing data and the speed of shared memory storing the data.

Which resource do you want to free up by using async?

Why don’t you have full speed for loading from L2 and storing in shared memory now? L2 should be slower than shared memory.

If the limitation is loading the data fast enough to feed the Tensor Cores, then

  • the Tensor Core speed is a hard limit (except if you can improve the math and use fewer multiplications)
  • the data load from L2 or global memory has to be efficient (and the loads should come in packages of at least 32 bytes); if data is needed repeatedly, use L1 or shared memory or registers
  • you have to make sure that shared memory is not a bottleneck
  • other arithmetic and data handling should only be used moderately

I think I don't understand async copy well. Do you mean that with async loading, data can only be loaded from L2 or gmem, but not from the L1 cache?

It can. But that would not be the preferred use case. L1 has similar bandwidth and latency to shared memory. Why would you copy from L1 to shared memory and then read from shared memory?

You could do it to slightly re-sort the data and prevent uncoalesced access, I guess.

Preferred would be to keep reliance on L1 to a minimum and directly copy to shared memory. And for repeated accesses reuse the first copy to/in shared memory.
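
A minimal sketch of that direct global-to-shared copy using the cooperative groups memcpy_async API (tile size is a placeholder; on Ampere and newer this lowers to cp.async and bypasses the register file):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void stage_tile(const float* __restrict__ gmem)
{
    constexpr int TILE = 4096;                 // elements per tile (made up)
    __shared__ float smem[TILE];

    auto block = cg::this_thread_block();

    // All threads cooperatively issue the async copy; the data goes straight
    // from global memory into shared memory without staging in registers.
    cg::memcpy_async(block, smem,
                     gmem + (size_t)blockIdx.x * TILE,
                     sizeof(float) * TILE);

    cg::wait(block);   // waits for the copy and synchronizes the block

    // ... ldmatrix / mma consume the tile from shared memory here ...
}
```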

So in your case you would copy several times from L1 to shared memory?

  • Because shared memory is not large enough?
  • Because each time the distribution on banks, when storing in shared memory, is different?
  • Because the threads do not know/remember where in shared memory the first copy went? Because the L1 source is dynamic, or your accesses are quite random and not predictable/regular?

Because the data in shared memory needs to be used several times by the tensor cores, but each time from a random row. So there will be serious bank conflicts that can't be avoided unless I change the layout of the data in shared memory frequently.

It can make sense.

But if you write to shared memory once per read, you occupy twice as much shared memory bandwidth, offsetting the improvements from avoiding bank conflicts.

The LD/ST units could possibly be used 3x as much: read from L1, write to shared, read from shared. I am not sure how much they are used for async copies.

Just be aware of this.

Besides bandwidth, you could reduce the number of transactions by using element sizes of 8B or 16B. That could be beneficial in this setup (if the data layout in global and shared memory allows that element size).
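
For example, one way to request 16B transactions on the async copy path is to pass an aligned size. This is only a sketch, and it assumes the pointers and the byte count really are 16-byte aligned (tile size is made up):

```cpp
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
#include <cuda/barrier>                        // cuda::aligned_size_t
namespace cg = cooperative_groups;

__global__ void stage_tile_16B(const float* __restrict__ gmem)
{
    constexpr int TILE_ELEMS = 64 * 128;       // made up
    __shared__ alignas(16) float smem[TILE_ELEMS];

    auto block = cg::this_thread_block();

    // aligned_size_t<16> promises 16-byte alignment and a size that is a
    // multiple of 16, which allows 16B cp.async transactions instead of 4B ones.
    cg::memcpy_async(block, smem, gmem,
                     cuda::aligned_size_t<16>(sizeof(float) * TILE_ELEMS));
    cg::wait(block);
}
```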

So the best I can hope for is that while the tensor cores are executing, the async copies use the idle shared memory bandwidth; otherwise they will slow down ldmatrix.

In the best case (for efficient reads + tensor core execution, not the best case for async copies) the 4 tensor cores per SM (1 per partition) and the 1 shared memory per SM are active at the same time. There is perhaps no idling.

They are executed at the same time for reasons at different levels:

  1. Shared memory reads are asynchronous (short scoreboard). If you unroll loops, the compiler may move the read for the next iteration so that it runs during a Tensor Core computation of the same warp.

  2. Each SM partition runs several warps at the same time; some warps are reading shared memory while others are running the tensor cores.

  3. Additionally, shared memory is shared between the partitions.

If you do not fully use the LD/ST units and do not fully use the shared memory bandwidth (which in addition is probably shared with L1, as L1 is the same silicon), then you can do async copies in parallel without negative performance impact.
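
As a rough sketch of what that overlap can look like in code - a two-stage double buffer where the copy of the next tile is in flight while the current tile is consumed (tile size and the compute step are placeholders):

```cpp
#include <cooperative_groups.h>
#include <cuda/pipeline>
namespace cg = cooperative_groups;

__global__ void pipelined(const float* __restrict__ gmem, int num_tiles)
{
    constexpr int TILE = 1024;                  // elements per tile (made up)
    __shared__ float smem[2][TILE];             // copy into one buffer while
                                                // computing from the other

    auto block = cg::this_thread_block();
    __shared__ cuda::pipeline_shared_state<cuda::thread_scope_block, 2> state;
    auto pipe = cuda::make_pipeline(block, &state);

    // Prefetch tile 0.
    pipe.producer_acquire();
    cuda::memcpy_async(block, smem[0], gmem, sizeof(float) * TILE, pipe);
    pipe.producer_commit();

    for (int k = 0; k < num_tiles; ++k) {
        // Kick off the copy of tile k+1 before consuming tile k.
        if (k + 1 < num_tiles) {
            pipe.producer_acquire();
            cuda::memcpy_async(block, smem[(k + 1) % 2],
                               gmem + (size_t)(k + 1) * TILE,
                               sizeof(float) * TILE, pipe);
            pipe.producer_commit();
        }

        pipe.consumer_wait();                   // tile k has arrived in shared memory
        block.sync();
        // compute_on_tile(smem[k % 2]);        // placeholder: ldmatrix + mma here
        block.sync();
        pipe.consumer_release();
    }
}
```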

What you do could be a good idea. I am just stating the facts to consider.

Thanks!