Padding of mma operation

I use the fp16 mma m16n8k16 instruction for my computation; the kernel looks like the code below.
map contains integer indices and -1. When map[i] == -1, that index does not need to be computed (it is only there for padding, because mma needs its operands aligned to 8).

__global__ void f(int* map, ......) {
      __shared__ half result[...];
      half tmp_result[4];
      ...   // A and B (the mma A/B fragments) are loaded in the elided code
      unsigned *C = reinterpret_cast<unsigned *>(tmp_result);
      unsigned *D = reinterpret_cast<unsigned *>(tmp_result);

      // load the accumulator; padded lanes (map[i] == -1) just read index 0
      tmp_result[0] = result[max(map[0], 0)];
      tmp_result[1] = result[max(map[1], 0)];
      tmp_result[2] = result[max(map[2], 0)];
      tmp_result[3] = result[max(map[3], 0)];

      __asm__ __volatile__ (
          "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
          "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
          : "=r"(D[0]), "=r"(D[1])
          :
              "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
              "r"(B[0]), "r"(B[1]),
              "r"(C[0]), "r"(C[1])
      );

      // write back only the non-padded results
      if (map[0] != -1) { result[map[0]] = tmp_result[0]; }
      if (map[1] != -1) { result[map[1]] = tmp_result[1]; }
      if (map[2] != -1) { result[map[2]] = tmp_result[2]; }
      if (map[3] != -1) { result[map[3]] = tmp_result[3]; }
}

Is there a more efficient or elegant way to realize this code? I think using 0 as the default index is not a good way to avoid bank conflicts.

If you load from the same address there is no bank conflict; a conflict only occurs if you load from the same bank but from a different address.
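As a tiny illustration of that rule (assuming the usual 32 banks of 4-byte words and a launch with 64 threads, e.g. bankDemo<<<1, 64>>>(d_out); everything here is a made-up demo, not your kernel):

__global__ void bankDemo(float *out) {
    __shared__ float s[64];
    s[threadIdx.x] = threadIdx.x;
    __syncthreads();
    float a = s[0];                       // every lane reads the same address:
                                          // broadcast, no bank conflict
    float b = s[(threadIdx.x % 2) * 32];  // lanes read s[0] or s[32]: same bank,
                                          // two different addresses -> 2-way conflict
    out[threadIdx.x] = a + b;
}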

Thanks! So, is there another, better way to realize this padding? Like some API?

Also, if I want to use ldmatrix, will it be faster than doing the loads by hand in this case?

ldmatrix is useful if threads 0/4/8/12/…/28 and 1/5/9/13/…/29 and so on process short values that are stored consecutively in shared memory.

So the short for thread 4 is directly stored after the short for thread 0.

I do not know how your data is stored (or even the values of map), so it is difficult to say whether there is a better way.

In your case, reading individual short values (result is a half array with 2 bytes per element) does not seem to be the most efficient, as CUDA is optimized for 4-byte accesses (and also accepts 8-byte or 16-byte accesses); so try to use half2 or ldmatrix, if possible.
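For instance, when two neighbouring map entries happen to address two adjacent halves, a load pair can be merged. A minimal sketch, reusing result, map and tmp_result from the kernel above and assuming the pair is 4-byte aligned (an illustration, not your general case):

// Hedged sketch: only valid when map[0]/map[1] address adjacent halves
// and the address is 4-byte aligned; otherwise fall back to single loads.
if (map[0] >= 0 && map[1] == map[0] + 1 && (map[0] & 1) == 0) {
    half2 pair = *reinterpret_cast<const half2 *>(&result[map[0]]);   // one 4-byte read
    tmp_result[0] = __low2half(pair);
    tmp_result[1] = __high2half(pair);
} else {
    tmp_result[0] = result[max(map[0], 0)];                           // two 2-byte reads
    tmp_result[1] = result[max(map[1], 0)];
}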

If that is not possible, it depends on whether shared memory bandwidth is a bottleneck in your case. If it is, you can slightly reduce the load with a combination of reading 2 * 4 bytes from shared memory (combining two transactions), keeping 2 of those shorts, and shuffling the other 2 shorts in one transaction to another thread that needs them, while getting two shorts back at the same time. With that you replace reading 4 shorts (4 transactions) with reading 2 * 4 bytes (2 transactions) plus shuffling 4 bytes (1 transaction), i.e. 25% less shared memory usage, perhaps even less if you can read directly from global memory. The selection and recombination of the shorts before and after shuffling is often faster than shared memory, if shared memory is the bottleneck.
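A rough sketch of that idea under a made-up pairing, just to show the mechanics (assumptions, not your layout: thread t and thread t ^ 1 are partners, every 4-byte half2 a thread reads holds one half for itself in the low part and one half for its partner in the high part, and pairIdx0/pairIdx1 are hypothetical 4-byte-aligned half indices):

// Two 4-byte shared memory reads instead of four 2-byte reads.
half2 a = *reinterpret_cast<const half2 *>(&result[pairIdx0]);
half2 b = *reinterpret_cast<const half2 *>(&result[pairIdx1]);

// Keep the two halves this thread needs ...
tmp_result[0] = __low2half(a);
tmp_result[1] = __low2half(b);

// ... and exchange the two halves the partner needs in one 4-byte shuffle,
// receiving the partner's two halves for this thread at the same time.
half2 forPartner = __halves2half2(__high2half(a), __high2half(b));
unsigned sendVal = *reinterpret_cast<unsigned *>(&forPartner);
unsigned recvVal = __shfl_xor_sync(0xffffffffu, sendVal, 1);
half2 fromPartner = *reinterpret_cast<half2 *>(&recvVal);

tmp_result[2] = __low2half(fromPartner);
tmp_result[3] = __high2half(fromPartner);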

Unfortunately map is passed in from outside, so even though my data is stored in order, I cannot access it in order. That might cause bank conflicts because of the map array.
Wait, I don’t understand this sentence: “if threads 0/4/8/12/…/28 and 1/5/9/13/…/29 and so on process short values that are stored consecutively in shared memory”. Do you mean that if the data processed by those specific 8 threads is stored consecutively, ldmatrix can do much better than loading by hand?

Like, if threads 0/4/8/…/28’s data is stored in data[0]–data[7], threads 1/5/9/…/29’s data in data[128]–data[135], and threads 2/6/10/…/30’s data in data[256]–data[263], then ldmatrix can do very well?

I don’t clearly understand the ldmatrix (and also stmatrix) mechanism.

Yes, exactly, with the practical addition that each of those thread groups may load more than one data block (actually 2, 4 or 8 data blocks), whose indices do not have to relate to each other.

The thread groups with the same remainder when dividing by 4 make sense if you look at the data input requirements of the mma instruction, which assumes an 8x4 thread configuration.

Together with the possibility to store into shared memory in a certain layout and to swap rows or columns in the mma inputs and outputs, you have quite a lot of flexibility to optimize memory accesses.
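For reference, a minimal sketch of the x4 variant from inline PTX (an assumption-laden illustration, not your exact case: sm_75 or newer, CUDA 11 or newer for __cvta_generic_to_shared, and a hypothetical per-thread pointer smem_row to the 8-half row this thread contributes; each of the 32 threads supplies one such row address and receives four 32-bit registers in the mma fragment layout):

unsigned a0, a1, a2, a3;
// convert the generic pointer into a shared-state-space address for PTX
unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_row));
asm volatile(
    "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
    : "=r"(a0), "=r"(a1), "=r"(a2), "=r"(a3)
    : "r"(addr));
// a0..a3 could then be fed as the A fragment of the m16n8k16 mma above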

Is it possible to have bank conflicts when using ldmatrix? Like when different thread groups access the same bank?

Are the contents of map actually the same across all your threads? Probably not?

You get 4 results per thread and want to save them at arbitrary positions, and only conditionally.

Is there any regularity? E.g. that certain threads can only store at certain indices or that each mma instruction (in a possible loop) can only store at certain indices depending on the iteration?

Is the same map reused over several iterations?

The code snippet is just a simplified demo of my real code. I can restate my situation as the question below:
I need a 16 x 8 fp16 matrix for mma. I can ensure that each column’s 16 values are aligned and consecutive in shared memory. But between different columns I can ensure nothing.
For example, assume column 0 accesses data[0]–data[15]; will there be a bank conflict if another column accesses data[64]–data[79] (as they are on the same banks)?

I can ensure that each column accesses data[16 * k + 0] to data[16 * k + 15]. Is there any extra limitation on k per column for ldmatrix to be efficient?

Let’s assume the version of ldmatrix that reads 2 short values per thread.

You have 2 * 4 = 8 pointers to 16-byte blocks.

Two for each of the 4 groups of 8 threads.

Each pointer accesses 4 banks (2 * 8 bytes = 4 * 4 bytes = 16 bytes).
To avoid bank conflicts, the 8 pointers will need to access different groups of 4 banks.
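Applied to your data[16 * k + 0 .. 16 * k + 15] layout from above, a small worked check (assuming the usual 32 banks of 4 bytes and 2-byte halves; the helper function is just for illustration):

// A 16-byte block of 8 halves starting at half index i begins at byte
// offset 2*i, i.e. in bank (2*i / 4) % 32 = (i / 2) % 32, and covers that
// bank plus the next three.
__device__ int firstBankOfBlock(int halfIndex) {
    return (halfIndex / 2) % 32;
}
// A column starting at data[16*k] therefore begins in bank (8*k) % 32:
// k = 0 -> bank 0, k = 1 -> bank 8, k = 2 -> bank 16, k = 3 -> bank 24,
// k = 4 -> bank 0 again. So columns whose k values differ by a multiple of 4
// fall into the same groups of 4 banks and would conflict if read by
// different pointers of the same ldmatrix.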

From my past usage of ldmatrix I believe quite firmly (but I am not 100% sure, you have to test with Nsight Compute) that there is no restriction on which pointer points to which group of 4 banks.

It gets more complicated with the ldmatrix variant that loads 8 short values per thread and has 32 pointers, because it needs 4 shared memory accesses anyway. Can the pointers be arbitrarily aligned within the 128 bytes and the optimal access pattern still be found?

Generally, if you cannot ensure that the pointers point to different groups of 4 banks and still want to avoid bank conflicts, you have to combine several accesses (for different mma instructions) and reorder.

Say you have accesses that fall into different groups of 4 banks: A, B, C, D, E, F, G, H.

In the first ldmatrix call threads 0/4/8/… load any index within A and B (2 pointers), threads 1/5/9/… load any index within C and D, and so on.

In the second ldmatrix call, threads 0/4/8/… load any index within C and D, threads 1/5/9/… load any index within E and F, and so on.

Afterwards you have to reorder locally within the threads.

That could be simpler with the ldmatrix variant loading 8 short values per thread, if this reordering is done by the GPU automatically.

So can I interpret your answer like this: there will be bank conflicts unless the 8 pointers access 8 different groups of 4 banks?

And ldmatrix can do better than loading by hand because it can load more data per transaction if the data is consecutive?

Yes. I tried to show you a way to combine 8 ldmatrix operations and distribute the threads’ loads in a way that avoids the bank conflicts, by cycling which thread reads from which bank in which of the 8 ldmatrix instructions.

And the ldmatrix version using 32 pointers has a 4-way bank conflict built in (it reads 4 times as many values, after all); it could be that it distributes the reads more efficiently. If you additionally make sure that you read diverse data that is definitely in different banks (data you previously would have used separate ldmatrix instructions for), you can definitely avoid bank conflicts.

One cool feature of ldmatrix is that it assembles 2 short values stored at different locations in shared memory into one 32-bit unsigned integer usable for mma.

So it can do 32-, 64- or 128-bit accesses, but distributes the data with 16-bit granularity.

Doing it by hand, the memory access size is the data size each thread gets.

There are some data access configurations where ldmatrix is identical to doing it by hand, some where it is not possible to use ldmatrix, and some where ldmatrix is much better than doing it by hand.

I don’t totally understand the meaning of cycling. Can you give me an example?

In the following I show the general principle (without referring to ldmatrix or to the conflict-free groups of 4 banks; instead assume 32 manual 32-bit accesses):

You want to access values with arbitrary pointers from several threads.
As you cannot directly avoid bank conflicts, you go another way.
You put all the data for one mma instruction into the same bank.
You put different data into the second bank, and so on for all banks. ← This part is important; to avoid bank conflicts in the way shown, we use the fact that you also want to do separate mma instructions on different data, which we can store in different banks. We mix the data loading for all those mma instructions to distribute which thread reads from which bank.

Then, instead of just loading the data for one mma instruction, you load the data for 32 mma instructions from different banks in a loop.

Each thread loads from a different bank in each iteration.
Afterwards the data is re-sorted.

thread t:
for(int i = 0; i < 32; i++)
    variable[i] = load[bank (i+t) % 32, arbitrary index within bank];

As the index into variable does not depend on the thread ID, the loop can be unrolled (and variable can stay in registers).
But afterwards we have to re-sort the loaded values without referring to the thread index.

We would rather have liked to store into

variable[(i+t) % 32]

to use data from specific banks (over all threads) for each mma instruction.

We have shifted the problem from bank conflicts to a very difficult re-sort operation, with a different re-sort per thread, without being allowed to use thread-dependent indices on local variable arrays.
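A condensed CUDA sketch of that cycling loop, with all names and sizes hypothetical, just to make the indexing concrete:

// bankData is laid out so that bankData[j][b] always lives in bank b
// (32 banks of 4 bytes); idx[b] is the arbitrary row this thread wants
// from bank b, filled elsewhere.
constexpr int DEPTH = 8;                  // hypothetical depth per bank
__shared__ unsigned bankData[DEPTH][32];
int idx[32];                              // desired row per bank (hypothetical)
int t = threadIdx.x % 32;                 // lane id
unsigned loaded[32];

#pragma unroll
for (int i = 0; i < 32; i++) {
    int b = (i + t) % 32;                 // in iteration i every lane hits a
    loaded[i] = bankData[idx[b]][b];      // different bank -> no conflicts
}

// loaded[] is indexed only by i (thread-independent), so after unrolling it
// can stay in registers. The hard part is the re-sort: mma instruction m
// really wants the value that came from bank m, i.e. loaded[(m - t + 32) % 32],
// which is a thread-dependent index into a register array.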

It is not as bad with ldmatrix (specifically the version reading 8 shorts per thread). There we have at most a 2-way bank conflict, which we can either accept or avoid with a much simpler re-sort (2 iterations instead of 32), i.e. with simple selects.

So the 128-bit ldmatrix gets 8 pointers per group of threads 0/4/8/… and stores the result into four 32-bit unsigned integers per thread.

You just make sure that the 4 integers belong to fixed, different groups of 4 banks and use them for different mma instructions, e.g.

You read the first integer (2 shorts) from banks 0-7, the second integer from banks 8-15, …

So you get each of the 2 short values within each 32-bit integer from arbitrary indices, but you have to make sure that the integers themselves end up in different banks from one another.

And you use the integers for different mma instructions. See it as being able to run the same operation in parallel on different data, and you store that different data in different banks.