Padding of mma operation

I use the fp16 mma m16n8k16 instruction for my computation; the kernel looks like the code below.
map contains integer indices and -1. When map[i] == -1, that index does not need to be computed (it is only there for padding, because mma needs its operands aligned to 8).

__global__ void f(int* map, ......) {
      __shared__ half result[...];
      half tmp_result[4];
      ...   // A and B (the mma A/B fragments) are loaded in the elided code
      unsigned *C = reinterpret_cast<unsigned *>(tmp_result);
      unsigned *D = reinterpret_cast<unsigned *>(tmp_result);

      // load the accumulator; padded lanes (map[i] == -1) just read index 0
      tmp_result[0] = result[max(map[0], 0)];
      tmp_result[1] = result[max(map[1], 0)];
      tmp_result[2] = result[max(map[2], 0)];
      tmp_result[3] = result[max(map[3], 0)];

      __asm__ __volatile__ (
          "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
          "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
          : "=r"(D[0]), "=r"(D[1])
          :
              "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
              "r"(B[0]), "r"(B[1]),
              "r"(C[0]), "r"(C[1])
      );

      // write back only the non-padded results
      if (map[0] != -1) { result[map[0]] = tmp_result[0]; }
      if (map[1] != -1) { result[map[1]] = tmp_result[1]; }
      if (map[2] != -1) { result[map[2]] = tmp_result[2]; }
      if (map[3] != -1) { result[map[3]] = tmp_result[3]; }
}

Is there a more efficient or elegant way to realize this code? I think using 0 as the default index is not a good way to avoid bank conflicts.

If you load from the same address there is no bank conflict; a conflict only occurs if you load from the same bank but from a different address.
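As a tiny illustration of that rule (assuming the usual 32 banks of 4-byte words and a launch with 64 threads, e.g. bankDemo<<<1, 64>>>(d_out); everything here is a made-up demo, not your kernel):

__global__ void bankDemo(float *out) {
    __shared__ float s[64];
    s[threadIdx.x] = threadIdx.x;
    __syncthreads();
    float a = s[0];                       // every lane reads the same address:
                                          // broadcast, no bank conflict
    float b = s[(threadIdx.x % 2) * 32];  // lanes read s[0] or s[32]: same bank,
                                          // two different addresses -> 2-way conflict
    out[threadIdx.x] = a + b;
}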

Thanks! So, is there another, better way to realize this padding? Like some API?

Also, if I want to use ldmatrix, will it be faster than doing the loads by hand in this case?

ldmatrix is useful if threads 0/4/8/12/…/28 and 1/5/9/13/…/29 and so on process short values that are stored consecutively in shared memory.

So the short for thread 4 is directly stored after the short for thread 0.

I do not know how your data is stored (or even the values of map), so it is difficult to say whether there is a better way.

In your case, reading individual short values (result is a half array with 2 bytes per element) does not seem to be the most efficient, as CUDA is optimized for 4-byte accesses (and also accepts 8-byte or 16-byte accesses); so try to use half2 or ldmatrix, if possible.
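For instance, when two neighbouring map entries happen to address two adjacent halves, a load pair can be merged. A minimal sketch, reusing result, map and tmp_result from the kernel above and assuming the pair is 4-byte aligned (an illustration, not your general case):

// Hedged sketch: only valid when map[0]/map[1] address adjacent halves
// and the address is 4-byte aligned; otherwise fall back to single loads.
if (map[0] >= 0 && map[1] == map[0] + 1 && (map[0] & 1) == 0) {
    half2 pair = *reinterpret_cast<const half2 *>(&result[map[0]]);   // one 4-byte read
    tmp_result[0] = __low2half(pair);
    tmp_result[1] = __high2half(pair);
} else {
    tmp_result[0] = result[max(map[0], 0)];                           // two 2-byte reads
    tmp_result[1] = result[max(map[1], 0)];
}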

If that is not possible, it depends on whether shared memory bandwidth is a bottleneck in your case. If it is, you can slightly reduce the load with a combination of reading 2 * 4 bytes from shared memory (combining two transactions), keeping 2 of those shorts, and shuffling the other 2 shorts in one transaction to another thread that needs them, while getting two shorts back at the same time. With that you replace reading 4 shorts (4 transactions) with reading 2 * 4 bytes (2 transactions) plus shuffling 4 bytes (1 transaction), i.e. 25% less shared memory usage, perhaps even less if you can read directly from global memory. The selection and recombination of the shorts before and after shuffling is often faster than shared memory, if shared memory is the bottleneck.
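A rough sketch of that idea under a made-up pairing, just to show the mechanics (assumptions, not your layout: thread t and thread t ^ 1 are partners, every 4-byte half2 a thread reads holds one half for itself in the low part and one half for its partner in the high part, and pairIdx0/pairIdx1 are hypothetical 4-byte-aligned half indices):

// Two 4-byte shared memory reads instead of four 2-byte reads.
half2 a = *reinterpret_cast<const half2 *>(&result[pairIdx0]);
half2 b = *reinterpret_cast<const half2 *>(&result[pairIdx1]);

// Keep the two halves this thread needs ...
tmp_result[0] = __low2half(a);
tmp_result[1] = __low2half(b);

// ... and exchange the two halves the partner needs in one 4-byte shuffle,
// receiving the partner's two halves for this thread at the same time.
half2 forPartner = __halves2half2(__high2half(a), __high2half(b));
unsigned sendVal = *reinterpret_cast<unsigned *>(&forPartner);
unsigned recvVal = __shfl_xor_sync(0xffffffffu, sendVal, 1);
half2 fromPartner = *reinterpret_cast<half2 *>(&recvVal);

tmp_result[2] = __low2half(fromPartner);
tmp_result[3] = __high2half(fromPartner);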

Unfortunately map is passed in from outside, so even though my data is stored in order, I cannot access it in order. That might cause bank conflicts because of the map array.
Wait, I don’t understand this sentence: “if threads 0/4/8/12/…/28 and 1/5/9/13/…/29 and so on process short values that are stored consecutively in shared memory”. Do you mean that if the data processed by those specific 8 threads is stored consecutively, ldmatrix can do much better than loading by hand?

Like, if threads 0/4/8/…/28’s data is stored in data[0]–data[7], threads 1/5/9/…/29’s data in data[128]–data[135], and threads 2/6/10/…/30’s data in data[256]–data[263], then ldmatrix can do very well?

I don’t clearly understand the ldmatrix (and also stmatrix) mechanism.

Yes, exactly, with the practical addition that each of those thread groups may load more than one data block (actually 2, 4 or 8 data blocks), whose indices do not have to relate to each other.

The thread groups with the same remainder when dividing by 4 make sense if you look at the data input requirements of the mma instruction, which assumes an 8x4 thread configuration.

Together with the possibility to store into shared memory in a certain layout and to swap rows or columns in the mma inputs and outputs, you have quite a lot of flexibility to optimize memory accesses.
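For reference, a minimal sketch of the x4 variant from inline PTX (an assumption-laden illustration, not your exact case: sm_75 or newer, CUDA 11 or newer for __cvta_generic_to_shared, and a hypothetical per-thread pointer smem_row to the 8-half row this thread contributes; each of the 32 threads supplies one such row address and receives four 32-bit registers in the mma fragment layout):

unsigned a0, a1, a2, a3;
// convert the generic pointer into a shared-state-space address for PTX
unsigned addr = static_cast<unsigned>(__cvta_generic_to_shared(smem_row));
asm volatile(
    "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
    : "=r"(a0), "=r"(a1), "=r"(a2), "=r"(a3)
    : "r"(addr));
// a0..a3 could then be fed as the A fragment of the m16n8k16 mma above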

Is it possible to have bank conflicts when using ldmatrix? Like when different thread groups access the same bank?

Are the contents of map actually the same across all your threads? Probably not?

You get 4 results per thread and want to save them at arbitrary positions, and only conditionally.

Is there any regularity? E.g. that certain threads can only store at certain indices or that each mma instruction (in a possible loop) can only store at certain indices depending on the iteration?

Is the same map reused over several iterations?

The code snippet is just a simplified demo of my real code. I can restate my situation as the question below:
I need a 16 x 8 fp16 matrix for mma. I can ensure that each column’s 16 values are aligned and consecutive in shared memory. But between different columns I can ensure nothing.
For example, assume column 0 accesses data[0]–data[15]; will there be a bank conflict if another column accesses data[64]–data[79] (as they are on the same banks)?

I can ensure that each column accesses data[16 * k + 0] to data[16 * k + 15]. Is there any extra limitation on k per column for ldmatrix to be efficient?

Let’s assume the version of ldmatrix that reads 2 short values per thread.

You have 2 * 4 = 8 pointers to 16-byte blocks.

Two for each of the 4 groups of 8 threads.

Each pointer accesses 4 banks (2 * 8 bytes = 4 * 4 bytes = 16 bytes).
To avoid bank conflicts, the 8 pointers will need to access different groups of 4 banks.
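Applied to your data[16 * k + 0 .. 16 * k + 15] layout from above, a small worked check (assuming the usual 32 banks of 4 bytes and 2-byte halves; the helper function is just for illustration):

// A 16-byte block of 8 halves starting at half index i begins at byte
// offset 2*i, i.e. in bank (2*i / 4) % 32 = (i / 2) % 32, and covers that
// bank plus the next three.
__device__ int firstBankOfBlock(int halfIndex) {
    return (halfIndex / 2) % 32;
}
// A column starting at data[16*k] therefore begins in bank (8*k) % 32:
// k = 0 -> bank 0, k = 1 -> bank 8, k = 2 -> bank 16, k = 3 -> bank 24,
// k = 4 -> bank 0 again. So columns whose k values differ by a multiple of 4
// fall into the same groups of 4 banks and would conflict if read by
// different pointers of the same ldmatrix.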

From my past usage of ldmatrix I believe quite firmly (but I am not 100% sure, you have to test with Nsight Compute) that there is no restriction on which pointer points to which group of 4 banks.

It gets more complicated with the ldmatrix variant that loads 8 short values per thread and has 32 pointers, because it needs 4 shared memory accesses anyway. Can the pointers be arbitrarily aligned within the 128 bytes and the optimal access pattern still be found?

Generally, if you cannot ensure that the pointers point to different groups of 4 banks and still want to avoid bank conflicts, you have to combine several accesses (for different mma instructions) and reorder.

Say you have accesses that fall into different groups of 4 banks: A, B, C, D, E, F, G, H.

In the first ldmatrix call threads 0/4/8/… load any index within A and B (2 pointers), threads 1/5/9/… load any index within C and D, and so on.

In the second ldmatrix call, threads 0/4/8/… load any index within C and D, threads 1/5/9/… load any index within E and F, and so on.

Afterwards you have to reorder locally within the threads.

That could be simpler with the ldmatrix variant loading 8 short values per thread, if this reordering is done by the GPU automatically.

So can I interpret your answer like this: there will be bank conflicts unless the 8 pointers access 8 different groups of 4 banks?

And ldmatrix can do better than loading by hand because it can load more data per transaction if the data is consecutive?

Yes. I tried to show you a way to combine 8 ldmatrix operations and distribute the threads’ loads in a way that avoids the bank conflicts, by cycling which thread reads from which bank in which of the 8 ldmatrix instructions.

And the ldmatrix version using 32 pointers has a 4-way bank conflict built in (it reads 4 times as many values, after all); it could be that it distributes the reads more efficiently. If you additionally make sure that you read diverse data that is definitely in different banks (data you previously would have used separate ldmatrix instructions for), you can definitely avoid bank conflicts.

One cool feature of ldmatrix is that it assembles 2 short values stored at different locations in shared memory into one 32-bit unsigned integer usable for mma.

So it can do 32-, 64- or 128-bit accesses, but distributes the data with 16-bit granularity.

Doing it by hand, the memory access size is the data size each thread gets.

There are some data access configurations where ldmatrix is identical to doing it by hand, some where it is not possible to use ldmatrix, and some where ldmatrix is much better than doing it by hand.

I don’t totally understand the meaning of cycling. Can you give me an example?

In the following I show the general principle (without referring to ldmatrix or to the conflict-free groups of 4 banks; instead assume 32 manual 32-bit accesses):

You want to access values with arbitrary pointers from several threads.
As you cannot directly avoid bank conflicts, you go another way.
You put all the data for one mma instruction into the same bank.
You put different data into the second bank, and so on for all banks. ← This part is important; to avoid bank conflicts in the way shown, we use the fact that you also want to do separate mma instructions on different data, which we can store in different banks. We mix the data loading for all those mma instructions to distribute which thread reads from which bank.

Then, instead of just loading the data for one mma instruction, you load the data for 32 mma instructions from different banks in a loop.

Each thread loads from a different bank in each iteration.
Afterwards the data is re-sorted.

thread t:
for(int i = 0; i < 32; i++)
    variable[i] = load[bank (i+t) % 32, arbitrary index within bank];

As the index into variable does not depend on the thread ID, the loop can be unrolled (and variable can stay in registers).
But afterwards we have to re-sort the loaded values without referring to the thread index.

We would rather have liked to store into

variable[(i+t) % 32]

to use data from specific banks (over all threads) for each mma instruction.

We have shifted the problem from bank conflicts to a very difficult re-sort operation, with a different re-sort per thread, without being allowed to use thread-dependent indices on local variable arrays.
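A condensed CUDA sketch of that cycling loop, with all names and sizes hypothetical, just to make the indexing concrete:

// bankData is laid out so that bankData[j][b] always lives in bank b
// (32 banks of 4 bytes); idx[b] is the arbitrary row this thread wants
// from bank b, filled elsewhere.
constexpr int DEPTH = 8;                  // hypothetical depth per bank
__shared__ unsigned bankData[DEPTH][32];
int idx[32];                              // desired row per bank (hypothetical)
int t = threadIdx.x % 32;                 // lane id
unsigned loaded[32];

#pragma unroll
for (int i = 0; i < 32; i++) {
    int b = (i + t) % 32;                 // in iteration i every lane hits a
    loaded[i] = bankData[idx[b]][b];      // different bank -> no conflicts
}

// loaded[] is indexed only by i (thread-independent), so after unrolling it
// can stay in registers. The hard part is the re-sort: mma instruction m
// really wants the value that came from bank m, i.e. loaded[(m - t + 32) % 32],
// which is a thread-dependent index into a register array.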

It is not as bad with ldmatrix (specifically the version reading 8 shorts per thread). There we have at most a 2-way bank conflict, which we can either accept or avoid with a much simpler re-sort (2 iterations instead of 32), i.e. with simple selects.

So the 128-bit ldmatrix gets 8 pointers per group of threads 0/4/8/… and stores the result into four 32-bit unsigned integers per thread.

You just make sure that the 4 integers belong to fixed, different groups of 4 banks and use them for different mma instructions, e.g.

You read the first integer (2 shorts) from banks 0-7, the second integer from banks 8-15, …

So you get each of the 2 short values within each 32-bit integer from arbitrary indices, but you have to make sure that the integers themselves end up in different banks from one another.

And you use the integers for different mma instructions. See it as being able to run the same operation in parallel on different data, and you store that different data in different banks.