Are there any operations that should be avoided when using shfl.sync()? For example, is there anything like shared memory's bank conflicts, but for registers? If many threads need data from one particular thread, will that be much slower? (For example, thread i needing data from thread (i / 8), compared with thread i needing data from thread ((i + 1) % 32).)
Now I have 2 choices for a tensor core program. The first is to use shared memory to store the calculation result. The second is to use registers, but that needs 4 shfl.sync() per calculation. I don't know which will be quicker.
Rule of thumb is that shfl.sync is as fast as one shared memory access (either read or write).
So if you store from the threads into shared memory and read it back again before finally storing into global memory, that would be 2 accesses (when avoiding bank conflicts), which is less than 4 shfl.sync.
There are no performance restrictions; any lane can read from any other lane. (There are logical restrictions, when up- or down-shifting, or with a lane width != 32, on what is possible at all with the individual instructions, but those restrictions are more for convenience.)
shfl and shared memory are fast resources, which you can freely use as long as you do not overdo it, as they are shared between the 4 SM partitions. So another very rough rule of thumb is that you should use shuffle or shared memory for at most every 4th basic operation.
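For illustration, a minimal sketch (a hypothetical kernel, not from the original code; it assumes all 32 lanes of the warp are active, so the full 0xffffffff mask is valid) showing that the source lane of __shfl_sync can be any per-thread expression, and that both access patterns from the question are expected to cost the same:

    __global__ void shfl_patterns(int *out)
    {
        int lane = threadIdx.x & 31;        // lane id within the warp
        int val  = lane * 10;               // each lane contributes one value

        // Many lanes reading from the same source lane (lane / 8) ...
        int gathered = __shfl_sync(0xffffffff, val, lane / 8);
        // ... should cost the same as a simple rotation by one lane.
        int rotated  = __shfl_sync(0xffffffff, val, (lane + 1) & 31);

        out[threadIdx.x] = gathered + rotated;
    }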
So in my case, I should use shared memory instead of shfl.sync()? I used to believe that accessing smem is much slower than shfl.sync()…
In your case, the shuffle operations are probably not fully optimized yet.
If it is possible to write the result to shared memory once, why do you need 4 shuffle operations?
If only part of the 4 bytes needs to be written, then transfer two or four results instead of just one.
All in all, the difference is probably that the shuffle variant needs 50% fewer operations.
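As a hedged sketch of the packing idea (the function name is made up; the __half2 shuffle overload comes from cuda_fp16.h): instead of shuffling one 16-bit result at a time, pack two results into a single 32-bit value and move both with one __shfl_sync.

    #include <cuda_fp16.h>

    // Two 16-bit results travel in one 32-bit shuffle instead of two.
    __device__ __half2 exchange_two_results(__half r0, __half r1, int srcLane)
    {
        __half2 packed = __halves2half2(r0, r1);   // pack 2 x 16-bit into 32 bits
        return __shfl_sync(0xffffffff, packed, srcLane);
    }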
Actually, I need shfl.sync() because I have 64 idle registers per thread but limited smem, so I want to use registers to store the result of mma. However, due to the layout requirement in the PTX ISA, a thread needs to provide 4 values per calculation, which is infeasible. So I need shfl.sync() to deliver values between threads.
You are right. I just need to use a shfl operation once to load mma's C matrix, then use a shfl operation once to store mma's D matrix back.
Why is 4 values per calculation infeasible with 64 idle registers?
I need to calculate 128 pts of 16 out_channels per warp. Each thread has 64 idle registers. But in every calculation, each thread needs to provide 4 values for the C/D matrix: 2 pts of 2 out_channels. As the pts needed are totally random, I would need at least 2 * 128 = 256 registers per thread to make this work. That's why it is infeasible.
If something is random, you probably need to index.
Registers cannot be indexed (except manually by lots of switch case or if).
One exception for registers is that they can be selected by the number of the containing thread (when doing a shuffle). Each of the 32 threads holds one register, and during the shuffle you can choose which one to read.
Shared memory can be indexed, but you easily get bank conflicts, unless you copy each datum 32 times into shared memory (which takes time and space) so that each thread has its own bank holding all the data.
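A minimal sketch of that exception (hypothetical helper, assuming all 32 lanes participate): each lane keeps exactly one element in a register, and an indexed lookup becomes a shuffle whose source-lane operand is the index.

    // Lane i holds element i of a 32-entry "array" spread across the warp;
    // reading entry wanted_index is a single shuffle, no local memory needed.
    __device__ int warp_lookup(int my_element, int wanted_index)
    {
        return __shfl_sync(0xffffffff, my_element, wanted_index);
    }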
Can't I just write it like below?
__device__ void f() {
    int reg[64];
    reg[...] = __shfl_sync(0xffffffff, reg[...], ...);
}
where … means something calculated from threadIdx.x.
Or can I only do a shfl on a single register, like below?
__device__ void f() {
    int a;
    int b = __shfl_sync(0xffffffff, a, ...);
}
Both generally compile and work.
However, the first case would use CUDA local memory (for reading and for writing) and be quite slow.
You can usually only avoid that if the indices are known at compile time.
Sometimes loops work with the loop iterator as the index, when the loops are (automatically) unrolled (see #pragma unroll).
With some algorithms it is possible to use just one or a few variables per thread and use the thread number as the index when shuffling.
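To make the compile-time-index point concrete, a hedged sketch (an illustrative function, not your kernel): with a fixed trip count and #pragma unroll, every reg[i] resolves to a fixed register, whereas a data-dependent index would force reg[] into local memory.

    __device__ int sum_of_rotations(int data)
    {
        int reg[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            // After unrolling, i is a compile-time constant, so reg[i]
            // maps to a fixed register instead of spilling to local memory.
            reg[i] = __shfl_sync(0xffffffff, data, (threadIdx.x + i) & 31);

        int sum = 0;
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            sum += reg[i];
        return sum;
    }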
I don't understand this well. Does that mean that if every index can be known at compile time, the first case is as fast as the second? For example, if every index is just derived from the thread's threadIdx.x, will it work well?
Also, if the index is related to the input data, what makes the first case different from the second one? All 64 entries of reg are registers belonging to the thread; I can't see any difference between "reg[index]" and "a".
No, the index would have to be the same between all the threads.
The register file can only be indexed by the hardcoded register number within the instruction code.
For dynamic indices (even threadIdx) you (or the compiler) would either have to use different memory (shared, local, global) or use a chain of if or switch case.
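For completeness, a sketch of the switch-case alternative (a made-up helper): the dynamic index is turned into branches, so each access still uses a hardcoded register number and no local memory is touched, at the cost of branching.

    __device__ int select_reg(int r0, int r1, int r2, int r3, int idx)
    {
        switch (idx) {
            case 0:  return r0;
            case 1:  return r1;
            case 2:  return r2;
            default: return r3;   // idx == 3 (or anything else)
        }
    }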
Thanks! I get it. But I have encountered a new question now. As the index must be the same between all the threads, how can I restructure the following code to run faster?
__device__ void f() {
    // pointers A and B are already defined
    __half2 src;
    __half2 dst;
    __half2 result[2];
    // C and D both alias result, so the mma accumulates in place
    unsigned *C = reinterpret_cast<unsigned *>(&result);
    unsigned *D = reinterpret_cast<unsigned *>(&result);
    ... // some operation
    dst = __shfl_up_sync(0xffffffff, src, 8);
    result[0] = src;
    result[1] = dst;
    __asm__ __volatile__ (
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(D[0]), "=r"(D[1])
        :
          "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "r"(C[0]), "r"(C[1])
    );
    ...
}
I don't know whether the operations that load src & dst into result can be eliminated. Can I just use result[0] & result[1] in the shfl directly?
Normally the compiler and assembler do a good job of putting C, D, result, src and dst into registers and detecting if one is a copy of, or a reference to, another. So a bit of C++ boilerplate does not necessarily result in additional operations.
What is wrong with this code? Have you looked at Nsight Compute and/or the SASS translation?
From a stylistic point of view, I would recommend moving the mma operation into an inline device function to keep the C++ algorithm clean and easy to read.
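A sketch of that refactoring (the wrapper name is made up; the asm body is the one from your snippet, so it still needs an architecture that supports this mma shape):

    __device__ __forceinline__ void mma_m16n8k16_f16(unsigned d[2],
                                                     const unsigned a[4],
                                                     const unsigned b[2],
                                                     const unsigned c[2])
    {
        __asm__ __volatile__(
            "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
            "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
            : "=r"(d[0]), "=r"(d[1])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]),
              "r"(c[0]), "r"(c[1]));
    }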
Thanks! I get it.