Are there any operations that should be avoided when using shfl.sync()? For example, is there anything like shared memory's bank conflicts, but for registers? If many threads need data from one particular thread, will that be much slower? (For example, thread i needing data from thread (i / 8), compared with thread i needing data from thread ((i + 1) % 32).)
Now I have 2 choices for a tensor core program. The first is to use shared memory to store the calculation result. The second is to use registers, but that needs 4 shfl.sync() per calculation. I don't know which will be quicker.
Rule of thumb is that shfl.sync is as fast as one shared memory access (either read or write).
So if you store from the threads into shared memory and read it back again before finally storing into global memory, that would be 2 accesses (when avoiding bank conflicts), which is less than 4 shfl.sync.
There are no performance restrictions; any lane can read from any other lane. (There are logical restrictions, when up- or down-shifting, or with a lane width != 32, on what is possible at all with the individual instructions, but those restrictions are more for convenience.)
shfl and shared memory are fast resources, which you can freely use as long as you do not overdo it, as they are shared between the 4 SM partitions. So another very rough rule of thumb is that you should use shuffle or shared memory for at most every 4th basic operation.
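For illustration, a minimal sketch (a hypothetical kernel, not from the original code; it assumes all 32 lanes of the warp are active, so the full 0xffffffff mask is valid) showing that the source lane of __shfl_sync can be any per-thread expression, and that both access patterns from the question are expected to cost the same:

    __global__ void shfl_patterns(int *out)
    {
        int lane = threadIdx.x & 31;        // lane id within the warp
        int val  = lane * 10;               // each lane contributes one value

        // Many lanes reading from the same source lane (lane / 8) ...
        int gathered = __shfl_sync(0xffffffff, val, lane / 8);
        // ... should cost the same as a simple rotation by one lane.
        int rotated  = __shfl_sync(0xffffffff, val, (lane + 1) & 31);

        out[threadIdx.x] = gathered + rotated;
    }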
So in my case, I should use shared memory instead of shfl.sync()? I used to believe that accessing smem is much slower than shfl.sync()…
In your case, the shuffle operations are probably not fully optimized yet.
If it is possible to write the result to shared memory once, why do you need 4 shuffle operations?
If only part of the 4 bytes needs to be written, then transfer two or four results instead of just one.
All in all, the difference is probably that the shuffle variant needs 50% fewer operations.
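As a hedged sketch of the packing idea (the function name is made up; the __half2 shuffle overload comes from cuda_fp16.h): instead of shuffling one 16-bit result at a time, pack two results into a single 32-bit value and move both with one __shfl_sync.

    #include <cuda_fp16.h>

    // Two 16-bit results travel in one 32-bit shuffle instead of two.
    __device__ __half2 exchange_two_results(__half r0, __half r1, int srcLane)
    {
        __half2 packed = __halves2half2(r0, r1);   // pack 2 x 16-bit into 32 bits
        return __shfl_sync(0xffffffff, packed, srcLane);
    }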
Actually, I need shfl.sync() because I have 64 idle registers per thread but limited smem, so I want to use registers to store the result of mma. However, due to the layout requirement in the PTX ISA, a thread needs to provide 4 values per calculation, which is infeasible. So I need shfl.sync() to deliver values between threads.
You are right. I just need to use a shfl operation once to load mma's C matrix, then use a shfl operation once to store mma's D matrix back.
Why is 4 values per calculation infeasible with 64 idle registers?
I need to calculate 128 pts of 16 out_channels per warp. Each thread has 64 idle registers. But in every calculation, each thread needs to provide 4 values for the C/D matrix: 2 pts of 2 out_channels. As the pts needed are totally random, I would need at least 2 * 128 = 256 registers per thread to make this work. That's why it is infeasible.
If something is random, you probably need to index.
Registers cannot be indexed (except manually by lots of switch case or if).
One exception for registers is that they can be selected by the number of the containing thread (when doing a shuffle). Each of the 32 threads holds one register, and during the shuffle you can choose which one to read.
Shared memory can be indexed, but you easily get bank conflicts, unless you copy each datum 32 times into shared memory (which takes time and space) so that each thread has its own bank holding all the data.
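A minimal sketch of that exception (hypothetical helper, assuming all 32 lanes participate): each lane keeps exactly one element in a register, and an indexed lookup becomes a shuffle whose source-lane operand is the index.

    // Lane i holds element i of a 32-entry "array" spread across the warp;
    // reading entry wanted_index is a single shuffle, no local memory needed.
    __device__ int warp_lookup(int my_element, int wanted_index)
    {
        return __shfl_sync(0xffffffff, my_element, wanted_index);
    }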
Can't I just write it like below?
__device__ void f() {
    int reg[64];
    reg[...] = __shfl_sync(0xffffffff, reg[...], ...);
}
where … means something calculated from threadIdx.x.
Or can I only do a shfl on a single register, like below?
__device__ void f() {
    int a;
    int b = __shfl_sync(0xffffffff, a, ...);
}
Both generally compile and work.
However, the first case would use CUDA local memory (for reading and for writing) and be quite slow.
You can usually only avoid that if the indices are known at compile time.
Sometimes loops work with the loop iterator as the index, when the loops are (automatically) unrolled (see #pragma unroll).
With some algorithms it is possible to use just one or a few variables per thread and use the thread number as the index when shuffling.
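To make the compile-time-index point concrete, a hedged sketch (an illustrative function, not your kernel): with a fixed trip count and #pragma unroll, every reg[i] resolves to a fixed register, whereas a data-dependent index would force reg[] into local memory.

    __device__ int sum_of_rotations(int data)
    {
        int reg[4];
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            // After unrolling, i is a compile-time constant, so reg[i]
            // maps to a fixed register instead of spilling to local memory.
            reg[i] = __shfl_sync(0xffffffff, data, (threadIdx.x + i) & 31);

        int sum = 0;
        #pragma unroll
        for (int i = 0; i < 4; ++i)
            sum += reg[i];
        return sum;
    }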
I don't understand this well. Does that mean that if every index can be known at compile time, the first case is as fast as the second? For example, if every index is just derived from the thread's threadIdx.x, will it work well?
Also, if the index is related to the input data, what makes the first case different from the second one? All 64 entries of reg are registers belonging to the thread; I can't see any difference between "reg[index]" and "a".
No, the index would have to be the same between all the threads.
The register file can only be indexed by the hardcoded register number within the instruction code.
For dynamic indices (even threadIdx) you (or the compiler) would either have to use different memory (shared, local, global) or use a chain of if or switch case.
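For completeness, a sketch of the switch-case alternative (a made-up helper): the dynamic index is turned into branches, so each access still uses a hardcoded register number and no local memory is touched, at the cost of branching.

    __device__ int select_reg(int r0, int r1, int r2, int r3, int idx)
    {
        switch (idx) {
            case 0:  return r0;
            case 1:  return r1;
            case 2:  return r2;
            default: return r3;   // idx == 3 (or anything else)
        }
    }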
Thanks! I get it. But I have encountered a new question now. As the index must be the same between all the threads, how can I restructure the following code to run faster?
__device__ void f() {
    // pointers A and B are already defined
    __half2 src;
    __half2 dst;
    __half2 result[2];
    // C and D both alias result, so the mma accumulates in place
    unsigned *C = reinterpret_cast<unsigned *>(&result);
    unsigned *D = reinterpret_cast<unsigned *>(&result);
    ... // some operation
    dst = __shfl_up_sync(0xffffffff, src, 8);
    result[0] = src;
    result[1] = dst;
    __asm__ __volatile__ (
        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
        : "=r"(D[0]), "=r"(D[1])
        :
          "r"(A[0]), "r"(A[1]), "r"(A[2]), "r"(A[3]),
          "r"(B[0]), "r"(B[1]),
          "r"(C[0]), "r"(C[1])
    );
    ...
}
I don't know whether the operations that load src & dst into result can be eliminated. Can I just use result[0] & result[1] in the shfl directly?
Normally the compiler and assembler do a good job of putting C, D, result, src and dst into registers and detecting if one is a copy of, or a reference to, another. So a bit of C++ boilerplate does not necessarily result in additional operations.
What is wrong with this code? Have you looked at Nsight Compute and/or the SASS translation?
From a stylistic point of view, I would recommend moving the mma operation into an inline device function to keep the C++ algorithm clean and easy to read.
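A sketch of that refactoring (the wrapper name is made up; the asm body is the one from your snippet, so it still needs an architecture that supports this mma shape):

    __device__ __forceinline__ void mma_m16n8k16_f16(unsigned d[2],
                                                     const unsigned a[4],
                                                     const unsigned b[2],
                                                     const unsigned c[2])
    {
        __asm__ __volatile__(
            "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
            "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
            : "=r"(d[0]), "=r"(d[1])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]),
              "r"(c[0]), "r"(c[1]));
    }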
Thanks! I get it.