Back to SIMD

Hi. I have code with low divergence (just a few if’s) and several complex functions that return the same value for all threads in the current iteration (in the next iteration the value will be different). I need to call those functions once per step and share the value among the threads of the block. On a SIMD architecture this is easy, but §4.1 “SIMT Architecture” of the Programming Guide says:

Starting with the NVIDIA Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp.

I suppose I have to use §7.26 “Asynchronous Barrier” for synchronization, but won’t it be too costly? If I am sure that a SIMD flow is better for my project, may I compile it (perhaps with flags) so that it runs as SIMD? Or is the barrier faster in CUDA?

PS: I want extreme performance, so I need to do everything possible to maximize throughput.

UPD: I’ve found §7.19 “Warp Vote Functions”, which is probably on topic. In my opinion, something like SIMD “critical sections” would be very nice in the future.

Hi dem,
you can always insert __syncwarp() at the positions in your program where you want to make sure that old data has been written and can be read by other threads of the warp (but not between warps). It is quite cheap. Normally (e.g. without conditionals) the threads do not diverge and it is more or less a no-op (it still has to be read and processed).

There are several ways to combine and exchange data between different threads:

  • warp shuffle
  • warp vote (which you mentioned)
  • shared memory (works also between different warps within a block)
  • global memory (works also between different blocks)

The synchronization is either implicit in the instructions (they often carry a _sync suffix), or it has to be done with synchronization primitives, barriers, …, or with atomic operations.
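For the original question (one expensive result per iteration, shared by all threads of a warp), a minimal sketch using a warp shuffle could look like the following; expensive() is a hypothetical placeholder, and a fully active warp (mask 0xFFFFFFFF) is assumed:

__device__ float expensive(int step);    // hypothetical per-iteration function

__global__ void kernel(int steps)
{
    for (int step = 0; step < steps; ++step) {
        float v = 0.0f;
        if ((threadIdx.x & 31) == 0)      // only lane 0 of each warp computes
            v = expensive(step);
        // broadcast lane 0's value to all 32 lanes; the synchronization is implicit
        v = __shfl_sync(0xFFFFFFFF, v, 0);
        // ... the rest of the iteration uses v ...
    }
}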

Dear Curefab, I’m a professional in parallel CS, though a novice in CUDA & accelerators. Now I love to study and “hack” them to gain some experience… I found a very interesting instruction:
coalesced_group active = coalesced_threads();
It (probably) allows gathering all the active threads, and maybe we can add some more synchronization to wait for the others if there are not enough of them. But after we collect the threads, we must be sure that they will not diverge for some time (i.e. until they pass our block of instructions). I do not see a construct that assures zero divergence, and at one “happy” moment this code may become unstable, because the threads have their own instruction counters.

Thank you for the overview.

Recently I’ve written a paper where I give definitions of synchronization primitives, atomic operations and so on. SIMD is “continuous synchronization” as opposed to “event synchronization”, and it is almost not synchronization at all – it has no overhead. Atomic instructions are not synchronization either. SIMD is very promising in this respect; it would be a pity if this architecture were overshadowed for the sake of CPU compatibility.

Any construct that guarantees no divergence within a code block would be significant. Maybe at the level of a separate function…

Hi,
thank you for this perspective and better insights into different kinds of synchronization.
I’m a practitioner with years of Cuda experience, but less from a parallel CS theoretical point of view.

As far as I understand, there is no way to guarantee that threads within a warp will not diverge, at least not at the C++ or PTX level. And if there were a guarantee at the SASS level, it would not be a useful one, as SASS is not publicly documented by Nvidia.

Code that relies on the set returned by coalesced_threads() staying the same for some time afterwards (without any conditionals) may work in practice, but would be UB in a strict sense.

Nevertheless, coalesced_threads() is (maybe) useful in generic code, iterative code or subroutines, where you can use the initially active threads for on-the-fly parallelism, i.e. a parallel algorithm uses those threads and syncs them afterwards.

#include <cooperative_groups.h>
using namespace cooperative_groups;

// generic function ("do" is a C++ keyword, so it is called doWork here)
__device__ void doWork(int a)
{
    // find out which threads are active and somehow let them cooperate,
    // e.g. cooperatively load data in a coalesced way,
    // and afterwards each thread processes parts of the data
    coalesced_group active = coalesced_threads();
}

__global__ void kernel()
{
    if (threadIdx.x < 16)
        doWork(1);
    else
        doWork(2);
}

I think the official way (and how I currently handle it) is to not rely on non-divergence at all for correctness, but only rely on it for performance.

I would not be so sure that the change was made solely for CPU compatibility; perhaps it was for internal design reasons/flexibility, or to make it possible to better accelerate algorithms with non-optimal source code. E.g. there has been a trend over several recent years to improve the L1 cache and loosen the coalescing requirements.
It is a balance between keeping the SMs simple (to dedicate as much die area as possible to computation), delivering data fast enough (area for caches; with Ada Lovelace Nvidia increased the L2 by a lot), and optimizing compute usage/occupancy for a wide array of better or worse optimized CUDA algorithms.

So again, there is no possibility to guarantee no-divergence, even locally.
You would assume no-divergence for performance reasons.
But wherever it involves program correctness, you insert synchronization instructions.

  • Synchronization is implicit in the voting and shuffle instructions.
  • Warp-wide synchronization with __syncwarp() is cheap (as long as the threads have not diverged anyway) and is only needed for memory operations (shared or global).

=> So there should not be a slow-down compared to SIMD.

There are only a few scenarios where you would actually need lots of synchronization.

The old warp-wide reduction code using shared memory comes to mind:

unsigned tid = threadIdx.x;
int v = shmem[tid];

v += shmem[tid+16]; __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+8];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+4];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+2];  __syncwarp();
shmem[tid] = v;     __syncwarp();
v += shmem[tid+1];  __syncwarp();
shmem[tid] = v;

(source: taken from “Using CUDA Warp-Level Primitives” on the NVIDIA Technical Blog (listing 8), found with a quick Google search for an example)

However, exactly this code with shared memory usage can be nicely replaced by warp shuffle instructions with their implicit synchronization.
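As a sketch (assuming a full warp and the full mask), the shuffle-based equivalent of the reduction above could look like this:

int v = shmem[tid];
// each step folds the values of the upper lanes into the lower lanes;
// the complete sum ends up in lane 0
for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xFFFFFFFF, v, offset);

Unlike the shared-memory version, only lane 0 holds the full sum afterwards; if all lanes need it, broadcast it with one more __shfl_sync().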

So any remaining needs for __syncwarp() will be few and far between (e.g. at the end of each iteration of random read/write accesses into shared memory, or after cooperatively loading data from global to shared memory).
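For instance, a sketch of the cooperative-load case: one warp stages a 32-element tile from global to shared memory (g_data and base are hypothetical names):

__shared__ int tile[32];

// each lane of the warp loads one element from global memory
tile[threadIdx.x] = g_data[base + threadIdx.x];
__syncwarp();   // the tile is now complete and visible to all lanes of the warp
// ... every lane may now read any element of tile ...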

It depends on the “thinking” and experience of programmers. A perfect example – synchronizing every line… I think there is a mistake somewhere in the approach – yesterday I had a quick look at OpenCL 3.0; there are barriers and reduce (fork-join) operators there too… I am developing an idea of reducing synchronization, but localizing synchronization into small instructions (as NVidia does) does not help. Maybe it happens because people think of atomic operations as such “small synchronized instructions” – which is not true. Synchronization means a thread waiting for agreement on some event, and it may last an infinite time (in theory). Atomic operations take a fixed number of ticks and should not be considered synchronization. The same goes for MOV – it is atomic only because of the parallel bus; otherwise two MOVs might shuffle bits in the result.

For example, CAS is atomic, but CAS with a “succeeded”/“failed” result becomes synchronization, because programmers build an infinite retry loop around it. Any synchronizing operation costs power.

Right now I’m developing a cache with zero synchronization. It is possible to devise a sequence of instructions such that applying them simultaneously does not break consistency. No synchronization at all. (So I need to check all the permutation variants, and I use an NVidia accelerator for that.)

While I am moving towards no synchronization, NVidia wants to synchronize everything, so I just want to tell somebody that this direction may be wrong…

A SIMD architecture would always be synchronous. SIMT is like a mixture; Nvidia does not synchronize everything, but one can manually choose when.

Are you talking about the difference between synchronization at certain points vs. running in lockstep?

SIMD “continuous synchronization” is free of charge – like the pendulums of clocks. No power is needed to keep them synchronous. I like the term “conjugation to hardware”. They are synchronous not with each other, but each with the hardware (because of the clock frequency). I hesitate to call it “synchronization” at all – at least there are no locks.

By the way, a paradox: in another relative frame of reference (Einstein) they will not be “synchronous”.

(And it’s very easy for NVidia to keep the threads’ instruction counters together now, but with time and new standards it may become impossible.)

The threads within a warp are also mostly convergent.
If they diverge, it is a huge performance hit. That is avoided as much as possible.

The threads cannot independently issue instructions.
(At least arithmetic) instructions are always issued for all 32 threads; inactive threads occupy the execution unit, but do not use it. So it is full SIMD.
Even with the Independent Thread Scheduling, there is just an active mask with flags, showing which thread is active at the current instruction.
If some threads are at another program location, they have to wait until this other program location is activated.

Nvidia took the guarantee of synchronous behaviour out of Cuda C++. Now programmers have to be explicit if they need synchronization. Not sure if that was necessary. But the hardware still shows SIMD behaviour.
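A small sketch to illustrate the active mask (the lane grouping shown in the comments is what you would typically observe, not a guarantee):

#include <cstdio>

__global__ void show_mask()
{
    unsigned mask;
    if (threadIdx.x & 1) {
        mask = __activemask();   // typically 0xAAAAAAAA here: the odd lanes
    } else {
        mask = __activemask();   // typically 0x55555555 here: the even lanes
    }
    if (threadIdx.x < 2) printf("lane %u saw mask %08x\n", threadIdx.x, mask);
}

Note that __activemask() only reports which lanes happen to be converged at that instruction; it does not create or guarantee any grouping.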

I’ve already run my permutations in threads (the same algorithm); at first they go together instruction by instruction, but then they diverge. I could add some sync instructions, but where exactly?.. Maybe they are still synchronous within a warp – it will take time to test that.

Tomorrow there may be millions of threads, and there will always be idle ones, so my approach of collecting idle threads with no synchronization and putting them all on a new task will work perfectly. (They keep silent, so I will too :)

The threads typically diverge at if/else or other conditional constructs. The ? ternary operator keeps the threads a bit more convergent.

Use Nsight Compute to see whether they diverge.

One can use a synchronization function such as __syncwarp() after each conditional block. You definitely have to use synchronization functions if you read or write shared or global memory and expect the data to be visible to other threads. It is not only about synchronization, but also about what the compiler is allowed to assume.

E.g.

my_shared[0] = 4;
[...]
int a = my_shared[0];

Can the compiler assume that the value definitely is 4 and skip the (write and) read? Or may another thread have changed the shared memory in the meantime? (In this example, volatile also helps / is necessary.)
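A minimal sketch of the same point with explicit __syncwarp(), assuming the exchange stays within a single, fully active warp:

__shared__ int my_shared[32];

my_shared[threadIdx.x] = 4;           // this lane writes its slot
__syncwarp();                         // make the write visible to the other lanes
int a = my_shared[threadIdx.x ^ 1];   // read the neighbouring lane's value
__syncwarp();                         // before anyone overwrites it again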

The divergence we are talking about is only relevant within warps of 32 threads, not across millions of threads. Beyond the warp you only synchronize if you exchange data.

I sense a misunderstanding here – this does not agree with “own instruction counter per thread” and “independent threads”. Either I must study it a bit more, or find something to read that counters your view.

By the way, I see another, easier solution if things are not as you say – let the device, in SIMD mode, insert “sync” commands whenever threads diverge by half a cycle.

About the compiler – it would be nice if it vectorized functions instead of me :) Right now I am porting my C++ program to the accelerator and it works “as is”, but I have to rewrite every function so that it is performed by several threads – it is 64 times faster in my case.

There are compilers which vectorize for different architectures (e.g. with OpenMP).

The Cuda system keeps an instruction counter per thread, but it can only switch between groups of threads (groups within the 32 threads of one warp) that share the same instruction-counter value. So it cannot execute different locations of the program simultaneously.

But if there is a

if (threadIdx.x & 2) {
    doThis();
} else {
    doThat();
}

it can jump back and forth between the if and the else block (e.g. if doThis() and doThat() comprise multiple instructions) until the threads hopefully reconverge at the closing bracket.

Okay, give me some more time to learn the details…
But since you have helped me a lot, I will leave here the right understanding of an atomic operation. Let us add an operation of cyclic shift between 3 arguments. It looks like MOV, so the hardware realization is not so difficult. Then we could immediately add nodes to a linked list (and a tree – i.e. highly scalable DBs) with zero concurrency overhead:

newbie -> head -> newbie.next

compare with CAS:

newbie.next := head
CAS(head := newbie | head == newbie.next)
infinite loop (while)
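For reference, a minimal sketch of that conventional CAS retry loop in CUDA, using atomicCAS on an index-based list (the Node layout and push() are hypothetical):

struct Node { int value; int next; };   // next = -1 marks the end of the list

__device__ void push(Node* nodes, int* head, int newbie)
{
    int old_head = *head;
    do {
        nodes[newbie].next = old_head;               // newbie.next := head
        // atomicCAS returns the value it found at *head; retry until it
        // matches our snapshot, i.e. until the swap actually happened
        old_head = atomicCAS(head, old_head, newbie);
    } while (nodes[newbie].next != old_head);        // the "infinite loop (while)"
}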

Any infinite loop is a lock – i.e. synchronization. To make INC atomic, it is probably necessary to have “active memory” that can perform some operations itself, or something like that.
(And I am looking for such new operations, but few people are interested in them yet.)

So, you are RIGHT. It seems my RTX 3050 makes an effort to reconverge the threads of a warp. I have this test (compiled with the sm60 arch option); the threads split into two groups, the first takes more processor cycles, but in the end both groups reconverge with each other:

__device__ static inline constexpr uint test(uint val) {
    return val >> 1;
}
__global__ void testsync() {
    uint counter(3u * 1024 * 1024 * 1024), f(0), m0(0), m1(0), m2(0);
    do {
        if (threadIdx.x & 1) {
            m0 += (__ballot(1) != 0xFFFFFFFF);
            m0 = test(m0 * 2);
        } else {
            m1 += (__ballot(1) != 0xFFFFFFFF);
        }
        m2 += (__ballot(1) != 0xFFFFFFFF);
        f = test(++counter);
    } while (f);
    if (threadIdx.x < 2) printf("%u %u: %u %u %u\n", counter, f, m0, m1, m2);
}
testsync<<<1*9, 256>>>();

Output:

0 0: 0 1073741824 0
0 0: 1073741824 0 0
...

Right now I think the danger comes only from the side of the standards. NVidia already writes in the guide that the flows may diverge, but they do not. That would break the SIMD architecture. I think it stems from a misunderstanding of the small differences between “simultaneous” and “synchronous”, the same as between “atomic” and “lockable”. The physical principles are COMPLETELY DIFFERENT. And we must not lose the best ideas.

I would prefer that they provide a SIMD “critical section” which guarantees that the flows are simultaneous (i.e. instruction-to-instruction correspondence). And bring back the operators WITHOUT “_sync”, such as __shfl, __shfl_up, __shfl_down and __shfl_xor, and __any, __all, __ballot.

Because in a SIMD environment they are not “synchronized” but simply “simultaneous” instructions. This is correct methodologically, terminologically and ideologically. Let programmers not be confused.

CUDA is not SIMD. It is SIMT.

explicit synchronization is required in cases where threads cooperate

There is no way to force SIMD without using the methodology covered in the blog. The methodology of compiling for cc6.0 (whatever it may be) will disappear when cc6.0 support is dropped.

The way to request changes to CUDA or CUDA documentation is to file a bug.


The current way also gives Nvidia more flexibility for future architectures. Even in cases where threads currently do not diverge, programmers have to mark every synchronization point. That keeps programs correct even when run on different architectures.

I do not understand why you do not like the _sync variants better – is it just because of the name? They show exactly the SIMD behaviour you want or expect for any operation. And if the threads are diverged at that point, they converge.

Maybe it is not obvious. The SIMD “critical section” gives even more flexibility – the threads are free to diverge outside of it, nobody gets lost… I will try to describe it in the ticket. I follow the “no synchronization” approach because it means scalability.

Maybe “over-granularity” helps to see it. If we split a (usual) critical section into small critical “atomic” operations (in fact small critical sections), we just end up with several small critical sections. And lose performance.

But this is the way to discover new, purely atomic instructions – tighten the critical sections as much as possible, but without endless loops. A pure atomic operation eliminates the corresponding critical section if it can be implemented in hardware (like the parallel bus and MOV).

Sorry, I’ll be late with the tickets – right now I’m busy with my CUDA program. In the twenties of November I must present results at the local NSCF’24 conference… Maybe before that I’ll send a paper to IEEE or somewhere else – about CAS with a result, the common practice of calling it “atomic”, and how to break that trend…

As for sync – I keep in mind the physical model of a resonator’s quality factor (and power dissipation), but I am still not able to fit it to this situation. I will think about it a bit later.

The “magic sequence” of the lock-free cache is below, but it could be better. I would like to use only MOVs. (The locks here are only to emulate a CAS instruction without a result.)

    // "null" below stands for a sentinel "no result" value
    volatile int share[4] = {-100, -110, -120, -130};

    void put(int uid) {
        int local[4];
        local[0] = share[0];
        spin.lock(); local[3] = ++share[3]; spin.unlock();
        local[1] = share[1];
        spin.lock(); if (share[0] == local[0]) share[0] = uid; spin.unlock();
        if (share[0] != uid) return;
        spin.lock(); if (share[1] == local[1]) share[1] = uid; spin.unlock();
        if (share[1] != uid) return;
        share[2] = local[3];
    }

    int get(int uid) {
        int local[4];
        local[2] = share[2];
        if (share[1] != uid) return null;
        local[0] = share[0];
        if (share[3] != local[2]) return null;
        return local[0];
    }

Is share within __shared__?
Those are quite a lot of accesses.
Is ++share[3] an atomic access, or does it not need to be atomic?
Is uid the thread-specific thread id, or an id for which line in the program has done the access?

(Each separate instruction there is atomic, but there is no synchronization among them – any delays and any interleaving of the threads’ instructions are possible. It’s C++ code with ordinary threads – get() returns either the result or null.)