Independent thread scheduling not working

yo.
im trying to learn how independent thread scheduling works.
i have simple kernel that runs heavy or light task based on warp lane.
if warp lane is odd then we do three global memory read/write and if it’s even we do nothing.

__global__ void mb_its(int *mem, struct timing *timing)
{
        int idx = threadIdx.x;
        int lane = idx % 32;

        timing[lane].start = clock();
        if (lane % 2)
        {
                if (idx == 1) printf("%d in\n", idx);
                timing[lane].inloop = clock();
                if (idx == 1) printf("%d mem1\n", idx);
                mem[idx] = mem[idx] + idx;
                if (idx == 1) printf("%d mem2\n", idx);
                mem[idx] = mem[idx] + idx;
                if (idx == 1) printf("%d mem3\n", idx);
                mem[idx] = mem[idx] + idx;
                if (idx == 1) printf("%d out\n", idx);
                timing[lane].inloopend = clock();
        }
        else
        {
                if (idx == 0) printf("%d in\n", idx);
                timing[lane].inloop = clock();
                if (idx == 0) printf("%d out\n", idx);
                timing[lane].inloopend = clock();
        }
        timing[lane].outloop = clock();
}

my expectation here is NOT to see prints of thread 0 and then thread 1 (or vice versa) in order. what i have understand from ITS says prints must be like this:
1 in
0 in
1 mem1
0 out

but its like this:
1 in
1 mem1
1 mem2
1 mem3
1 out
0 in
0 out

why??

compiling with:

nvcc main.cu

also tried:

nvcc main.cu -rdc=true

running on GTX 1650 TI (CC 7.5)
nvcc version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

When people want to have the benefit if ITS, its usually suggested that they compile for a GPU architecture where ITS is available, that would be cc7.0 or higher. On recent CUDA toolkits (in your example, where you have provided no arch specification at compile time) you may be compiling for an architecture like cc5.2

But even if you address that, I don’t think your expectations are valid. ITS does not guarantee to advance one warp fragment by some amount, then another warp fragment by some amount, in the absence of any other conditions. Your expected sequence shows an alternation between 0 and 1 based, I guess, on C++ lines of code? I’m not really sure what your logic is, but there is nothing in the code that enforces that, and certainly the executing machine has no knowledge of C++ code structure.

The ITS warp scheduler is allowed to push a given warp fragment forward at least until there is reason not to, e.g. a barrier, a need for synchronization, or a need for intra-warp communication (your code shows nothing like that). Stated differently, there is no specification for how “far” a warp fragment will proceed/be scheduled, even with ITS. What ITS does is enable the possibility for warp fragment execution, which enables the other possibilities. And it seems self-evident to me that this is achieved by cooperation between the compiler and the machine itself.

CUDA, including ITS, provides no ordering of thread execution, except that which you explicitly enforce. ITS allows for intra-warp cooperation by enabling the possibility of warp fragment progress, which might not have happened on previous architectures. ITS doesn’t guarantee a specific fragment execution order.

Here is a variation of your code that would “force” the alternating warp-fragment behavior that ITS makes possible:

__global__ void mb_its(int *mem, struct timing *timing)
{
        int idx = threadIdx.x;
        int lane = idx % 32;

        timing[lane].start = clock();
        if (lane % 2)
        {
                if (idx == 1) printf("%d in\n", idx);
                timing[lane].inloop = clock();
                __shfl_sync(0x3, mem[idx], idx^1);
                if (idx == 1) printf("%d out\n", idx);
                timing[lane].inloopend = clock();
        }
        else
        {
                if (idx == 0) printf("%d in\n", idx);
                timing[lane].inloop = clock();
                __shfl_sync(0x3, mem[idx], idx^1);
                if (idx == 0) printf("%d out\n", idx);
                timing[lane].inloopend = clock();
        }
        timing[lane].outloop = clock();
}

The usage of __shfl_sync() as I have it here would be illegal coding on pre-volta architectures. It is permissible on Volta+, however.

1 Like