Warp-synchronous programming in the presence of warp divergence

Hi,

I wrote the following experimental kernel which I launch using 1 block and 32 threads. The kernel takes 5 inputs:

    2 arrays (A and B) of 32 integers each, initialized to all 0

    3 integers (b0, b1, and b2) in the range [0,31]

The output of the kernel should be:

A[i] = 0     for       i < b0

A[i] = tid   for b0 <= i < b1

A[i] = -tid  for b1 <= i < b2

A[i] = 0     for b2 <= i

and

B[i] = 0           for       i < b0

B[i] = A[tid + 1]  for b0 <= i < b2

B[i] = 0           for b2 <= i

Here’s the kernel:

__global__ void kernel(int *A, int *B, int b0, int b1, int b2)
{
	int tid = threadIdx.x;
	int x;

	if (tid < b1)
	{
		if (tid < b0) return;
		x = tid;
	} else {
		if (tid >= b2) return;
		x = -tid;
	}

	// warp-synchronous programming
	A[tid] = x;
	B[tid] = A[tid + 1];
}
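
For completeness, the host side looks roughly like this (a minimal sketch of the setup described above, assuming the kernel definition sits in the same .cu file; the b0/b1/b2 values are just example bounds, and error checking is omitted):

#include <cstdio>

int main()
{
	const int N = 32;
	int *dA, *dB;
	int hA[N], hB[N];

	cudaMalloc((void **)&dA, N * sizeof(int));
	cudaMalloc((void **)&dB, N * sizeof(int));
	cudaMemset(dA, 0, N * sizeof(int));	// both arrays start as all 0
	cudaMemset(dB, 0, N * sizeof(int));

	kernel<<<1, 32>>>(dA, dB, 4, 16, 28);	// 1 block, 32 threads, example bounds

	cudaMemcpy(hA, dA, N * sizeof(int), cudaMemcpyDeviceToHost);
	cudaMemcpy(hB, dB, N * sizeof(int), cudaMemcpyDeviceToHost);

	for (int i = 0; i < N; i++)
		printf("%2d: A = %d, B = %d\n", i, hA[i], hB[i]);

	cudaFree(dA);
	cudaFree(dB);
	return 0;
}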

Since the GPU executes the branches serially (the threads only reconverge at the immediate post-dominator, which happens to be at the end of the kernel) and I am using warp-synchronous programming, I was expecting a race condition. However, when I run this kernel on an NVIDIA S1070, the results are correct. I tried changing B[tid] = A[tid + 1] to B[tid] = A[tid - 1], but the results are the same.

Any thoughts why the kernel doesn’t fail?

Thank you.

Threads in the same warp are synchronised.

Warps reconverge at the earliest opportunity. Why do you think they would only reconverge at the end of the kernel?

Because of the return statement, I guess. If you consider return as equivalent to “goto end_of_function”, then the immediate post-dominator is indeed at the end of the function.

For the sm_10 target, nvcc (4.0 RC2) generates:

code for sm_10

		Function : _Z6kernelPiS_iii

	/*0000*/     /*0xa000000104000780*/ 	I2I.U32.U16 R0, R0L;

	/*0008*/     /*0x3000d3fd6c20c7c8*/ 	ISET.S32.C0 o [0x7f], g [0x9], R0, LE;

	/*0010*/     /*0xa000b00300000000*/ 	SSY 0x58;

	/*0018*/     /*0x1000800300000280*/ 	BRA C0.NE, 0x40;

	/*0020*/     /*0x3000d1fd6c20c7c8*/ 	ISET.S32.C0 o [0x7f], g [0x8], R0, LE;

	/*0028*/     /*0x3000000300000500*/ 	RET C0.EQU;

	/*0030*/     /*0x100000050403c780*/ 	MOV R1, R0;

	/*0038*/     /*0x1000b00300000780*/ 	BRA 0x58;

	/*0040*/     /*0x3000d5fd6c2107c8*/ 	ISET.S32.C0 o [0x7f], g [0xa], R0, GT;

	/*0048*/     /*0x3000000300000500*/ 	RET C0.EQU;

	/*0050*/     /*0x3000800500000003*/ 	IADD32I R1, -R0, 0x0;

	/*0058*/     /*0x30020001c4100782*/ 	SHL.S R0, R0, 0x2;

	/*0060*/     /*0x2000c80904200780*/ 	IADD R2, g [0x4], R0;

	/*0068*/     /*0xd00e0405a0c00780*/ 	GST.U32 global14 [R2], R1;

	/*0070*/     /*0x2004840900000003*/ 	IADD32I R2, R2, 0x4;

	/*0078*/     /*0x2000cc0504200780*/ 	IADD R1, g [0x6], R0;

	/*0080*/     /*0xd00e040180c00780*/ 	GLD.U32 R0, global14 [R2];

	/*0088*/     /*0xd00e0201a0c00781*/ 	GST.U32 global14 [R1], R0;

		.................................

So the convergence point (set by the SSY instruction) is at instruction 0x58, or just after the if-then-else block.

On the other hand, for the sm_20 target, we get:

code for sm_20

		Function : _Z6kernelPiS_iii

	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/     /*0x84009c042c000000*/ 	S2R R2, SR_Tid_X;

	/*0010*/     /*0xd021dc231b0e4000*/ 	ISETP.GE.AND P0, pt, R2, c [0x0] [0x34], pt;

	/*0018*/     /*0x800001e740000000*/ 	@P0 BRA 0x40;

	/*0020*/     /*0xc021dc231b0e4000*/ 	ISETP.GE.AND P0, pt, R2, c [0x0] [0x30], pt;

	/*0028*/     /*0x000021e780000000*/ 	@!P0 EXIT;

	/*0030*/     /*0x08001de428000000*/ 	MOV R0, R2;

	/*0038*/     /*0x60001de740000000*/ 	BRA 0x58;

	/*0040*/     /*0xe021dc23188e4000*/ 	ISETP.LT.AND P0, pt, R2, c [0x0] [0x38], pt;

	/*0048*/     /*0x000021e780000000*/ 	@!P0 EXIT;

	/*0050*/     /*0x09201f841c000000*/ 	I2I.S32.S32 R0, -R2;

	/*0058*/     /*0x10021de218000000*/ 	MOV32I R8, 0x4;

	/*0060*/     /*0x10211ce35000c000*/ 	IMUL.HI R4, R2, 0x4;

	/*0068*/     /*0x80219ca320118000*/ 	IMAD R6.CC, R2, R8, c [0x0] [0x20];

	/*0070*/     /*0x9041dc4348004000*/ 	IADD.X R7, R4, c [0x0] [0x24];

	/*0078*/     /*0xa0221ca320118000*/ 	IMAD R8.CC, R2, R8, c [0x0] [0x28];

	/*0080*/     /*0x1060dc8584000000*/ 	LD.E R3, [R6+0x4];

	/*0088*/     /*0xb0425c4348004000*/ 	IADD.X R9, R4, c [0x0] [0x2c];

	/*0090*/     /*0x00601c8594000000*/ 	ST.E [R6], R0;

	/*0098*/     /*0x0080dc8594000000*/ 	ST.E [R8], R3;

	/*00a0*/     /*0x00001de780000000*/ 	EXIT;

		.................................

This time, it seems there is no convergence point until the end of the kernel. This example should fail (as expected) on Fermi.

Hi Sylvain,

Thanks for the response. I tried my kernel on a GTX 480 (Fermi) card and indeed it fails.

Out of curiosity… how do you modify the predicate mask and the stack when you execute a SSY/BRA/RET instruction? It seems to me that the RET instruction doesn’t modify the mask but has some other mechanism to turn off the returning threads.

I'll take that back. I forgot to add the volatile qualifier to A and B. Now the code works on both cards :(
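
For reference, here is the kernel with the volatile qualifier added (just a sketch of that one change; the body is unchanged):

__global__ void kernel(volatile int *A, volatile int *B, int b0, int b1, int b2)
{
	int tid = threadIdx.x;
	int x;

	if (tid < b1)
	{
		if (tid < b0) return;
		x = tid;
	} else {
		if (tid >= b2) return;
		x = -tid;
	}

	// volatile keeps the compiler from caching values in registers,
	// so the A[tid] store and the A[tid + 1] load really go through memory
	A[tid] = x;
	B[tid] = A[tid + 1];
}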

Any other ideas?

Thank you.

In all NVIDIA GPUs, reconvergence happens at the immediate post-dominator of the divergent branch. As Sylvain says, the return statements nested inside the if statement create unstructured control flow that pushes the post-dominator down to the exit point.

In Tesla, I think that the return instruction disables the threads immediately, so they never need to reconverge. Fermi cannot do this for return instructions (in device functions) because the return may only exit the current function, rather than disable the thread. The compiler may use a general approach for both device functions and kernel entry points, or the hardware may have changed.

EDIT: Sorry I should have probably read the entire thread before posting.

Tesla does support call/return instructions for nested procedure calls. Early versions of CUDA used to inline everything, but we can now declare functions as noinline (some functions in the math library are, I think). For instance, Parboil-FFT compiled for Tesla contains procedure calls.
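
For instance (a small sketch with made-up function names; __noinline__ is the actual qualifier I mean by noinline):

// __noinline__ asks the compiler to keep this as a real CALL/RET
// instead of inlining the body into the caller
__device__ __noinline__ int twice(int v)
{
	return 2 * v;
}

__global__ void use_twice(int *out)
{
	out[threadIdx.x] = twice(threadIdx.x);
}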

I originally thought that control-flow instructions in Tesla were behaving as described in US Patent 7353369. It is mostly the case, but the return and break instructions seem to act a bit differently: when predicated by a condition they can also disable threads as Greg describes.

Interestingly, US Patent 7877585 (which seems to match more or less Fermi’s implementation, supporting indirect calls) describes a “disable mask” which does exactly that.

I still fail to understand why the example works on Fermi. There is no indication in the assembly code that threads should reconverge at address 0x58, so how would the hardware know?

Rod, could you try writing the result of clock() into, say, B[tid]? And also writing clock() instead of tid in both branches of the if. This should allow us to tell which branch is executed first, and where it reconverges…
Also, please make sure to check with cuobjdump that the assembly is what we expect.

Roddomi is using a compute capability 1.x device - that’s why I stopped short of checking 2.x disassembly after seeing that on 1.x the reconvergence point is set directly after the conditional. I guess someone with a Fermi device needs to try this.

These are the results using clock().
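
The modified kernel looks like this (a sketch of the change; the cuobjdump output below shows what the compiler actually produced):

__global__ void warpsync(volatile int *A, volatile int *B, int b0, int b1, int b2)
{
	int tid = threadIdx.x;
	int x;

	if (tid < b1)
	{
		if (tid < b0) return;
		x = (int)clock();	// timestamp the "then" branch
	} else {
		if (tid >= b2) return;
		x = (int)clock();	// timestamp the "else" branch
	}

	A[tid] = x;
	B[tid] = (int)clock();	// timestamp the code after the branches
}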

For S1070 (Tesla) with nvcc 3.2/cuobjdump 4.0 (‘nvcc -keep warpsync.cu’):

A[i] = 0       for       i < b0

A[i] = 244576  for b0 <= i < b1

A[i] = 244468  for b1 <= i < b2

A[i] = 0       for b2 <= i

and

B[i] = 0       for       i < b0

B[i] = 244770  for b0 <= i < b2

B[i] = 0       for b2 <= i

and the cuobjdump output:

code for sm_10

                Function : _Z8warpsyncPViS0_iii

        /*0000*/     /*0xa000000504000780*/     I2I.U32.U16 R1, R0L;

        /*0008*/     /*0x3001d3fd6c20c7c8*/     ISET.S32.C0 o [0x7f], g [0x9], R1, LE;

        /*0010*/     /*0xa000d00300000000*/     SSY 0x68;

        /*0018*/     /*0x1000900300000280*/     BRA C0.NE, 0x48;

        /*0020*/     /*0x3001d1fd6c20c7c8*/     ISET.S32.C0 o [0x7f], g [0x8], R1, LE;

        /*0028*/     /*0x3000000300000500*/     RET C0.EQU;

        /*0030*/     /*0x0000000160004780*/     S2R R0, SR1;

        /*0038*/     /*0x30010001c4100780*/     SHL R0, R0, 0x1;

        /*0040*/     /*0x1000d00300000780*/     BRA 0x68;

        /*0048*/     /*0x3001d5fd6c2107c8*/     ISET.S32.C0 o [0x7f], g [0xa], R1, GT;

        /*0050*/     /*0x3000000300000500*/     RET C0.EQU;

        /*0058*/     /*0x0000000160004780*/     S2R R0, SR1;

        /*0060*/     /*0x30010001c4100780*/     SHL R0, R0, 0x1;

        /*0068*/     /*0x30020205c4100782*/     SHL.S R1, R1, 0x2;

        /*0070*/     /*0x2000c80904204780*/     IADD R2, g [0x4], R1;

        /*0078*/     /*0xd00e0401a0c00780*/     GST.U32 global14 [R2], R0;

        /*0080*/     /*0x0000000160004780*/     S2R R0, SR1;

        /*0088*/     /*0x30010001c4100780*/     SHL R0, R0, 0x1;

        /*0090*/     /*0x2000cc0504204780*/     IADD R1, g [0x6], R1;

        /*0098*/     /*0xd00e0201a0c00781*/     GST.U32 global14 [R1], R0;

For GTX 480 (Fermi) with nvcc 4.0 (‘nvcc -keep -arch sm_20 warpsync.cu’):

A[i] = 0         for       i < b0

A[i] = 83798724  for b0 <= i < b1

A[i] = 83798444  for b1 <= i < b2

A[i] = 0         for b2 <= i

and

B[i] = 0         for       i < b0

B[i] = 83798874  for b0 <= i < b1

B[i] = 83798562  for b1 <= i < b2

B[i] = 0         for b2 <= i

and the cuobjdump output:

code for sm_20

                Function : _Z8warpsyncPViS0_iii

        /*0000*/     /*0x00005de428004404*/     MOV R1, c [0x1] [0x100];

        /*0008*/     /*0x84009c042c000000*/     S2R R2, SR_Tid_X;

        /*0010*/     /*0xd021dc231b0e4000*/     ISETP.GE.AND P0, pt, R2, c [0x0] [0x34], pt;

        /*0018*/     /*0xa00001e740000000*/     @P0 BRA 0x48;

        /*0020*/     /*0xc021dc231b0e4000*/     ISETP.GE.AND P0, pt, R2, c [0x0] [0x30], pt;

        /*0028*/     /*0x000021e780000000*/     @!P0 EXIT;

        /*0030*/     /*0x40001c042c000001*/     S2R R0, SR_ClockLo;

        /*0038*/     /*0x04001e036000c000*/     SHL R0, R0, 0x1;

        /*0040*/     /*0x80001de740000000*/     BRA 0x68;

        /*0048*/     /*0xe021dc23188e4000*/     ISETP.LT.AND P0, pt, R2, c [0x0] [0x38], pt;

        /*0050*/     /*0x000021e780000000*/     @!P0 EXIT;

        /*0058*/     /*0x40001c042c000001*/     S2R R0, SR_ClockLo;

        /*0060*/     /*0x04001e036000c000*/     SHL R0, R0, 0x1;

        /*0068*/     /*0x0820de036000c000*/     SHL R3, R2, 0x2;

        /*0070*/     /*0x10209ce35000c000*/     IMUL.HI R2, R2, 0x4;

        /*0078*/     /*0x80311c0348014000*/     IADD R4.CC, R3, c [0x0] [0x20];

        /*0080*/     /*0x90215c4348004000*/     IADD.X R5, R2, c [0x0] [0x24];

        /*0088*/     /*0x00401f8594000000*/     ST.E.WT [R4], R0;

        /*0090*/     /*0x40001c042c000001*/     S2R R0, SR_ClockLo;

        /*0098*/     /*0x04001e036000c000*/     SHL R0, R0, 0x1;

        /*00a0*/     /*0xa0311c0348014000*/     IADD R4.CC, R3, c [0x0] [0x28];

        /*00a8*/     /*0xb0215c4348004000*/     IADD.X R5, R2, c [0x0] [0x2c];

        /*00b0*/     /*0x00401f8594000000*/     ST.E.WT [R4], R0;

        /*00b8*/     /*0x00001de780000000*/     EXIT;

The threads don't seem to reconverge on Fermi. This is expected given the SASS dump. However, it's still intriguing why it doesn't fail with the tids.

I think that it makes sense for this particular example. All of the writes to A happen before the corresponding reads, so you may just be getting lucky. Although you mentioned that this also succeeded when you changed the plus to a minus, so either the scheduling order gets flipped somehow and it works by accident, or something else is going on here…

Just a quick update: I found out that the code above does fail on Fermi, just in the other direction (with B[tid] = A[tid - 1]). I was checking both directions on sm_10 but forgot to do the same for sm_20. Now it makes more sense. Thank you.