Kernel exit does not flush writes if atomics are held. Expected behaviour?

The application below prints “-123” on a TESLA C1060. The expected answer is 123.

Un-commenting either the __threadfence() or the unlock code results in correct behaviour.

Is this expected behaviour? A kernel exit should just flush everything, no?

#include <stdio.h>

__device__ int result = -123;
__device__ int g_lock = -1;

__global__ void checkKernel(void)
{
	if (atomicCAS(&g_lock, -1, (int)threadIdx.x) == -1)
	{
		result = 123;
		/* __threadfence(); */
		/* atomicExch(&g_lock, -1); */
	}
	__syncthreads();
	return;
}

int main()
{
	cudaError_t err;
	int host = -1;

	checkKernel<<< 1, 32 >>>();
	err = cudaThreadSynchronize();
	if (err != 0)
	{
		printf("Error launching kernel, err = %d\n", err);
		return -1;
	}

	cudaMemcpyFromSymbol(&host, "result", sizeof(int), 0, cudaMemcpyDeviceToHost);
	printf("host = %d\n", host);
	return 0;
}

Any explanation for this behaviour?

I’m sure the CUDA compiler changes the way it compiles “result = 123” into assembly when you enable those two lines in the IF path. Sometimes the problem is caused by an over-smart optimization during compilation.

I think you need to examine the resulting PTX code each time you find something strange like this, and gather all such cases and forward them to tmurray :)

Point taken, CVN.

So, here is the PTX that corresponds to the buggy behaviour:

	mov.u32 	%r1, g_lock;
	mov.s32 	%r2, -1;
	cvt.s32.u16 	%r3, %tid.x;
	atom.global.cas.b32 	%rv1, [%r1], %r2, %r3;
	mov.s32 	%r4, %rv1;
	.loc	14	6	0
	mov.s32 	%r5, 123;
	ld.global.s32 	%r6, [result]; 	// id:6 result+0x0
	mov.s32 	%r7, -1;
	setp.eq.s32 	%p1, %r4, %r7;
	selp.s32 	%r8, %r5, %r6, %p1;
	st.global.s32 	[result], %r8; 	// id:6 result+0x0
	.loc	14	15	0
	bar.sync 	0;
	.loc	14	17	0
	exit;

Look at the way “selp” and the following “st” are executed by all threads. So there are 32 writes to that memory location instead of just one, I think… and hence the behaviour.

This one looks buggy to me.

I’d like to thank you, Sarnath, for the last couple of posts. I found them very interesting, and your desire to find out exactly what’s going on and understand it is admirable.

I wish nVidia could supply some official answers here or in the programming guide.

Anyway thanks for those interesting posts… :)

eyal

Thank you, Eyal! I have had excellent support from many friends out here, notably CVNguyen, PDan and of course Tim.

Hi Sarnath,

I think this is not a bug. The problem you raised here, as well as the other issues we have discussed recently, all relate to this IF form:

if (atomicCAS(...........) == ...)

This looks like a trap for the CUDA compiler. It cannot recognize the possibility of warp divergence, so maximum optimization is applied regardless of the effects of warp divergence.

I think supporting such a compiler directive before IF/FOR/WHILE… would solve those issues easily:

#pragma warp_may_diverge

if (....)

#pragma warp_may_diverge

while (...)

CVN,

Warps can diverge. That is not an issue at all. The above code expects only one of the threads to write 123 out there… but in the resulting PTX code, all threads are writing something to that memory location, which is NOT what I want.

And I don’t get the desired result either.

And the code does NOT depend on sub-warp scheduling (which was associated with the deadlock-like problems we faced before).

So I really think this is an issue. Not sure why you think otherwise.

I do not really understand PTX code. “st” stands for store, but what is “selp”?

Note, however, that inside the if statement you have only one instruction, so it is likely to be resolved through a predicate rather than a branch instruction. So although it seems you have 32 threads accessing the memory, there might actually be only one.

From the Programming Guide:

“selp” selects an operand… for the next instruction (in this context, it selects between r5 (123) and r6 (-123)).
It is the equivalent of the C ternary operator ( (a > b) ? a : b ).
In this context it is used as ( predicate == true ? r5 : r6 ).

The quote from the pg is about “predicated” execution.
In this case, the predicate is just used by the “selp” instruction to select between r5 and r6 for all the threads in question.

Note that the “st” (store) is NOT predicated. So all threads execute ST(ore).

The thread that got the lock will store “123” and all other threads store “-123”.

All threads are writing to same global memory location – one of them will succeed (according to the pg) and that is NOT what the “C” code intends to do.

Thanks :)
In that case, that PTX code is nonsense compared to the C.
Why is it actually loading the value of ‘result’ into a register in the first place!?

That’s right. The load is required to supply the operand for the stores of all the unnecessary threads.
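
To make that concrete, here is roughly what the emitted PTX does, written back as C. This is just my restatement; the local names are mine and simply mirror the registers above. Every one of the 32 threads executes all four statements:

	int old   = atomicCAS(&g_lock, -1, (int)threadIdx.x);  /* atom.global.cas.b32 */
	int tmp   = result;                                    /* ld.global.s32  %r6  */
	int value = (old == -1) ? 123 : tmp;                   /* setp.eq.s32 + selp.s32 */
	result    = value;                                     /* st.global.s32: not predicated, so all 32 threads store */

The one thread that won the lock stores 123, the other 31 store back the stale -123 they loaded, and which of those writes ends up in memory is undefined, which is why the host can see -123.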

There are two ways to compile your C code:

/* result is initially -123 */
if (!(......))
	goto Skipped;
result = 123;
Skipped: ;

or

result = (.......) ? 123 : result;

The latter is done with SELP, which is faster than the former. However, the latter is not warp-divergence-friendly. I think the compiler assumes that the warp cannot diverge on if (atomicCAS(…) == -1), so it just chooses the faster way. If you used an ‘if’ condition that always gives the same result in all threads (e.g. if (g_lock == -1)), the latter would work flawlessly.

In the previous issues that we have discussed so far, the story is similar. The compiler assumes that there is no warp divergence, so it keeps optimizing regardless of the side effects under warp divergence.

The problem is that the compiler is not smart enough to recognize all the possible warp-divergence segments (they are context-dependent) in the C code. I think there may be no perfect solution to this issue, so I recommend that NVIDIA support a compiler directive such as #pragma warp_may_diverge. The C programmers would be responsible for putting this pragma on certain code snippets as desired, and that would ease the design of the CUDA compiler.
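
To make the suggestion concrete, this is how the (purely hypothetical, non-existent) directive might look when applied to the kernel in this thread; it is only an illustration of the proposal, not something any compiler accepts today:

	#pragma warp_may_diverge
	if (atomicCAS(&g_lock, -1, (int)threadIdx.x) == -1)
	{
		result = 123;
	}
	__syncthreads();

The pragma would simply tell the compiler: threads of a warp may take different paths here, so do not if-convert the body into a select plus an unconditional store.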

CVN,

The compiler can optimize in any way it likes. All it had to do was “predicate” the store instruction. That’s all.
It is not doing that, and it is not yielding the correct, expected output.

This has got nothing to do with the deadlock problems where the code was relying on warp scheduling… There is no such thing in the code above.

It is just a plain bug.
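
For what it is worth, here is a minimal, untested sketch of one possible source-level dodge (my own suggestion, not anything NVIDIA has confirmed): route the store through an atomic, which the compiler cannot speculate or turn into an unconditional select-and-store:

	__device__ int result = -123;
	__device__ int g_lock = -1;

	__global__ void checkKernel(void)
	{
		if (atomicCAS(&g_lock, -1, (int)threadIdx.x) == -1)
		{
			/* atomicExch is a real read-modify-write with a side effect,
			   so only the thread that actually enters the branch performs it */
			atomicExch(&result, 123);
		}
		__syncthreads();
	}

Whether that is acceptable obviously depends on how hot this code path is; it is only meant to sidestep the code generation shown above.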

Tim,

Can you look into this one as well?

This is also related to atomics and not-so-correct code generation.

Maybe this is in some way related to the smem atomics issue (although this one uses gmem atomics).

Best Regards,
Sarnath