Conditional assignment to a global: bug or intended behaviour?

Simple question. What is the expected behaviour of the following code:

__syncthreads();
if (threadIdx.x == 0)
    global_variable = some_value;
__syncthreads();

Possible answers:

a) global_variable should be equal to some_value, provided it is not changed elsewhere (e.g. in another block)

b) global_variable may or may not be assigned some_value, depending on the scheduling of particular threads within a warp

c) Answer (b) used to be the case with some old CUDA compilers, but it has since been corrected.

or something else I am not aware of…

So, what is the case?

global_variable will be assigned N times, where N is the number of blocks you launch (assuming one-dimensional blocks).

Depending on what you need, maybe you meant this:

__syncthreads();
if (blockIdx.x * blockDim.x + threadIdx.x == 0) {
    // DO SOMETHING
    global_variable = some_value;
}
__syncthreads();

I understand my tiny code is not necessarily something you would want in a program. It is more a question about GPU behaviour.
For simplicity, let's assume there is only one block launched and it is one-dimensional (blockDim.y == blockDim.z == 1). Will this result in exactly one assignment to global_variable, and will it be equal to some_value? In other words, is answer (a) valid?

The reason I am asking is that I have code which does not follow this behaviour, and I want to understand whether that is intended or not…

The answer is (a). But __syncthreads() is only a barrier at block level, and you also don't know in which order blocks will be executed. So when a thread from some other block reads that variable, it may still see the old value. After __syncthreads() the new value is visible to all threads in the same block, and after the kernel has terminated it is visible globally.
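To illustrate the scope of that guarantee, here is a small sketch of my own (the kernel names `writer` and `reader` are made up, not from the post): a value written by one kernel launch is reliably visible to the next launch, whereas within a single grid another block may or may not see it.

__device__ int global_variable;

// Launched with a single thread, so there is no intra-kernel race at all.
__global__ void writer(int some_value) {
    global_variable = some_value;
}

__global__ void reader(int *out) {
    // Safe: the writer kernel has fully completed before this launch begins,
    // which is the one reliable device-wide synchronization point.
    *out = global_variable;
}

// host side:
//   writer<<<1, 1>>>(42);
//   reader<<<1, 1>>>(devResult);   // guaranteed to read 42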

Well, he did write "provided it is not changed elsewhere (e.g. in another block)", so you shouldn't assume that it changes…

// If using a grid of one-dimensional blocks, use an array with N elements instead:
global_variable[threadIdx.x + blockIdx.x * blockDim.x] = some_value;

Anyway, if I understand you correctly, I would go with (a).

OK, try this program:

#include <stdio.h>

__device__ int gval;

__global__ void clear() {
  gval = -5;
}

__global__ void theAnswer(int *result, int param) {
  if (threadIdx.x == 0)    // (**)
    gval = param;
  __syncthreads();
  if (threadIdx.x == 0)
    *result = gval;
}

int main() {
  int hostResult;
  int *devResult;
  cudaMalloc((void**)&devResult, sizeof(int));
  clear<<<1,1>>>();
  theAnswer<<<1,32>>>(devResult, 42);
  cudaMemcpy(&hostResult, devResult, sizeof(int), cudaMemcpyDeviceToHost);
  printf("result=%d\n", hostResult);
  return 0;
}

On my two computers (one with a GTX 260 and CUDA 2.3, the other with a GT 9600M and CUDA 2.0) I get the same result: -5. Why is that?

When I remove line (**), I get the correct result!

Can you reproduce it?

Yes, I get the same answer… I'm not sure that way of using a global variable is supported.

If you do this:

__global__ void theAnswer(int *result, int param) {
    int gval = -5;           // local (per-thread) variable instead of __device__
    if (threadIdx.x == 0)    // (**)
        gval = param;
    __syncthreads();
    if (threadIdx.x == 0)
        *result = gval;
}

you will of course get the expected behaviour. I will check the manual later for what is supported and what is not…
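As a side note, a __shared__ variable also gives the expected behaviour within a block, since __syncthreads() makes the single write visible to all of the block's threads. A minimal sketch of my own, mirroring the kernel above (the name theAnswerShared is made up):

__global__ void theAnswerShared(int *result, int param) {
    __shared__ int sval;    // one copy per block, visible to all its threads
    if (threadIdx.x == 0)
        sval = param;       // exactly one writer, so no race
    __syncthreads();        // publishes sval to the whole block
    if (threadIdx.x == 0)
        *result = sval;     // reads param (42) as expected
}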

If that is the case, when exactly is assignment to global variables inside a branch supported, and when is it not?

I was reading the manual myself and did not see any mention of this, but maybe I missed it or have forgotten…

Thank you for the time you are spending on this issue!

I'm actually glad that someone came across this again. My friend posted this as a potential bug about two months ago and didn't get much response. This problem manifests itself only with variables declared with the __device__ specifier, in which case memory operations on those variables are sometimes converted into conditional selects. This means that all threads load from and store to the variable, not only those for which the condition threadIdx.x == 0 holds, introducing a race condition.

In your example, the resulting PTX for the second kernel is as follows:

.entry _Z9theAnswerPii (
    .param .u32 __cudaparm__Z9theAnswerPii_result,
    .param .s32 __cudaparm__Z9theAnswerPii_param)
{
    .reg .u32 %r<9>;
    .reg .pred %p<3>;
    .loc    16    9    0
$LBB1__Z9theAnswerPii:
    cvt.u32.u16     %r1, %tid.x;
    mov.u32         %r2, 0;
    setp.eq.u32     %p1, %r1, %r2;
    ld.param.s32    %r3, [__cudaparm__Z9theAnswerPii_param];
    ld.global.s32   %r4, [gval];
    selp.s32        %r5, %r3, %r4, %p1;
    st.global.s32   [gval], %r5;
    .loc    16    12    0
    bar.sync        0;
    @!%p1 bra       $Lt_1_1794;
    .loc    16    14    0
    ld.global.s32   %r6, [gval];
    ld.param.u32    %r7, [__cudaparm__Z9theAnswerPii_result];
    st.global.s32   [%r7+0], %r6;
$Lt_1_1794:
    .loc    16    15    0
    exit;
$LDWend__Z9theAnswerPii:
} // _Z9theAnswerPii

The following code section is always executed, by all threads. Even though only the first thread stores the new value, all the other threads store back the old value:

ld.global.s32   %r4, [gval];
selp.s32        %r5, %r3, %r4, %p1;
st.global.s32   [gval], %r5;

Note that declaring the variable volatile makes sure that the store is correctly wrapped in a branch, so that there is no race condition. The new section becomes:

setp.eq.u32     %p1, %r1, %r2;
@!%p1 bra       $Lt_1_1794;
ld.param.s32    %r3, [__cudaparm__Z9theAnswerPii_param];
st.volatile.global.s32  [gval], %r3;
$Lt_1_1794:

Note that the branch instruction ‘@!%p1 bra $Lt_1_1794;’ replaces the conditional select ‘selp.s32 %r5, %r3, %r4, %p1;’.
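In source terms, if I read the PTX above correctly, the workaround amounts to a one-word change in the declaration:

__device__ volatile int gval;   // volatile: the store is now emitted under a real branch

__global__ void theAnswer(int *result, int param) {
    if (threadIdx.x == 0)       // (**) compiled as a branch, not a selp
        gval = param;
    __syncthreads();
    if (threadIdx.x == 0)
        *result = gval;
}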

So, in other words, you are saying it is transformed into:

gval = (threadIdx.x == 0 ? 42 : gval);

I heard this used to be the case in the old days, but I thought it had been corrected (hence my answer (b)). I never noticed that it happens only with __device__ variables, and in the past I had trouble reducing the bug from a big code base to a few lines. I stumbled on this problem again only recently and was able to reduce it to the simple code above, probably because this time I (accidentally) used a __device__ variable :)

Yes, exactly. Removing the branch is typically a valid compiler transformation only if the address being accessed is guaranteed to point to an allocated variable in all possible executions of the program. This is much easier to prove for global variables than for those allocated dynamically via malloc, which may explain why nvcc only does this for __device__ variables.
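To spell the argument out, a toy illustration of my own (the names `original` and `converted` are made up):

__device__ int g;

// Sequentially, these two forms are interchangeable, provided g is always
// a valid allocated location (true for __device__ globals):
__device__ void original(bool p, int x)  { if (p) g = x; }
__device__ void converted(bool p, int x) { g = p ? x : g; }  // every thread loads AND stores

// Run concurrently, they are NOT interchangeable: in converted(), a thread
// with p == false stores back the stale value it loaded and can overwrite
// the new value just written by the thread with p == true -- exactly the
// race visible in the PTX above.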

In my opinion, the compiler should be made aware that the code is executing in parallel, and this type of transformation should only be considered valid for thread-local variables.