During an experiment, I noticed the CUDA 2.3 compiler using predication in a manner I believe to be incorrect. Consider the following CUDA kernel:
//
// test.cu
//
__device__ float glb_Test;

extern "C" __global__ void k_test() {
    if (threadIdx.x == 0) {
        glb_Test *= 2.5f;
    }
}
Only one thread should ever store to the global variable. Now consider this kernel compiled as PTX:
//
// nvcc -ptx test.cu
//
.global .f32 glb_Test;

.entry k_test
{
    .reg .u32 %r<4>;
    .reg .f32 %f<6>;
    .reg .pred %p<3>;
    .loc    15  4   0
$LBB1_k_test:
    ld.global.f32   %f1, [glb_Test];
    mov.f32         %f2, 0f40200000;    // 2.5
    mul.f32         %f3, %f1, %f2;
    cvt.u32.u16     %r1, %tid.x;
    mov.u32         %r2, 0;
    setp.eq.u32     %p1, %r1, %r2;
    selp.f32        %f4, %f3, %f1, %p1;
    st.global.f32   [glb_Test], %f4;
    .loc    15  8   0
    exit;
$LDWend_k_test:
} // k_test
The global variable is loaded, the multiply is computed, and either the original loaded value or the newly computed value is selected based on %tid.x. The selected result is then UNCONDITIONALLY stored back to the global variable.
For kernels with 1 thread, this produces the correct behavior. For kernels with 2 or more threads, the st.global presents a race condition: threads with %tid.x > 0 attempt to store the original value, while thread %tid.x = 0 stores the computed value. I am not aware of this behavior being documented anywhere and believe it may constitute a bug in the compiler.
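To make the report easier to reproduce, here is a minimal self-contained host harness (the file name, launch configuration, and initial value are my own choices, not part of the original test case). It launches the kernel with 256 threads; if the race described above fires, the stale initial value can win the store and the printed result will not be 2.5:

```
// repro.cu -- hypothetical reproduction harness; build with: nvcc repro.cu -o repro
#include <cstdio>

__device__ float glb_Test;

extern "C" __global__ void k_test() {
    if (threadIdx.x == 0) {
        glb_Test *= 2.5f;
    }
}

int main() {
    float init = 1.0f, result = 0.0f;
    cudaMemcpyToSymbol(glb_Test, &init, sizeof(float));
    k_test<<<1, 256>>>();           // >1 thread exposes the unconditional store
    cudaThreadSynchronize();
    cudaMemcpyFromSymbol(&result, glb_Test, sizeof(float));
    printf("glb_Test = %f (expected 2.5)\n", result);
    return 0;
}
```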
It seems that either the st.global should be predicated, thereby rendering the selp unnecessary, or a branch instruction should transfer control around the PTX basic block for threads with %tid.x != 0.
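For illustration, the predicated-store alternative I have in mind would replace the final selp/st.global pair with something like the following (a hand-written sketch, not actual compiler output):

```
    @%p1 st.global.f32  [glb_Test], %f3;   // store executes only when %tid.x == 0
```

With the store guarded by %p1, the selp and register %f4 become unnecessary, and threads with %tid.x != 0 never write to glb_Test.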
Here is my platform (Ubuntu 9.04, Linux x64, CUDA 2.3):
kerrmudgeon@ CUDAmodules $ uname -a
Linux lucerne 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux
kerrmudgeon@ CUDAmodules $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Jul_30_09:24:36_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
kerrmudgeon@ CUDAmodules $
Regards,
Kerrmudgeon