Possible bug: global variables, predication nvcc-generated PTX contains race condition on store to g

During an experiment, I noticed the CUDA 2.3 compiler using predication in a manner I believe to be incorrect. Consider the following CUDA kernel:


// test.cu


__device__ float glb_Test;

extern "C" __global__ void k_test() {

	if (threadIdx.x == 0) {

		glb_Test *= 2.5f;



Only one thread should ever store to the global variable. Now consider this kernel compiled as PTX:


// nvcc -ptx test.cu


	.global .f32 glb_Test;

	.entry k_test


	.reg .u32 %r<4>;

	.reg .f32 %f<6>;

	.reg .pred %p<3>;

	.loc	15	4	0


	ld.global.f32 	%f1, [glb_Test];

	mov.f32 	%f2, 0f40200000;	 	// 2.5

	mul.f32 	%f3, %f1, %f2;

	cvt.u32.u16 	%r1, %tid.x;

	mov.u32 	%r2, 0;

	setp.eq.u32 	%p1, %r1, %r2;

	selp.f32 	%f4, %f3, %f1, %p1;

	st.global.f32 	[glb_Test], %f4;

	.loc	15	8	0



	} // k_test

The global variable is loaded, the multiply computed, and either the original loaded value or the new computed value are selected based on %tid.x. The selected result is then UNCONDITIONALLY stored back to the global variable.

For kernels with 1 thread, this results in the correct behavior. For kernels with 2 or more threads, the st.global presents a race condition. Threads with %tid.x > 0 attempt to store the original value, while thread %tid.x=0 stores the computed value. I am not aware of this behavior documented anywhere and believe it may be constitute a bug in the compiler.

It seems that either the st.global should be predicated thereby rendering the selp unnecessary, or a branch instruction should pass control around the PTX basic block for threads with $tid.x != 0.

Here is my platform (Ubuntu 9.04, Linux x64, CUDA 2.3):

kerrmudgeon@ CUDAmodules $ uname -a

Linux lucerne 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux

kerrmudgeon@ CUDAmodules $ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

kerrmudgeon@ CUDAmodules $