During an experiment, I noticed the CUDA 2.3 compiler using predication in a manner I believe to be incorrect. Consider the following CUDA kernel:
//
// test.cu
//
__device__ float glb_Test;

extern "C" __global__ void k_test() {
    if (threadIdx.x == 0) {
        glb_Test *= 2.5f;
    }
}
Only one thread should ever store to the global variable. Now consider this kernel compiled as PTX:
//
// nvcc -ptx test.cu
//
.global .f32 glb_Test;

.entry k_test
{
    .reg .u32 %r<4>;
    .reg .f32 %f<6>;
    .reg .pred %p<3>;
    .loc    15  4   0
$LBB1_k_test:
    ld.global.f32   %f1, [glb_Test];
    mov.f32         %f2, 0f40200000;    // 2.5
    mul.f32         %f3, %f1, %f2;
    cvt.u32.u16     %r1, %tid.x;
    mov.u32         %r2, 0;
    setp.eq.u32     %p1, %r1, %r2;
    selp.f32        %f4, %f3, %f1, %p1;
    st.global.f32   [glb_Test], %f4;
    .loc    15  8   0
    exit;
$LDWend_k_test:
} // k_test
The global variable is loaded, the multiply is computed, and either the original loaded value or the newly computed value is selected based on %tid.x. The selected result is then UNCONDITIONALLY stored back to the global variable.
For kernels with 1 thread, this produces the correct behavior. For kernels with 2 or more threads, the st.global presents a race condition: threads with %tid.x > 0 attempt to store the original value, while thread %tid.x = 0 stores the computed value. I am not aware of this behavior being documented anywhere and believe it may constitute a bug in the compiler.
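To make the report easier to reproduce, here is a minimal self-contained host harness (the file name, launch configuration, and initial value are my own choices, not part of the original test case). It launches the kernel with 256 threads; if the race described above fires, the stale initial value can win the store and the printed result will not be 2.5:

```
// repro.cu -- hypothetical reproduction harness; build with: nvcc repro.cu -o repro
#include <cstdio>

__device__ float glb_Test;

extern "C" __global__ void k_test() {
    if (threadIdx.x == 0) {
        glb_Test *= 2.5f;
    }
}

int main() {
    float init = 1.0f, result = 0.0f;
    cudaMemcpyToSymbol(glb_Test, &init, sizeof(float));
    k_test<<<1, 256>>>();           // >1 thread exposes the unconditional store
    cudaThreadSynchronize();
    cudaMemcpyFromSymbol(&result, glb_Test, sizeof(float));
    printf("glb_Test = %f (expected 2.5)\n", result);
    return 0;
}
```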
It seems that either the st.global should be predicated, thereby rendering the selp unnecessary, or a branch instruction should transfer control around the PTX basic block for threads with %tid.x != 0.
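For illustration, the predicated-store alternative I have in mind would replace the final selp/st.global pair with something like the following (a hand-written sketch, not actual compiler output):

```
    @%p1 st.global.f32  [glb_Test], %f3;   // store executes only when %tid.x == 0
```

With the store guarded by %p1, the selp and register %f4 become unnecessary, and threads with %tid.x != 0 never write to glb_Test.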
Here is my platform (Ubuntu 9.04, Linux x64, CUDA 2.3):
kerrmudgeon@ CUDAmodules $ uname -a
Linux lucerne 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:58:03 UTC 2009 x86_64 GNU/Linux
kerrmudgeon@ CUDAmodules $ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Jul_30_09:24:36_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
kerrmudgeon@ CUDAmodules $
Regards,
Kerrmudgeon