I have just run into a bug in NVCC. I tried NVCC 2.1 on both x86 and x86_64, and both fail.
Compile and run the following code and the result is wrong: the output is supposed to be 111, but on x86_64 it gives 0, and on x86 it prints a garbage value. Device emulation mode works fine, though.
Compile options:
nvcc test.cu -arch=compute_13
The code is as follows:
//==============
// test.cu
#include <cuda.h>
#include <stdio.h>

//==============================
// array based lock-free queue
template <unsigned int N>
class LFQueue{
public:
    __device__ LFQueue():head(0),tail(0),count(0){}
    __device__ void push(void * p){
        unsigned int index=atomicInc(&tail,N-1); // get the next index
        array[index]=p;
        atomicInc(&count,N); // increase the number of available elements
    }
public:
    void * array[N];
    unsigned int head;
    unsigned int tail;
    unsigned int count;
};

__global__ void foo(int * arr){
    LFQueue<2> q;
    int id=blockIdx.x*blockDim.x+threadIdx.x;
    q.push(NULL);
    arr[id]=111;
}

const int numBlocks=1;
const int numThreads=1;
const int N=numBlocks*numThreads;

int main(){
    cudaSetDevice(0);
    int * h_arr=(int *)malloc(sizeof(int)*N);
    int * d_arr;
    cudaMalloc((void **)&d_arr,sizeof(int)*N);
    foo<<<numBlocks,numThreads>>>(d_arr);
    cudaMemcpy(h_arr,d_arr,sizeof(int)*N,cudaMemcpyDeviceToHost);
    for(int i=0;i<N;i++){
        printf("%d\t",h_arr[i]);
        if(i%8==0)
            printf("\n");
    }
    return 0;
}
I have just tried modifying the code to get rid of the templates and the class, but the error persists. Could someone tell me how to work around this bug?
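For reference, even a stripped-down kernel along these lines (a rough sketch of the kind of modification I mean, not my exact code) shows the same behaviour:

// Same pattern with no templates and no class: atomicInc is still applied
// to the addresses of per-thread local variables.
__global__ void foo_plain(int * arr){
    unsigned int tail=0;
    unsigned int count=0;
    void * array[2];
    int id=blockIdx.x*blockDim.x+threadIdx.x;
    unsigned int index=atomicInc(&tail,1);   // get the next index
    array[index]=NULL;
    atomicInc(&count,2);                     // increase the number of available elements
    arr[id]=111;
}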
What is going on here is that atomicInc operates on the global address space, but the program is written as if it could operate on thread-local data: the &tail and &count it receives point into the per-thread storage for q. As a result, those atomic operations end up stepping on the global address space where d_arr happens to be allocated. The compiler is free to decide how to place q, and where d_arr lands can differ between system configurations, which explains the runtime differences you noticed.
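One possible workaround, sketched under the assumption that a single queue shared by the whole grid is what you want (the names d_q/q and the changed kernel signature are mine, not from the original code): give the queue global-memory storage via cudaMalloc, so that the &tail and &count handed to atomicInc are genuine global addresses. The members are zeroed with cudaMemset, since the constructor never runs for cudaMalloc'ed storage. Together with the LFQueue definition and the constants from the code above, the kernel and main() would become something like:

__global__ void foo(int * arr, LFQueue<2> * q){
    int id=blockIdx.x*blockDim.x+threadIdx.x;
    q->push(NULL);     // atomicInc now targets global memory
    arr[id]=111;
}

int main(){
    cudaSetDevice(0);
    int * h_arr=(int *)malloc(sizeof(int)*N);
    int * d_arr;
    LFQueue<2> * d_q;
    cudaMalloc((void **)&d_arr,sizeof(int)*N);
    cudaMalloc((void **)&d_q,sizeof(LFQueue<2>));
    cudaMemset(d_q,0,sizeof(LFQueue<2>));    // head=tail=count=0, matching the constructor
    foo<<<numBlocks,numThreads>>>(d_arr,d_q);
    cudaMemcpy(h_arr,d_arr,sizeof(int)*N,cudaMemcpyDeviceToHost);
    for(int i=0;i<N;i++){
        printf("%d\t",h_arr[i]);
        if(i%8==0)
            printf("\n");
    }
    cudaFree(d_q);
    cudaFree(d_arr);
    free(h_arr);
    return 0;
}

Placing the queue in __shared__ memory and using shared-memory atomics (compute 1.2 and above) would be another option, but the global-memory version keeps the structure closest to the original.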