NVCC bug report: a runtime error

I have just run into what looks like a bug in NVCC. I tried NVCC 2.1 on both x86 and x86_64, and it fails on both.

Compile and run the following code and the result is wrong. The output is supposed to be 111, but on x86_64 it prints 0 and on x86 it prints a garbage value. Device emulation mode works fine, though.

Compile options:

nvcc test.cu -arch=compute_13

The code is as follows:


// test.cu

#include <cuda.h>

#include <stdio.h>


// array based lock-free queue

template <unsigned int N>
class LFQueue{
public:
		__device__ LFQueue():head(0),tail(0),count(0){}

		__device__ void push(void * p){
				unsigned int index=atomicInc(&tail,N-1);	  // get the next index
				array[index]=p;	// store the element in the claimed slot
				atomicInc(&count,N);	// increase the number of available elements
		}

		void * array[N];
		unsigned int head;
		unsigned int tail;
		unsigned int count;
};


__global__ void foo(int * arr){

		LFQueue<2> q;

		int id;





const int numBlocks=1;

const int numThreads=1;

const int N=numBlocks*numThreads;

int main(){


		int * h_arr=(int *)malloc(sizeof(int)*N);

		int * d_arr;

		cudaMalloc((void **)&d_arr,sizeof(int)*N);




		for(int i=0;i<N;i++){





		return 0;


What kind of GPU are you using? Which OS is this?


OS: RHEL 5 x86 and Fedora 9 x86_64

I have just tried modifying the code to get rid of the template and the class, but the error persists. Could someone tell me how to work around this bug?

Any suggestion is appreciated.

Thanks. I’ve reproduced this problem and opened bug 533279. I’ll let you know if there’s a workaround.

$ nvcc testQ.cu -arch=compute_20

nvcc fatal : Value ‘compute_20’ is not defined for option ‘gpu-architecture’

This happens on both RHEL 5 x86 and Fedora 9 x86_64. The CUDA version is 2.1.

Am I missing something? Or should I try another version of CUDA?

Thanks very much for your reply.

What is going on here is that ‘atomicInc’ operates on addresses in the “global” address space, but the program is written as if it could operate on thread-local storage (the references to ‘&tail’ and ‘&count’ point into the storage for ‘q’, which is local to the kernel). As a result, these atomic ops are actually stepping on the global address space, where ‘d_arr’ is allocated. The compiler decides where to place ‘q’, and where ‘d_arr’ is allocated can differ between system configurations. That explains the runtime differences you noticed.

Check out the description for atomicInc in the CUDA 2.1 Reference Manual.

Hope that helps.
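To make the point about address spaces concrete, here is a minimal sketch of ‘atomicInc’ used on a counter that really does live in global memory. The names ‘bump’ and ‘bumpOnDevice’ are made up for illustration and are not from the original program. It assumes a device that supports 32-bit global atomics (compute capability 1.1 or later), and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Every thread increments one counter that lives in global memory.
// atomicInc stores (old >= limit) ? 0 : old + 1 and returns old.
__global__ void bump(unsigned int *counter, unsigned int limit) {
    atomicInc(counter, limit);
}

// Hypothetical host helper: run `threads` increments in a single block
// and return the final counter value.
unsigned int bumpOnDevice(unsigned int threads, unsigned int limit) {
    unsigned int *d_counter = 0;
    unsigned int h_counter = 0;

    // cudaMalloc returns global memory, which is what atomicInc expects.
    cudaMalloc((void **)&d_counter, sizeof(unsigned int));
    cudaMemcpy(d_counter, &h_counter, sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    bump<<<1, threads>>>(d_counter, limit);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(&h_counter, d_counter, sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_counter);
    return h_counter;
}
```

With three threads and a limit of 7, ‘bumpOnDevice(3, 7)’ should return 3. With four threads and a limit of 2, the counter goes 1, 2, 0, 1, so it should return 1. That wrap-around is the behaviour the queue relies on for ‘tail’.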

I see. I used dynamically allocated memory to store q, and the code now runs without error.

Thanks very much!
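For anyone who finds this thread later, a workaround along those lines might look roughly like the sketch below. It is not the exact code from the thread. It assumes the queue is allocated with cudaMalloc and zeroed with cudaMemset instead of being constructed inside the kernel, it only implements push, and the names ‘pushAll’ and ‘countAfterPushes’ are made up for illustration. Error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Array-based queue modeled on the LFQueue from the first post (push only).
template <unsigned int N>
struct LFQueue {
    void *array[N];
    unsigned int head;
    unsigned int tail;
    unsigned int count;

    __device__ void push(void *p) {
        unsigned int index = atomicInc(&tail, N - 1);  // claim a slot, wraps after N-1
        array[index] = p;                              // store the element
        atomicInc(&count, N);                          // one more element available
    }
};

// The queue is passed in by pointer, so &q->tail and &q->count are
// global-memory addresses, which is what atomicInc requires here.
__global__ void pushAll(LFQueue<64> *q, int *items) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    q->push(&items[id]);
}

// Hypothetical host helper: push one item per thread (threads <= 64)
// and return the queue's element count afterwards.
unsigned int countAfterPushes(unsigned int threads) {
    LFQueue<64> *d_q = 0;
    int *d_items = 0;
    LFQueue<64> h_q;

    cudaMalloc((void **)&d_q, sizeof(LFQueue<64>));
    cudaMalloc((void **)&d_items, sizeof(int) * threads);
    cudaMemset(d_q, 0, sizeof(LFQueue<64>));  // head = tail = count = 0

    pushAll<<<1, threads>>>(d_q, d_items);

    // Copy the whole queue back and read the counter on the host.
    cudaMemcpy(&h_q, d_q, sizeof(LFQueue<64>), cudaMemcpyDeviceToHost);
    cudaFree(d_items);
    cudaFree(d_q);
    return h_q.count;
}
```

Calling ‘countAfterPushes(4)’ from main should return 4. The only real change from the failing version is where the queue lives: the atomics now target memory returned by cudaMalloc instead of a variable declared inside the kernel.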