Possible bug in CUDA 2.2: error when all threads read one global variable


Maybe I’m doing something wrong, but this is too strange. Please have a look at this kernel:

(SimplexState is an enum with ~8 values)

__global__ void
doRCEStepsD(float *simplices,
			float *pointA,
			float *pointB,
			unsigned int *_max,
			SimplexState *simplexState,
			unsigned int *activeSimplices,
			unsigned int *pNumActiveSimplices,
			unsigned int constantsCount,
			unsigned int log2ThreadsPerSimplex,
			unsigned int log2SimplicesPerBlock)
{
	unsigned int locIndex = threadIdx.x;
	unsigned int simplexSize = IMUL((constantsCount + 1u), constantsCount);

	// load the number of simplices this kernel has to consider

	/* this works!
	__shared__ unsigned int numActiveSimplices;
	if (locIndex == 0) {
		numActiveSimplices = (*pNumActiveSimplices);
	}
	__syncthreads();
	*/

	// this doesn't work!
	unsigned int numActiveSimplices;
	numActiveSimplices = (*pNumActiveSimplices);

	// ...
}


“pNumActiveSimplices” is allocated with cudaMalloc and has the size of a single “unsigned int”.

If only one thread loads it into a shared variable, everything is fine. But if every thread of every block reads it directly, I get the error

“cutilCheckMsg cudaThreadSynchronize error, line 1260 : unknown error”.

Does anyone have a clue why the first version works, but the second doesn’t?
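For reference, here is a minimal standalone sketch of the two access patterns described above, with hypothetical names (`readCount`, `pCount`, `out`), not the original kernel. Concurrent reads of a single global word by all threads are legal in CUDA, so if the second pattern crashes, the cause is likely elsewhere (out-of-bounds access, invalid pointer, or a driver/toolkit issue):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void readCount(const unsigned int *pCount, unsigned int *out)
{
	// Pattern 1: one thread loads the value, then broadcasts it
	// to the whole block through shared memory.
	__shared__ unsigned int count;
	if (threadIdx.x == 0) {
		count = *pCount;
	}
	__syncthreads();  // make the loaded value visible to all threads

	// Pattern 2: every thread reads the same global address directly.
	// Concurrent reads of one location do not conflict, so this is
	// also valid CUDA.
	unsigned int direct = *pCount;

	out[threadIdx.x] = count + direct;
}
```

Both patterns should produce identical results; the shared-memory version merely trades one global load per thread for one load plus a barrier.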

I’m using CUDA 2.3 (Windows Vista, GTX 280).

EDIT: Sorry, I originally wrote 2.2; I forgot that version 2.3 was already installed.