Shared memory problems

I’ve been fighting with shared memory a lot these last few days. It seems shared memory is a bit unstable. For example, in one kernel I allocate an array of a struct in shared memory, the same size as the number of threads in a block. If I add a member to the struct, one I don’t even have to touch, compilation goes fine, but once I run the software it crashes my computer and I have to restart. The Output struct is allocated in shared memory and passed by reference as an argument to func. I do this because I want to lower register usage and get coalesced global writes. (func is in this case a kd-tree raytracer.)

[codebox]template <class Input, class Output, class F>
__global__ void PersistentWorker(Input* input, int count, Output* output, F func)
{
	int rayIndex = (blockIdx.x*threadHeight + threadIdx.y)*threadWidth + threadIdx.x;

	__shared__ int nextRay[threadHeight];
	__shared__ Output tempOut[threadSize];

	if( threadIdx.x == 0 )
	{
		nextRay[threadIdx.y] = rayIndex;
	}
	while( rayIndex < count )
	{
		// trace the ray and stage the result in shared memory
		func.Call(input[rayIndex], tempOut[threadIdx.x + threadWidth*threadIdx.y], rayIndex);

		// copy the row's results to global memory, one float per thread, coalesced
		float* from = (float*)&tempOut[threadWidth*threadIdx.y];
		float* to = (float*)&output[nextRay[threadIdx.y]];
		for( int i = 0; i < sizeof(Output)/4; i++ )
		{
			int index = threadWidth*i + threadIdx.x;
			to[index] = from[index];
		}

		// thread 0 of the row fetches the next batch of rays
		if( threadIdx.x == 0 )
		{
			nextRay[threadIdx.y] = atomicAdd(&g_RayCount, threadWidth);
		}
		rayIndex = nextRay[threadIdx.y] + threadIdx.x;
	}
}[/codebox]
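For context, the surrounding declarations look roughly like this; the constants match what I use, but the Output struct shown here is only a placeholder (the real one is bigger):

[codebox]// For reference only - the real Output layout is different, this is a placeholder.
const int threadWidth  = 32;                        // threads per row (one warp)
const int threadHeight = 2;                         // rows per block
const int threadSize   = threadWidth * threadHeight;

__device__ int g_RayCount;                          // global ray counter used by atomicAdd

struct Output                                       // placeholder hit record
{
	float t, u, v;
	int   triIndex;
};

// the kernel is launched with a threadWidth x threadHeight block,
// e.g. dim3 block(threadWidth, threadHeight);[/codebox]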

Also, when I tried to avoid using an atomic for every iteration by changing the code as follows, it always crashes my computer.

[codebox]template <class Input, class Output, class F>
__global__ void PW3(Input* input, int count, Output* output, F func)
{
	int rayIndex = (blockIdx.x*threadHeight + threadIdx.y)*threadWidth + threadIdx.x;

	__shared__ volatile int nextRay[threadHeight];
	__shared__ Output tempOut[threadSize];
	__shared__ volatile int rayCountLeft[threadHeight];

	if( threadIdx.x == 0 )
	{
		nextRay[threadIdx.y] = rayIndex;
		rayCountLeft[threadIdx.y] = 0;
	}
	while( rayIndex < count )
	{
		func.Call(input[rayIndex], tempOut[threadIdx.x + threadWidth*threadIdx.y], rayIndex);

		volatile float* from = (float*)&tempOut[threadWidth*threadIdx.y];
		volatile float* to = (float*)&output[nextRay[threadIdx.y]];
		for( int i = 0; i < sizeof(Output)/4; i++ )
		{
			int index = threadWidth*i + threadIdx.x;
			to[index] = from[index];
		}

		// fetch four batches of rays with one atomic, then hand them out
		// one threadWidth-sized chunk per iteration
		if( threadIdx.x == 0 )
		{
			if( rayCountLeft[threadIdx.y] == 0 )
			{
				nextRay[threadIdx.y] = atomicAdd(&g_RayCount, threadWidth*4);
				rayCountLeft[threadIdx.y] = 3;
			}
			else
			{
				rayCountLeft[threadIdx.y]--;
				nextRay[threadIdx.y] += threadWidth;
			}
		}
		rayIndex = nextRay[threadIdx.y] + threadIdx.x;
	}
}[/codebox]

I’ve been using CUDA for a few months now and I keep running into strange bugs all the time that waste a lot of development time. Am I the only one?

I’ve tried to examine the assembly but so far I’ve found nothing.

Since no one was too eager to answer my question, I will ask a perhaps easier one. How come, if I comment out this line:

nextRay[threadIdx.y] += threadWidth;

then the program no longer crashes the system (it doesn’t work like it should either), yet if I uncomment the line again and modify the branch so that the line can never execute, I still get a crash?

Are threadHeight & threadSize compile-time constants?
I’ve never had any problems with shared memory that weren’t the result of a bug on my part. I have had missing synchronization cause all kinds of strange behaviors.
Can you run your code in emulation mode? Then you can put in bounds-checking to ensure you aren’t indexing past the end of the shared memory arrays of structs.
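Something along these lines, for example (just a sketch, assuming threadSize is the length of tempOut and that output holds count entries; needs assert.h):

   int idx = threadIdx.x + threadWidth * threadIdx.y;
#ifdef __DEVICE_EMULATION__
   assert(idx < threadSize);                              // still inside tempOut?
   assert(nextRay[threadIdx.y] + threadIdx.x < count);    // still inside output before the copy?
#endif
   func.Call(input[rayIndex], tempOut[idx], rayIndex);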

You’re probably having some out-of-bounds accesses. If you run on Linux, valgrind can help; otherwise, try to start with a very simple kernel and grow towards the current kernel code so that you can figure out exactly where the problem is. Do it very carefully and make sure the compiler doesn’t optimize out the kernel.

Furthermore, your shared memory code seems to be faulty.

__shared__ int nextRay[threadHeight];

if( threadIdx.x == 0 )
{
   nextRay[threadIdx.y] = rayIndex;
}

///////////////////////////////////////////////////////////////
// You should probably put here a __syncthreads()
///////////////////////////////////////////////////////////////

while( rayIndex < count )
{
   func.Call(input[rayIndex], tempOut[threadIdx.x+threadWidth*threadIdx.y], rayIndex);

   float* from = (float*)&tempOut[threadWidth*threadIdx.y];

   // Without the __syncthreads, nextRay[threadIdx.y] is probably garbage,
   // and this is your index to a memory location!!
   float* to = (float*)&output[nextRay[threadIdx.y]];

   ....

When you comment out a part of your code, the compiler may detect that some bigger parts are completely useless and decide to cut them out as well. However, if you put it in a branch that cannot be evaluated at compile time, it cannot perform that optimisation. Bottom line: the bug might not be in the line you commented out, but somewhere earlier!
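As a toy illustration (a hypothetical kernel, nothing to do with your actual code):

__device__ float expensiveWork(int i)
{
   float x = 0.0f;
   for( int k = 0; k < 100; k++ )
      x += sinf(x + i);                 // stand-in for real work
   return x;
}

__global__ void Toy(float* out, int flag)
{
   float x = expensiveWork(threadIdx.x);

   // out[threadIdx.x] = x;             // commented out: x is dead, the loop above can vanish

   if( flag )                           // runtime branch: the compiler has to keep the code,
      out[threadIdx.x] = x;             // even if flag happens to be 0 at run time
}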

I’ve tried to increase the complexity of the kernel incrementally, from a very basic one towards the complete one. However, the result isn’t a straight path: sometimes the crash depends on a combination of two items, sometimes on the size of a struct. The only common theme is that it doesn’t seem to depend on the logic.

Btw, I should clarify: threadWidth is 32 (threadHeight = 2), so since each row of the block is a single warp there shouldn’t be any need for synchronization.
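For that to work without __syncthreads(), my understanding is the shared variables have to be volatile so the compiler can’t keep a value in a register between the threads of the warp, which is what the second kernel already does:

[codebox]__shared__ volatile int nextRay[threadHeight];
__shared__ volatile int rayCountLeft[threadHeight];[/codebox]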

I’m getting strange results from device emulation. I would assume the calculation performs one instruction per warp at a time, but it almost looks like each thread takes a lap of the main loop, one at a time, which obviously sounds ridiculous.
Could it be that, if there aren’t any sync barriers, each thread in device emulation takes an arbitrary number of steps before the next thread in the same warp?