Shared Memory Problems - __syncthreads() doesn't work?

Hi,

I am trying to write a kernel which processes an image row by row: if the image has 1000 rows, I launch 1000 blocks of 512 threads each (the maximum threads-per-block count on my GTS 450). Each block processes one row of the image. For example, if the image size is 1500x1500, I launch the kernel with <<<1500,512>>> and each block processes its corresponding row (indicated by the blockIdx.x variable) in a sliding-window fashion: in the first pass of a block, the pixels [0,511] of row blockIdx.x are processed, then the pixels [512,1023], and finally [1024,1499]. I tried to implement the full algorithm, but many strange errors occurred, so I had to simplify it to the point where it only writes solid colors to the target image via the surf2Dwrite function. (By the way, the kernel writes directly into a D3D9 texture via a surface reference.)

Current state of the code is the following:

__global__ void ShadowKernel2(RayMarchInitInfo* input)
{
    // One counter per block, shared by all of its threads.
    // (__shared__ variables have to be declared at function scope.)
    __shared__ unsigned int passCount;

    unsigned int tid;
    float dx;
    float dy;
    float dz;

    // Initialization step 1: global to shared memory transfer, done by thread 0.
    if (threadIdx.x == 0)
    {
        passCount = 0;
        __threadfence();
    }
    __syncthreads();

    dx = const_dx[0];
    //dy = const_deltaY[0];
    //dz = const_dz[0];
    __syncthreads();

    while (true)
    {
        __syncthreads();
        tid = passCount * blockDim.x + threadIdx.x;

        __syncthreads();
        surf2Dwrite(0xFFFFFF00, surf_Default, tid * sizeof(unsigned int), blockIdx.x);

        __syncthreads();
        if (tid >= 750)
            break;

        __syncthreads();
        if (threadIdx.x == 0)
        {
            passCount++;
            __threadfence();
        }
        __syncthreads();
    }
}

This kernel is only for debugging purposes. There is a shared memory variable called passCount, which only thread 0 of the current block modifies; it controls the current position of the whole block within its image row. Each thread of the block writes a color to the position tid = passCount*blockDim.x + threadIdx.x. After writing to the image, if a thread's offset from the first pixel of the row is 750 or more (I made this number a constant to simplify debugging), the thread exits its main loop. Since only thread 0 modifies the shared variable passCount, I placed a __threadfence() after every change to it. Moreover, I explicitly synchronized ALL the threads in the block to avoid any undefined behavior by placing a __syncthreads() after every line. const_dx[0] is a constant memory array that I use in the original algorithm, but it doesn't do anything meaningful in the current code.

When I run this kernel with the line dx=const_dx[0] enabled and with the call ShadowKernel2<<<1400,512>>>(inputList), it outputs the following image:

According to my observations, some of the threads in some blocks erroneously read passCount as 2 where the correct value is 1, and their writes appear shifted to the right by 512 pixels. This happens despite all the __syncthreads() and __threadfence() calls after each line, and I can't find any meaningful explanation for it, because all threads are heavily synchronized!

The only clue I have is that if I remove or comment out the line dx=const_dx[0], the kernel works as expected:

So I am absolutely clueless about what causes this error. The access to the shared memory is heavily synchronized, so there shouldn't be any race condition between threads. I desperately need advice on this situation.

Thanks in advance.

The __syncthreads() calls in your example aren't encountered by all threads, because some threads exit the loop earlier than others. Make sure __syncthreads() is always encountered by all threads (or by none).
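
For illustration, here is a minimal sketch of the safe pattern (hypothetical kernel and buffer names, not your code): the exit test is uniform across the block, so every __syncthreads() is reached by all threads or by none.

// Minimal sketch (hypothetical names): a loop driven by a shared counter
// in which every __syncthreads() is reached by the entire block.
__global__ void sharedCounterLoop(unsigned int* out, unsigned int n)
{
    __shared__ unsigned int pass;
    if (threadIdx.x == 0)
        pass = 0;
    __syncthreads();

    // Uniform exit test: every thread reads the same shared value after a
    // barrier, so the whole block leaves the loop in the same iteration.
    while (pass * blockDim.x < n)
    {
        unsigned int tid = pass * blockDim.x + threadIdx.x;

        if (tid < n)                 // divergence inside the body is fine
            out[tid] = 0xFFFFFF00;
        __syncthreads();             // reached by every thread still in the loop

        if (threadIdx.x == 0)
            pass++;
        __syncthreads();             // publish the new pass value to the block
    }
}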

A quick question: if there are 512 threads in a block and 100 of them have exited the kernel, does CUDA automatically detect that those 100 threads finished their job and behave accordingly? I mean, after 100 threads exit, will CUDA synchronize the 412 threads still left at a __syncthreads() call, or just continue, since there aren't 512 active threads anymore?

Secondly, if this is the case, how can the removal of the line dx=const_dx[0] change the result of the kernel?

No, all threads still need to be running, or the behavior is undefined. And because the behavior is undefined, a seemingly unrelated change such as removing dx=const_dx[0] can alter the generated code and with it the outcome. I vaguely remember that even compute capability 1.x and 2.x devices behave differently in the undefined case.

Ok, then, will the following code be correct?

__global__ void ShadowKernel2(RayMarchInitInfo* input)
{
    __shared__ unsigned int passCount;

    unsigned int tid;
    float dx;
    float dy;
    float dz;

    // Initialization step 1: global to shared memory transfer, done by thread 0.
    if (threadIdx.x == 0)
    {
        passCount = 0;
        __threadfence();
    }
    __syncthreads();

    dx = const_dx[0];
    dy = const_deltaY[0];
    dz = const_dz[0];
    __syncthreads();

    while (true)
    {
        // Bind the thread exit condition to passCount, so that ALL threads
        // either do work or exit together in this pass.
        if (passCount * blockDim.x >= 750)
            break;

        tid = passCount * blockDim.x + threadIdx.x;
        __syncthreads();

        if (tid < 750)
            surf2Dwrite(0xFFFFFF00, surf_Default, tid * sizeof(unsigned int), blockIdx.x);

        if (threadIdx.x == 0)
        {
            passCount++;
            __threadfence();
        }
        __syncthreads();
    }
}

In this version of the code I bind the thread exit condition to the shared passCount variable, such that all threads in the block either exit the loop or continue running together. Now, would this work correctly?

That code looks OK; you don't even need the __threadfence() calls, because __syncthreads() already acts as a block-wide memory barrier: all shared (and global) memory writes made before it are visible to every thread of the block after it.
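
To see why, consider this toy kernel (illustrative only, hypothetical names): the barrier alone guarantees that the shared write by thread 0 is visible to the whole block without a __threadfence().

// Toy illustration (hypothetical names): __syncthreads() is both a barrier
// and a block-wide memory fence for thread 0's shared-memory write.
__global__ void counterWithoutFence(unsigned int* out)
{
    __shared__ unsigned int count;
    if (threadIdx.x == 0)
        count = 42;            // no __threadfence() needed here
    __syncthreads();           // barrier + visibility guarantee
    out[threadIdx.x] = count;  // every thread reads 42
}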

Usually in this case passCount wouldn't be made a shared variable. Each thread would have its own instance, counting on its own, but with an exit condition that is the same for all threads. This would avoid having to sync the threads at all.
However, your code should work too, so you can take it as the basis for your more complex code.
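
As a sketch of that variant (hypothetical names, with a plain buffer standing in for your surf2Dwrite call): each thread strides through its row on a private counter, and since no shared state is involved, no __syncthreads() is needed at all.

// Sketch (hypothetical names): per-thread counters instead of a shared one.
// Thread k handles pixels k, k + blockDim.x, k + 2*blockDim.x, ... of its row.
__global__ void blockStrideRow(unsigned int* image, unsigned int rowLength)
{
    unsigned int* row = image + blockIdx.x * rowLength;  // this block's row

    // No shared data, hence no barriers: each thread simply stops once its
    // own index runs past the end of the row.
    for (unsigned int tid = threadIdx.x; tid < rowLength; tid += blockDim.x)
        row[tid] = 0xFFFFFF00;
}

The same indexing works unchanged if you replace the buffer store with surf2Dwrite(0xFFFFFF00, surf_Default, tid * sizeof(unsigned int), blockIdx.x).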