CUDA Warp Synchronization Problem

While generalizing a kernel that shifts the values of a 2D array one position to the right (wrapping around the row boundaries), I have come across a warp synchronization problem. The full code is attached and included below.

The code is meant to work for arbitrary array width, array height, number of thread blocks, and number of threads per block. When choosing a block size of 33 threads (i.e. one more thread than a full warp), the 33rd thread doesn't synchronize when __syncthreads() is called, which corrupts the output data. The problem only appears when there is more than one warp and the width of the array is greater than the number of threads (e.g. width=35 with 34 threads).

The following is a downsized example of what happens (in reality the array would need to have more elements for the kernel to produce the error).

Initial array:

0 1 2 3 4
5 6 7 8 9

Expected Result:

4 0 1 2 3
9 5 6 7 8

Kernel Produces:

4 0 1 2 3
8 5 6 7 8

The first row is shifted correctly (for each block, if there is more than one), but every subsequent row has the second-to-last value repeated. I have tested this on two different cards (an 8600 GT and a GTX 280) and get the same results. Is this just a bug in my kernel, or a problem that can't be fixed by adjusting my code?

The full source file is attached, and included below for convenience.

Thank you.

#include <cstdio>
#include <cstdlib>

// A helper to ensure all accesses use the same logical layout.
inline __device__ __host__ int loc(int x, int y, int width)
{
  return y*width + x;
}

//kernel to shift all items in a 2D array one position to the right (wrapping around rows)
__global__ void shiftRight (int* globalArray, int width, int height)
{
  int temp1=0;                  //temporary swap variables
  int temp2=0;

  int blockRange=0;             //the number of rows that a single block will shift
  if (height%gridDim.x==0)      //logic to account for awkward array sizes
    blockRange = height/gridDim.x;
  else
    blockRange = (1+height/gridDim.x);

  int yStart = blockIdx.x*blockRange;
  int yEnd = yStart+blockRange; //the end condition for the y-loop
  yEnd = min(height,yEnd);      //make sure that the array doesn't go out of bounds

  for (int y = yStart; y < yEnd ; ++y)
  {
    //do the first read so the swap variables are loaded for the x-loop
    temp1 = globalArray[loc(threadIdx.x,y,width)];
    //Each block shifts an entire row by itself, even if there are more columns than threads
    for (int threadXOffset = threadIdx.x ; threadXOffset < width ; threadXOffset+=blockDim.x)
    {
      //blockDim.x is added so that we store the next round of values;
      //this has to be done now, because the next operation will
      //overwrite one of these values
      temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
      __syncthreads();  //sync before the write to ensure all the values have been read
      globalArray[loc((threadXOffset +1)%width,y,width)] = temp1;
      __syncthreads();  //sync after the write to ensure all the values have been written
      temp1 = temp2;    //swap the storage variables
    }
  }
}

int main (int argc, char* argv[])
{
  //set the parameters to be used
  int width = 34;
  int height = 3;
  int threadsPerBlock = 33;
  int numBlocks = 1;

  int memSizeInBytes = width*height*sizeof(int);

  //create the host data and assign each element of the array to equal its index
  int* hostData = (int*) malloc (memSizeInBytes);
  for (int y = 0 ; y < height ; ++y)
    for (int x = 0 ; x < width ; ++x)
      hostData[loc(x,y,width)] = loc(x,y,width);

  //create and allocate the device pointers
  int* deviceData;
  cudaMalloc (&deviceData, memSizeInBytes);
  cudaMemset (deviceData, 0, memSizeInBytes);
  cudaMemcpy (deviceData, hostData, memSizeInBytes, cudaMemcpyHostToDevice);
  cudaThreadSynchronize();

  //launch the kernel
  shiftRight<<<numBlocks,threadsPerBlock>>> (deviceData, width, height);
  cudaThreadSynchronize();

  //copy the device data to a host array
  int* hostDeviceOutput = (int*) malloc (memSizeInBytes);
  cudaMemcpy (hostDeviceOutput, deviceData, memSizeInBytes, cudaMemcpyDeviceToHost);
  cudaFree (deviceData);

  //print out the expected/desired device output
  printf("---- Expected Device Output ----\n");
  printf("   | ");
  for (int x = 0 ; x < width ; ++x)
    printf("%4d ",x);
  printf("\n---|-");
  for (int x = 0 ; x < width ; ++x)
    printf("-----");
  for (int y = 0 ; y < height ; ++y)
  {
    printf("\n%2d | ",y);
    for (int x = 0 ; x < width ; ++x)
      printf("%4d ",hostData[loc((x-1+width)%width,y,width)]);
  }
  printf("\n\n");

  //print out the actual device output
  printf("---- Actual Device Output ----\n");
  printf("   | ");
  for (int x = 0 ; x < width ; ++x)
    printf("%4d ",x);
  printf("\n---|-");
  for (int x = 0 ; x < width ; ++x)
    printf("-----");
  for (int y = 0 ; y < height ; ++y)
  {
    printf("\n%2d | ",y);
    for (int x = 0 ; x < width ; ++x)
      printf("%4d ",hostDeviceOutput[loc(x,y,width)]);
  }
  printf("\n\n");
}

main.cu (3.51 KB)

You are using __syncthreads() inside a for loop, which is only allowed if all threads execute the same number of loop iterations. In your case, they don’t.
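To make that concrete: with width=35 and 34 threads, thread 0 executes the inner loop twice (offsets 0 and 34) while threads 1-33 execute it only once, so thread 0 ends up waiting at a __syncthreads() that no other thread will reach, and the behaviour is undefined. A minimal sketch of the safe pattern, as a hypothetical example kernel (not the code above): keep the loop bounds uniform across the block so every thread reaches every barrier, and confine the divergence to a guarded body. The barrier here only stands in for whatever per-round synchronization a real kernel needs.

__global__ void scaleRow(float* row, int width)
{
  //every thread runs the same number of iterations...
  for (int base = 0; base < width; base += blockDim.x)
  {
    int x = base + threadIdx.x;
    if (x < width)      //...and only the guarded work diverges
      row[x] *= 2.0f;
    __syncthreads();    //reached by all threads of the block
  }
}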

Thanks. That explains everything.

I adjusted the kernel to ensure that all threads within a block are at the same point whenever __syncthreads() is called, and this fixed the issue for that case. However, a slightly stranger problem has now arisen.

If num_threads>=385 (13 or more warps) and the array width>=num_threads, global memory writes don't happen when they should when I run the adjusted kernel on the 8600 GT (4 multiprocessors); on the GTX 280 (30 multiprocessors) everything is fine. I then tweaked the adjusted kernel (see the end of this post) so that it first sets every value of globalArray to 9000, but on the 8600 GT the values are never set to 9000.

The kernel is executed with the exact same host framework as in my initial post (and I have attached the full adjusted source). Also, if either one of the global memory writes ("LINE A" or "LINE B") is commented out, the kernel executes correctly and all the values are 9000. Calling cudaThreadSynchronize() after the kernel launch and checking the error status gives "no error".

I assume I’ve made an amateur mistake somewhere (as I did previously), but I just can’t see where I’m going wrong.

Adjusted kernel that also writes 9000 to each array element (multiple times…):

__global__ void shiftRight ( int* globalArray, int width, int height)
{
  //the write also doesn't happen if this for-loop is replaced with "globalArray[threadIdx.x]=9000;"
  for (int i = 0 ; i < width*height ; ++i)
  {
    globalArray[i]=9000;
  }
  __syncthreads();

  int temp1=0;                  //temporary swap variables
  int temp2=0;

  int blockRange=0;             //the number of rows that a single block will shift
  if (height%gridDim.x==0)      //logic to account for awkward array sizes
    blockRange = height/gridDim.x;
  else
    blockRange = (1+height/gridDim.x);

  int yStart = blockIdx.x*blockRange;
  int yEnd = yStart+blockRange;          //the end condition for the y-loop
  yEnd = min(height,yEnd);               //make sure that the array doesn't go out of bounds

  int xEnd = width - width%blockDim.x;   //round the end of the x-loop down to a multiple of blockDim.x
  int threadXOffset = 0;

  for (int y = yStart; y < yEnd ; ++y)
  {
    //do the first read so the swap variables are loaded for the x-loop
    temp1 = globalArray[loc(threadIdx.x,y,width)];
    //Each block shifts an entire row by itself, even if there are more columns than threads
    for (threadXOffset = threadIdx.x ; threadXOffset < xEnd ; threadXOffset+=blockDim.x)
    {
      //blockDim.x is added so that we store the next round of values;
      //this has to be done now, because the next operation will
      //overwrite one of these values
      temp2 = globalArray[loc((threadXOffset + blockDim.x)%width,y,width)];
      __syncthreads();  //sync before the write to ensure all the values have been read
      globalArray[loc((threadXOffset + 1)%width,y,width)] = temp1;   ////////////////////// LINE A
      __syncthreads();  //sync after the write to ensure all the values have been written
      temp1 = temp2;    //swap the storage variables
    }
    if (threadXOffset < width)
      globalArray[loc((threadXOffset + 1)%width,y,width)] = temp1;   /////////////////////// LINE B
  }
}

main.cu (4.03 KB)

Without having looked at the code: __syncthreads() does not synchronize different blocks. The only way to get inter-block synchronization is separate kernel invocations.
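To illustrate, assuming two hypothetical kernels phaseOne and phaseTwo: kernels launched on the same (default) stream execute in issue order, so the boundary between the two launches acts as a grid-wide barrier.

//no block starts phaseTwo until every block has finished phaseOne
phaseOne<<<numBlocks, threadsPerBlock>>>(deviceData, width, height);
phaseTwo<<<numBlocks, threadsPerBlock>>>(deviceData, width, height);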

I have taken that into account, and my test case only uses one block, to avoid any issues from the lack of block synchronization. On top of that, the first thing the kernel does is have all threads write 9000 to all positions in the array, and even that is ignored when the kernel runs (the result was the same with each thread writing 9000 to a different position in the array).

I have done some more testing (there was a slightly different, but seemingly related, error on the GTX 280), and now I'm pretty sure this is a problem with CUDA rather than with the logic in my code. If I hard-code certain values, or pass them in as parameters (calculated on the host using the same code that would have run in the kernel), then everything works as expected. I know the calculations in the kernel are correct, because copying them to a test array and then to the host shows they are exactly what I expected (and, on top of that, the values aren't copied to the test array when the kernel isn't working properly). Unfortunately, hard-coding or passing in these values is not an option, as they depend on blockIdx when using multiple blocks.
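For reference, a sketch of the pass-in workaround (hypothetical host code): blockRange itself does not depend on blockIdx, so it can be computed on the host with the same ceiling division and passed in as an extra kernel parameter; it is the per-block values derived from blockIdx that cannot be passed in this way.

//same ceiling-division logic as the kernel, moved to the host
int blockRange = (height % numBlocks == 0) ? height / numBlocks
                                           : 1 + height / numBlocks;
//the kernel signature would gain an extra int parameter
shiftRight<<<numBlocks, threadsPerBlock>>>(deviceData, width, height, blockRange);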

Looks like I'm going to have to find an alternative, less efficient way to solve this… unless I'm still missing something…

(By the way I’m using the latest toolkit [downloaded and installed today] with driver version 260.19.06 [the nvidia-current drivers in Ubuntu 10.10])

Edit:

Turns out changing the loop in my kernel to the following allows it to work:

for (y = blockIdx.x*block_range ; y < y_end ; ++y)
{
  temp1 = df[getCell2D(threadIdx.x,y,pitch)];
  //each block has to do an entire row by itself, so it has to go to width
  for (int block_x_offset = 0 ; block_x_offset < width ; block_x_offset+=blockDim.x)
  {
    thread_x_offset = block_x_offset + threadIdx.x;
    //adding blockDim.x because we're storing the next set of df's
    if (thread_x_offset < width)
      temp2 = df[getCell2D((thread_x_offset + blockDim.x)%width,y,pitch)];
    __syncthreads();
    if (thread_x_offset < width)
      df[getCell2D((thread_x_offset + 1)%width,y,pitch)] = temp1;
    __syncthreads();
    temp1 = temp2;
  }
}

While this solves my problem, it doesn't explain why I was seeing such weird behaviour with my previous solution.