Using __syncthreads(): still at n00b status

Hi all,

The use of __syncthreads() has me confused.

I’m trying to do something really simple:

  1. load an original array into a shared memory array

  2. write the contents of the shared memory array into a global memory array

  3. print out the contents of the global memory array. I should get global memory array = original array
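Roughly, what I'm trying to do looks like this minimal sketch (the names and sizes here are made up for the post, not my real code):

```cuda
// Minimal sketch of the steps above -- illustrative names only.
__global__ void copyThroughShared(const int* original, int* result, int n)
{
    extern __shared__ int sArray[];  // dynamic shared memory, sized at launch

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // step 1: load the original array into shared memory
        sArray[threadIdx.x] = original[i];

        // step 2: write the shared memory contents out to global memory
        result[i] = sArray[threadIdx.x];
    }
    // step 3 (printing and comparing) happens on the host after a cudaMemcpy
}
```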

Everything works fine if I don't use __syncthreads(), but if I add __syncthreads() after loading the original array into the shared memory array, my output is wrong.

Why is this?

my code:

resultIndex = w + h*dest->width;

extern __shared__ int sArray[];
int* sKernel = (int*)&sArray;

//put a small matrix into the shared memory of each block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
	sKernel[threadIdx.x + threadIdx.y*matrixWidth] = kernel->matrixGPUelement(threadIdx.x + threadIdx.y*matrixWidth);

__syncthreads(); //this screws up my result

output[resultIndex] = sKernel[threadIdx.x + threadIdx.y*matrixWidth];

What are the variables w, h, and dest->width set to?

h = blockIdx.y * blockDim.y + threadIdx.y;
w = blockIdx.x * blockDim.x + threadIdx.x;

dest->width = 64 for my testing, and kernel->width = matrixWidth = 3.

Hmm, two comments:

  1. __syncthreads() appears to be unnecessary for the code you post because each thread writes to and reads from a separate shared memory location. (In fact, shared memory is completely unnecessary in this code fragment, but that might be because you deleted some lines to simplify the example for the forum.) __syncthreads() is needed if you are going to have two different threads read and write to the same shared memory location at different points in your code. The execution barrier ensures that all the writes finish before different threads read the values. If each thread reads the same location it wrote to, the barrier is not required.

  2. Adding the __syncthreads() should do nothing to this code, so the fact that you get a wrong answer is still concerning. Do you check for return codes from all the CUDA function calls?
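To make point 1 concrete, here is a sketch of the kind of pattern that does need the barrier: each thread reads a shared memory location that a different thread wrote (names are made up for illustration):

```cuda
// Sketch: a shared-memory reversal within a block. Thread t reads the
// element written by thread (blockDim.x - 1 - t), so all writes must
// complete before any reads -- this is where __syncthreads() is required.
__global__ void reverseBlock(const int* in, int* out)
{
    extern __shared__ int s[];

    int t = threadIdx.x;
    s[t] = in[blockIdx.x * blockDim.x + t];

    __syncthreads();  // without this, thread t may read s[...] before another thread has written it

    out[blockIdx.x * blockDim.x + t] = s[blockDim.x - 1 - t];
}
```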

Thanks for the fast reply.

I worked on the code some more and found out __syncthreads() wasn’t the source of the problem. I still don’t know what’s causing the problem, however, and my kernel wasn’t returning any errors.

Here’s a little more of my code, hopefully it will help pinpoint the problem.

Let me define some variables…

int* sSub = (int*)&sArray;             //shared memory array for sub-sample
int* sKernel = (int*)&sArray[256*4];   //shared memory array for kernel

resultIndex = w + h*dest->width;       //for the grid
sampleIndex = w + h*image->width;      //for the grid
kernelIndex = threadIdx.x + threadIdx.y*kernel->width;  //for every block
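(Side note: with the dynamic shared array carved up like this, the launch has to reserve enough shared memory to cover the highest offset used. A host-side sketch, with hypothetical names rather than my real ones:)

```cuda
// Host-side sketch (hypothetical names): reserve enough dynamic shared
// memory at launch to cover both regions carved out of sArray above,
// i.e. 256*4 ints for sSub plus the kernel matrix for sKernel.
size_t sharedBytes = (256*4 + matrixWidth*matrixWidth) * sizeof(int);
myKernel<<<gridDim, blockDim, sharedBytes>>>(/* arguments */);
```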

Doing this works; it prints out the kernel in every block:

//place kernel in shared memory
//one kernel in every block, all kernels fit in a block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
	sKernel[kernelIndex] = kernel->matrixGPUelement(kernelIndex);

output[resultIndex] = sKernel[kernelIndex];

Doing this works too; it prints out the matrix I'm working with:

//place a submatrix of image into shared memory
sSub[sampleIndex] = image->matrixGPUelement(sampleIndex);

output[resultIndex] = sSub[sampleIndex];

But this doesn't work, even though it should print out the kernel in every block:

//place kernel in shared memory
//one kernel in every block, all kernels fit in a block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
	sKernel[kernelIndex] = kernel->matrixGPUelement(kernelIndex);

//place a submatrix of image into shared memory
sSub[sampleIndex] = image->matrixGPUelement(sampleIndex); //I simply added this line of code

output[resultIndex] = sKernel[kernelIndex];

Why does adding that line there mess up my result?

Thanks in advance!