write the contents of the shared memory array into a global memory array
print out the contents of the global memory array. I should get global memory array = original array
Everything works fine if I don't use __syncthreads(),
but if I use __syncthreads() after loading the original array into the global memory array, then my output is wrong.
Why is this?
My code:
resultIndex = w+h*dest->width;
extern __shared__ int sArray[];
int* sKernel = (int*)&sArray;
//put a small matrix into shared memory of each block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
    sKernel[threadIdx.x+threadIdx.y*matrixWidth] = kernel->matrixGPUelement(threadIdx.x+threadIdx.y*matrixWidth);
__syncthreads(); //this screws up my result
output[resultIndex]=sKernel[threadIdx.x+threadIdx.y*matrixWidth];
__syncthreads() appears to be unnecessary for the code you post because each thread writes to and reads from a separate shared memory location. (In fact, shared memory is completely unnecessary in this code fragment, but that might be because you deleted some lines to simplify the example for the forum.) __syncthreads() is needed if you are going to have two different threads read and write to the same shared memory location at different points in your code. The execution barrier ensures that all the writes finish before different threads read the values. If each thread reads the same location it wrote to, the barrier is not required.
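For example, here is a minimal sketch of a pattern that does need the barrier (the kernel name and launch are only illustrative, not taken from your code): every thread writes its own shared memory slot and then reads a slot written by a different thread, so all of the writes have to complete before any of the reads.

__global__ void reverseBlock(int* out, const int* in)
{
    extern __shared__ int tile[];
    int i = threadIdx.x;
    tile[i] = in[i];                      // each thread writes only its own slot
    __syncthreads();                      // required: the next read touches another thread's slot
    out[i] = tile[blockDim.x - 1 - i];    // read a value written by a different thread
}
// launched with the shared memory size as the third launch parameter, e.g.
// reverseBlock<<<1, n, n*sizeof(int)>>>(d_out, d_in);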
Adding __syncthreads() should do nothing to the code you posted, so the fact that you get a wrong answer is still concerning. Do you check the return codes from all of the CUDA function calls?
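If you aren't checking them, something along these lines right after the kernel launch will surface problems the kernel itself can't report (a generic host-side fragment, not specific to your code):

cudaError_t err = cudaGetLastError();      // catches launch errors (bad configuration, etc.)
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();             // catches errors raised while the kernel ran
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));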
I worked on the code some more and found out __syncthreads() wasn’t the source of the problem. I still don’t know what’s causing the problem, however, and my kernel wasn’t returning any errors.
Here's a little more of my code; hopefully it will help pinpoint the problem.
Let me define some variables…
int* sSub = (int*)&sArray; //shared memory array for sub-sample
int* sKernel = (int*)&sArray[256*4]; //shared memory array for kernel
resultIndex = w+h*dest->width; //for the grid
sampleIndex = w+h*image->width; //for the grid
kernelIndex = threadIdx.x+threadIdx.y*kernel->width; //for every block
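Both shared arrays come out of the single dynamic extern __shared__ allocation, so the launch has to reserve enough bytes for the sub-sample plus the kernel. It looks roughly like this (the kernel name, grid/block variables, and kernelWidth are placeholders here, not my exact launch):

// 256*4 ints for sSub, followed by one kernel->width x kernel->width block for sKernel
size_t sharedBytes = (256*4 + kernelWidth*kernelWidth) * sizeof(int);
convolve<<<grid, block, sharedBytes>>>(dest, image, kernel);   // illustrative names only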
Doing this works; it prints out the kernel in every block:
//place kernel in shared memory
//one kernel in every block, all kernels fit in block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
    sKernel[kernelIndex] = kernel->matrixGPUelement(kernelIndex);
output[resultIndex]= sKernel[kernelIndex];
Doing this works too; it prints out the matrix I'm working with:
//place a submatrix of image into shared memory
sSub[sampleIndex] = image->matrixGPUelement(sampleIndex);
output[resultIndex]= sSub[sampleIndex];
But this doesn't work, even though it should print out the kernel in every block:
//place kernel in shared memory
//one kernel in every block, all kernels fit in block
if(threadIdx.y < kernel->width && threadIdx.x < kernel->width)
    sKernel[kernelIndex] = kernel->matrixGPUelement(kernelIndex);
//place a submatrix of image into shared memory
sSub[sampleIndex] = image->matrixGPUelement(sampleIndex); // I simply added this line of code
output[resultIndex]= sKernel[kernelIndex];
Why does adding that line there mess up my result?