I’m trying to do something simple in CUDA:

- copy a small array to shared memory
- have each thread calculate the sum of the array and store it in a variable inside the kernel.

```
__global__
void convolution(int* data,int* buff,float* kernel,int* outputvars,int rows,int cols,int channels,int kerneldim){
int idx = getGlobalIdx_3D_3D();
int kernelmid;
extern __shared__ float sharedKernel[];
float *kernelCenter;
if (idx==0){
*(outputvars)=INT_MAX;
*(outputvars+1)=INT_MIN;
*(outputvars+2)=INT_MAX;
*(outputvars+3)=INT_MIN;
}
if (getblockthreadIdx()<kerneldim*kerneldim){ // copy the values from global memory to shared memory
*(sharedKernel+getblockthreadIdx())=*(kernel+getblockthreadIdx());
}
kernelmid = kerneldim%2==1?kerneldim/2:(kerneldim-1)/2;
kernelCenter=kernel+(((kerneldim+1)*kernelmid));
__syncthreads();
/*
if (getblockthreadIdx()<kerneldim*kerneldim){
printf("%d %f\n",getblockthreadIdx(),*(sharedKernel+getblockthreadIdx()));
}
__syncthreads();
*/
int row = idx / (cols*channels);
int col = (idx%(cols*channels))/channels;
float value=0;
int pixel=0;
float kernelVal=0;
int pixelmin=INT_MAX,pixelmax=INT_MIN;
int kernelmidHalf=(kerneldim/2);
if (col>0 && row>0 && row<rows-1 && col<cols-1){
data = data+idx;
for(int r = (-1*kernelmidHalf); r<=kernelmidHalf;r++){
for(int c = -1*kernelmidHalf; c<=kernelmidHalf;c++){
kernelVal=*(kernelCenter+(r*-1*kerneldim)+(c*-1));
value+=kernelVal;
}
}
*(buff+idx)=value;
atomicMin(outputvars,value);
atomicMax(outputvars+1,value);
atomicMin(outputvars+2,pixelmin);
atomicMax(outputvars+3,pixelmax);
}
}
```

and kernel has these values:

```
float kernel[3][3]={
{1,0,-1},
{2,0,-2},
{1,0,-1},
};
```
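
For comparison, here is a minimal sketch of the pattern described above (copy to shared memory, then sum in every thread), reduced to a single 1D block. The names `sumSharedKernel`, `launchSum`, `dKernel` and `dOut` are hypothetical and not part of the code in the question, and the sketch assumes the block has at least `n` threads.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: copy n floats into shared memory, then let every
// thread of the block sum the shared copy. Assumes blockDim.x >= n.
__global__ void sumSharedKernel(const float* kernel, float* out, int n) {
    extern __shared__ float sharedKernel[];

    // One thread per element performs the global -> shared copy.
    if (threadIdx.x < n) {
        sharedKernel[threadIdx.x] = kernel[threadIdx.x];
    }
    // Every thread of the block reaches this barrier before anyone reads.
    __syncthreads();

    float value = 0.0f;
    for (int i = 0; i < n; ++i) {
        value += sharedKernel[i];  // read the shared copy, not global memory
    }
    out[threadIdx.x] = value;
}

// Hypothetical host helper: runs one block of `threads` threads and
// returns the sum computed by thread 0.
float launchSum(const float* hostKernel, int n, int threads) {
    float* dKernel = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dKernel, n * sizeof(float));
    cudaMalloc(&dOut, threads * sizeof(float));
    cudaMemcpy(dKernel, hostKernel, n * sizeof(float), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    sumSharedKernel<<<1, threads, n * sizeof(float)>>>(dKernel, dOut, n);

    // This blocking copy also waits for the kernel to finish.
    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dKernel);
    cudaFree(dOut);
    return result;
}
```

Called with the nine values above and, say, 32 threads, `launchSum` should return 0, since the three columns sum to 4, 0 and -4. One difference from the code in the question: there, `kernelCenter` is derived from the global `kernel` pointer, so the copy in `sharedKernel` is never actually read.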

I know this is not optimized code; I’m just translating some convolution code to CUDA before optimizing it.

The problem is that when I print value I get “-4” when I should get “0”. For some reason the first “column” (the last one added) isn’t being added. The odd part is that if I add a simple **printf** inside the second loop, everything works fine. I also know that both loops are working fine, because when I replace **kernelVal** with:

```
kernelVal=((r+1)*3)+(c+1);
```

I get “36”, which is the sum of the numbers from 0 to 8. I assume it could be a race condition, but I still don’t know how to solve it, and CUDA doesn’t report any error either.
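
Two things in the posted code can be checked without knowing the root cause. First, the CUDA runtime reports kernel errors only when the host asks for them. Second, `outputvars` is initialized by the thread with `idx==0` while threads in other blocks may already be running `atomicMin`/`atomicMax` on it, and `__syncthreads()` synchronizes only the threads of one block. The sketch below sets the four slots from the host instead and checks for errors after the launch. `CUDA_CHECK`, `initOutputVars`, `checkKernel` and `dOutputvars` are hypothetical names, and whether either point explains the “-4” is an assumption, not something the post establishes.

```cuda
#include <climits>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: set the min/max slots from the host, so no device
// thread has to initialize them while other blocks are updating them.
void initOutputVars(int* dOutputvars) {
    const int init[4] = {INT_MAX, INT_MIN, INT_MAX, INT_MIN};
    CUDA_CHECK(cudaMemcpy(dOutputvars, init, sizeof(init),
                          cudaMemcpyHostToDevice));
}

// Hypothetical helper: call right after a kernel launch.
void checkKernel() {
    CUDA_CHECK(cudaGetLastError());       // invalid launch configuration etc.
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran
}
```

In the host code this would mean calling `initOutputVars` on the device pointer before launching `convolution`, dropping the `if (idx==0)` block from the kernel, and calling `checkKernel()` right after the launch. Running the program under `compute-sanitizer` is another way to surface memory errors that otherwise pass silently.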