Not reading all values from an array

I’m trying to do something simple in CUDA:

  1. Copy a small array to shared memory.
  2. Have each thread calculate the sum of the array and store it in a variable inside the kernel.
        void convolution(int* data,int* buff,float* kernel,int* outputvars,int rows,int cols,int channels,int kerneldim){
        	int idx = getGlobalIdx_3D_3D();
        	int kernelmid;
        	extern __shared__ float sharedKernel[];
        	float *kernelCenter;
        	if (idx==0){
        	if (getblockthreadIdx()<kerneldim*kerneldim){ //here i copy the values from the global memory to the shared memory.
        	kernelmid = kerneldim%2==1?kerneldim/2:(kerneldim-1)/2;
        	if (getblockthreadIdx()<kerneldim*kerneldim){
        		printf("%d %f\n",getblockthreadIdx(),*(sharedKernel+getblockthreadIdx()));
        	int row = idx / (cols*channels);
        	int col = (idx%(cols*channels))/channels;
        	float value=0;
        	int pixel=0;
        	float kernelVal=0;
        	int pixelmin=INT_MAX,pixelmax=INT_MIN;
        	int kernelmidHalf=(kerneldim/2);
        	if (col>0 && row>0 && row<rows-1 && col<cols-1){
        		data = data+idx;
        		for(int r = (-1*kernelmidHalf); r<=kernelmidHalf;r++){
        			for(int c = -1*kernelmidHalf; c<=kernelmidHalf;c++){

and kernel has these values:

float kernel[3][3]={

I know this is not optimized code; I’m just translating some convolution code to CUDA before optimizing it.

The problem is that when I print value I get “-4” when I should get “0”. For some reason the first “column” (the last one added) isn’t being added. The funny part is that if I add a simple printf inside the second loop, everything works fine. I also know that both loops are working fine because when I replace kernelVal with:


I get “36”, which is the sum of the numbers from 1 to 8. I assume it could be a race condition, but I still don’t know how to solve it, and CUDA doesn’t report any error either.

After writing this I decided to make a small change to the code, and it seems to be working now. I replaced the for loop conditions from:

for(int r = (-1*kernelmidHalf); r<=kernelmidHalf;r++){
        			for(int c = -1*kernelmidHalf; c<=kernelmidHalf;c++){

to:

for(int r = (-1*kernelmidHalf); r<(kernelmidHalf+1);r++){
        			for(int c = -1*kernelmidHalf; c<(kernelmidHalf+1);c++){

and it’s working. It looks like “less than or equal” fails for some reason in CUDA; it’s as if the “equal” part isn’t taken into account. I still don’t know why it worked when I put a printf inside the loop.
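For integer loop counters, `r<=kernelmidHalf` and `r<(kernelmidHalf+1)` describe exactly the same iterations, so the change in behaviour probably comes from somewhere else (timing or code generation, for instance). A minimal sketch to check this on the device follows; the kernel name `countIterations`, the single-thread launch and the value `kernelmidHalf = 1` are assumptions for illustration, not code from the post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: counts the iterations of both loop variants.
__global__ void countIterations(int kernelmidHalf, int* counts)
{
    int withLessEqual = 0;
    int withLessThan = 0;

    // Variant 1: bounds written with <=
    for (int r = -kernelmidHalf; r <= kernelmidHalf; r++)
        for (int c = -kernelmidHalf; c <= kernelmidHalf; c++)
            withLessEqual++;

    // Variant 2: bounds written with < (kernelmidHalf + 1)
    for (int r = -kernelmidHalf; r < (kernelmidHalf + 1); r++)
        for (int c = -kernelmidHalf; c < (kernelmidHalf + 1); c++)
            withLessThan++;

    counts[0] = withLessEqual;
    counts[1] = withLessThan;
}

int main()
{
    int* dCounts = nullptr;
    int hCounts[2] = {0, 0};

    cudaMalloc(&dCounts, sizeof(hCounts));
    countIterations<<<1, 1>>>(1, dCounts);   // kernelmidHalf = 1, i.e. a 3x3 filter

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hCounts, dCounts, sizeof(hCounts), cudaMemcpyDeviceToHost);
    printf("%d %d\n", hCounts[0], hCounts[1]);   // prints "9 9"
    cudaFree(dCounts);
    return hCounts[0] == hCounts[1] ? 0 : 1;
}
```

Both variants visit all 9 taps of a 3x3 filter, which suggests the missing values have another cause, such as reading shared memory before it has been filled.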

__syncthreads() synchronizes only inside a single thread block. read f.e.

Note that thread execution order may be arbitrary: the thread with idx=0 may start only after all threads with larger indexes have finished. So you can’t use it to initialize shared global variables. Use a separate kernel or a CUDA memcpy instead.
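Because `__syncthreads()` is a barrier for one block only, a common pattern is to let every block fill its own copy of the shared array and then synchronize before anyone reads it. The following is a minimal sketch of that pattern, not the poster's kernel: the names `sumFilter`, `filter` and `out`, the 1D launch and the filter values 1 to 9 are all assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block stages the filter in its own shared memory,
// then every thread sums the staged values.
__global__ void sumFilter(const float* filter, float* out, int kerneldim, int n)
{
    extern __shared__ float sharedKernel[];
    int count = kerneldim * kerneldim;

    // Block-stride copy: correct even if the block has fewer threads than taps.
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedKernel[i] = filter[i];

    // Every thread of the block reaches this barrier (no early return above).
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float value = 0.0f;
        for (int i = 0; i < count; ++i)
            value += sharedKernel[i];
        out[idx] = value;
    }
}

int main()
{
    constexpr int kerneldim = 3, count = kerneldim * kerneldim;
    constexpr int threads = 64, blocks = 4, n = threads * blocks;
    float hFilter[count] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   // placeholder values, sum = 45
    float hOut[n];
    float *dFilter = nullptr, *dOut = nullptr;

    cudaMalloc(&dFilter, sizeof(hFilter));
    cudaMalloc(&dOut, sizeof(hOut));
    cudaMemcpy(dFilter, hFilter, sizeof(hFilter), cudaMemcpyHostToDevice);

    // Third launch parameter: size in bytes of the dynamic shared array.
    sumFilter<<<blocks, threads, count * sizeof(float)>>>(dFilter, dOut, kerneldim, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != 45.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(dFilter);
    cudaFree(dOut);
    return mismatches == 0 ? 0 : 1;
}
```

The point of the design is that no block depends on a thread from another block: a copy guarded by a global `idx==0` test would only ever fill the shared memory of the block that contains that thread.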

Hi bulat, you are right. I’m initializing that variable outside the kernel now.
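One possible way to do the initialization outside the kernel, sketched under assumptions: the filter is kept in `__constant__` memory and filled from the host with `cudaMemcpyToSymbol` before the launch. Constant memory is a swapped-in choice here, not necessarily what was done in this thread, and the names `constKernel`, `sumConstKernel`, `MAX_TAPS` and the values 1 to 9 are placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_TAPS 49   // hypothetical upper bound: filters up to 7x7

// Filled by the host before any kernel runs, so no in-kernel initialization
// or ordering between threads is needed.
__constant__ float constKernel[MAX_TAPS];

__global__ void sumConstKernel(float* out, int kerneldim, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    float value = 0.0f;
    for (int i = 0; i < kerneldim * kerneldim; ++i)
        value += constKernel[i];
    out[idx] = value;
}

int main()
{
    constexpr int kerneldim = 3, n = 256, threads = 128;
    float hFilter[kerneldim * kerneldim] = {1, 2, 3, 4, 5, 6, 7, 8, 9};   // placeholder values
    float hOut[n];
    float* dOut = nullptr;

    // Host-side initialization of the device constant array.
    cudaMemcpyToSymbol(constKernel, hFilter, sizeof(hFilter));

    cudaMalloc(&dOut, sizeof(hOut));
    sumConstKernel<<<(n + threads - 1) / threads, threads>>>(dOut, kerneldim, n);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);
    printf("out[0] = %.1f, out[%d] = %.1f\n", hOut[0], n - 1, hOut[n - 1]);

    cudaFree(dOut);
    return (hOut[0] == 45.0f && hOut[n - 1] == 45.0f) ? 0 : 1;
}
```

Since all threads read the same element at the same time, constant memory tends to suit small convolution filters, though whether it beats shared memory for a given kernel is something to measure rather than assume.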