Why doesn't this __global__ function work with a 2D grid?

Hi people, my code is:

__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2){
        int tid, sh, linIdx, i, z, j;

        sh=threadIdx.x+blockDim.x*threadIdx.y;
        tid = blockDim.x*blockDim.y*blockIdx.x + gridDim.x* blockIdx.y + sh;

        if (tid<N2){
                linIdx=N2-tid;
                i=int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));
                z=(N+N-1-i)*i;
                j=tid - z/2 + 1 + i;

                if (i==j){
                        i=i-1;
                        j=N-1;
                }

                VettRis[tid].x=Vett[i].x+Vett[j].x;
                VettRis[tid].y=Vett[i].y+Vett[j].y;
                VettRis[tid].z=Vett[i].z+Vett[j].z;
                VettRis[tid].Ene=Vett[i].Ene+Vett[j].Ene;
        }
}

In practice, this kernel sums N numbers pairwise in parallel, without repetition. The code works when I launch the kernel with a 1D grid (example: kernel<<<dim3(65535,1,1),…>>>(…); ). It does not work when I launch the kernel with a 2D grid (example: kernel<<<dim3(65535,1000,1),…>>>(…); ). The problem is that it doesn't sum the last elements of the vectors, even though the number of threads is sufficient to perform the calculation. Thanks a lot!

Is your tid computed properly? I'm a bit confused about how you are creating the index.

That said, the usual way to map a 2D (x,y) position to a unique 1D index (in row-major order) in CUDA is:

	// map from threadIdx/blockIdx to (x,y) position
	int x = threadIdx.x + blockIdx.x * blockDim.x;
	int y = threadIdx.y + blockIdx.y * blockDim.y;

	// calculate the offset of an element into the input array (row-major order)
	int gid = x + y * NUMCOLS;
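To check the arithmetic without a GPU, the same mapping can be written as a plain host function. This is only a sketch: Dim3 is a hypothetical stand-in for CUDA's built-in index types, rowMajorIndex is a made-up name, and numCols plays the role of NUMCOLS above.

```cpp
// Hypothetical stand-in for CUDA's threadIdx / blockIdx / blockDim values.
struct Dim3 { unsigned x, y, z; };

// Host-side emulation of the row-major mapping shown above.
// tIdx, bIdx and bDim correspond to threadIdx, blockIdx and blockDim.
unsigned long rowMajorIndex(Dim3 tIdx, Dim3 bIdx, Dim3 bDim,
                            unsigned long numCols) {
    // global (x, y) position of the thread
    unsigned long x = tIdx.x + (unsigned long)bIdx.x * bDim.x;
    unsigned long y = tIdx.y + (unsigned long)bIdx.y * bDim.y;
    // offset of element (x, y) in a row-major array with numCols columns
    return x + y * numCols;
}
```

For example, thread (1,1) of block (2,3) with 4x2 blocks and 16 columns gives x = 9, y = 7 and index 9 + 7*16 = 121.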

Hope this helps.

I don't have a matrix but a vector, and I sum the elements of this vector in pairs without repetition. Example:

VettRis[0]=Vett[0]+Vett[1];
VettRis[1]=Vett[0]+Vett[2];
VettRis[2]=Vett[0]+Vett[3];
VettRis[i]=Vett[0]+Vett[N-1];
VettRis[i+1]=Vett[1]+Vett[2];
VettRis[i+2]=Vett[1]+Vett[3];
VettRis[j]=Vett[1]+Vett[N-1];
VettRis[j+1]=Vett[2]+Vett[3];
VettRis[j+2]=Vett[2]+Vett[4];

etc.
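The ordering in this example can be written as a small sequential reference function, which is handy for checking what a kernel computes for a given output index. It is a sketch, not the closed-form formula used in the kernel: pairFromIndex is a hypothetical helper name, and it assumes the number of outputs is N*(N-1)/2 and that k is a valid output index.

```cpp
#include <utility>

// Maps the linear output index k to the pair (i, j) with i < j, in the order
// (0,1), (0,2), ..., (0,N-1), (1,2), (1,3), ...
// Assumption: 0 <= k < N*(N-1)/2.
std::pair<unsigned long, unsigned long> pairFromIndex(unsigned long k,
                                                      unsigned long N) {
    unsigned long i = 0;
    unsigned long rowLen = N - 1;  // number of pairs whose first element is i
    while (k >= rowLen) {          // skip whole rows until k falls inside row i
        k -= rowLen;
        ++i;
        --rowLen;
    }
    return {i, i + 1 + k};         // k is now the offset inside row i
}
```

With N = 5, index 0 maps to (0,1), index 3 to (0,4), index 4 to (1,2) and the last index, 9, to (3,4).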

By a 2D grid I mean launching the kernel with the configuration: kernel<<<dim3(x,y,1),…>>>(…);

I hope this is clearer.

pQB's comment is correct: your formula for tid is wrong. You probably meant to write

        sh=threadIdx.x+blockDim.x*threadIdx.y;
        tid = blockDim.x*blockDim.y * (blockIdx.x + gridDim.x* blockIdx.y) + sh;

even though that would be an unusual mapping.
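To see why the missing brackets matter, both formulas can be emulated on the host and the distinct ids counted over a whole launch. This is a sketch under assumptions: Dim3 is a hypothetical stand-in for CUDA's built-in types, and tidOriginal, tidFixed and countDistinct are made-up names. Without the brackets, blockIdx.y is scaled only by gridDim.x instead of by whole blocks, so different threads get the same tid and the highest indices are never produced.

```cpp
#include <set>

// Hypothetical stand-in for CUDA's threadIdx / blockIdx / blockDim / gridDim.
struct Dim3 { unsigned x, y, z; };

// Formula from the question: blockIdx.y is multiplied only by gridDim.x.
unsigned long tidOriginal(Dim3 t, Dim3 b, Dim3 bDim, Dim3 gDim) {
    unsigned long sh = t.x + bDim.x * t.y;
    return (unsigned long)bDim.x * bDim.y * b.x
         + (unsigned long)gDim.x * b.y + sh;
}

// Corrected formula: the linear block index is multiplied by the block size.
unsigned long tidFixed(Dim3 t, Dim3 b, Dim3 bDim, Dim3 gDim) {
    unsigned long sh = t.x + bDim.x * t.y;
    return (unsigned long)bDim.x * bDim.y
         * (b.x + (unsigned long)gDim.x * b.y) + sh;
}

// Counts how many distinct ids a formula produces over the whole launch.
template <typename F>
unsigned long countDistinct(F tid, Dim3 bDim, Dim3 gDim) {
    std::set<unsigned long> seen;
    for (unsigned by = 0; by < gDim.y; ++by)
        for (unsigned bx = 0; bx < gDim.x; ++bx)
            for (unsigned ty = 0; ty < bDim.y; ++ty)
                for (unsigned tx = 0; tx < bDim.x; ++tx) {
                    Dim3 t{tx, ty, 0};
                    Dim3 b{bx, by, 0};
                    seen.insert(tid(t, b, bDim, gDim));
                }
    return seen.size();
}
```

For a 3x2 grid of 4x2 blocks (48 threads), countDistinct(tidFixed, ...) gives 48, while countDistinct(tidOriginal, ...) gives only 27. With a 1D grid (gridDim.y = 1) blockIdx.y is always 0, so both formulas agree, which matches the observation that the 1D launch works.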

And there is no point in using a 2D block when you only ever use the flattened 1D tid. You can just as well use a 1D block from the beginning.
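With a 1D block the index arithmetic gets shorter, because the thread offset inside the block is just threadIdx.x. A host-side sketch of that variant follows; Dim3 and tid1DBlock are hypothetical names, and the 2D grid is kept as in the question.

```cpp
// Hypothetical stand-in for CUDA's threadIdx / blockIdx / blockDim / gridDim.
struct Dim3 { unsigned x, y, z; };

// Global thread id for a 1D block (blockDim.y == 1) inside a 2D grid.
unsigned long tid1DBlock(Dim3 t, Dim3 b, Dim3 bDim, Dim3 gDim) {
    // linear block index, row-major over the grid
    unsigned long blockId = b.x + (unsigned long)gDim.x * b.y;
    // whole blocks before this one, plus the offset inside the block
    return blockId * bDim.x + t.x;
}
```

For example, thread 5 of block (2,3) in a grid that is 10 blocks wide, with 256 threads per block, gets id (2 + 10*3)*256 + 5 = 8197.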

Thanks a lot tera, the problem was the brackets! THANK YOU, THANK YOU, THANK YOU!!! :D