__global__ void sumVect(QVECT *Vett, QVECT *VettRis, unsigned long int N, unsigned long int N2)
{
    int tid, sh, linIdx, i, z, j;
    sh  = threadIdx.x + blockDim.x * threadIdx.y;           // thread index within the block
    tid = blockDim.x * blockDim.y * blockIdx.x + gridDim.x * blockIdx.y + sh;
    if (tid < N2) {
        // recover the pair (i,j) from the linear index tid
        linIdx = N2 - tid;
        i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));
        z = (N + N - 1 - i) * i;
        j = tid - z / 2 + 1 + i;
        if (i == j) {                                       // diagonal element: remap
            i = i - 1;
            j = N - 1;
        }
        VettRis[tid].x   = Vett[i].x   + Vett[j].x;
        VettRis[tid].y   = Vett[i].y   + Vett[j].y;
        VettRis[tid].z   = Vett[i].z   + Vett[j].z;
        VettRis[tid].Ene = Vett[i].Ene + Vett[j].Ene;
    }
}

In practice, this kernel sums N numbers in parallel, in pairs without repetition. The code works when the kernel is launched with a 1D grid (for example: kernel<<<dim3(65535,1,1),…>>>(…);). However, it does not work when the kernel is launched with a 2D grid (for example: kernel<<<dim3(65535,1000,1),…>>>(…);). The problem is that the last elements of the vectors are never summed, even though the total number of threads is sufficient to perform the calculation. Thanks a lot!

Is your tid created properly? I'm a bit confused about how you are creating the index.

That said, the usual way in CUDA to map a 2D (x,y) thread position to a unique 1D index (in row-major order) is:

// map from threadIdx/BlockIdx to (x,y) position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// calculate the offset of an element into the input array (row-major order)
int gid = x + y * NUMCOLS;

And there is no point in using a 2D block when you only ever use the flattened 1D tid. You might just as well use a 1D block from the start.