Taking the sum over the last 2 dimensions of a 3D array A

Hi,

I am trying to take the sum over the last 2 dimensions of a 3-dimensional array A, that is, the sum over the irot and ik dimensions. I am not sure that I am doing this right. Help would be appreciated.
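In host code, the result I am after would be computed like this (a plain serial sketch, using the same A layout that the kernel indexes with):

/* host-side sketch of the intended reduction: for each ipart,
   sum A over irot and ik */
for ( int ipart = 0 ; ipart < npart ; ipart++ ) {
  float s = 0.0f;
  for ( int ik = 0 ; ik < nk ; ik++ )
    for ( int irot = 0 ; irot < nrot ; irot++ )
      s += A[(irot+nrot*ik)*npart+ipart];
  h_r_vec[ipart] = s;
}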

I have attached the code for the kernel below.

Thanks

/* summing the 3D matrix in the nrot and nk dimensions */
extern "C" __global__ void
psum_23D( float *A, float *partial, float *temp, int npart, int nrot, int nk) {

  int ipart = blockIdx.x * blockDim.x + threadIdx.x;
  int  irot = blockIdx.y * blockDim.y + threadIdx.y;
  int    ik = blockIdx.z * blockDim.z + threadIdx.z;

  int sharedIndex_x = threadIdx.x;
  int sharedIndex_y = threadIdx.y;
  int sharedIndex_z = threadIdx.z;
  //allocating the shared memory array
  __shared__ float shared_A[16][16][4];
  __shared__ float temp_A[16];

  temp[ipart] = 0.0;
  while ( ipart < npart ){
    while ( irot < nrot ){
      while ( ik < nk ){
        temp_A[sharedIndex_x] += A[(irot+nrot*ik)*npart+ipart];
        ik += blockDim.z * gridDim.z;
      }
      irot += blockDim.y * gridDim.y;
    }
    ipart += blockDim.x * gridDim.x;
  }
  shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] = temp_A[sharedIndex_x];
  //synchronizing the threads
  __syncthreads();

  //doing the 1/2 reduction sum at each step
  int j = blockDim.y / 2.0;
  while ( j != 0 ) {
    if ( sharedIndex_y < j ) {
      int k = blockDim.z / 2.0;
      while ( k != 0 ) {
        if ( sharedIndex_z < k ) shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] += 
                                 shared_A[sharedIndex_x][sharedIndex_y + j][sharedIndex_z + k];
        __syncthreads();
        k /= 2;
      }
    }
    __syncthreads();
    j /= 2;
  }


  if ( sharedIndex_y == 0 ) {
    if ( sharedIndex_z == 0 ) {
      partial[ipart * npart + blockIdx.x] = shared_A[sharedIndex_x][0][0];
    }
  }

}
temp_A[sharedIndex_x] += A[(irot+nrot*ik)*npart+ipart];

Since this line may be executed by multiple threads at once, a data race may occur. It would be better to change it to:

atomicAdd(&temp_A[sharedIndex_x], A[(irot+nrot*ik)*npart+ipart]);

Since sharedIndex_x is unique for each thread in the block (sharedIndex_x = threadIdx.x) and temp_A is in shared memory, which is unique per threadblock, how could a data race occur there?

oops… you’re right.

Another questionable point: is shared memory guaranteed to be initialized to 0?

NO, shared memory is not default-initialized, and the code appears to have a deficiency there.
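For example, a minimal initialization sketch (one thread per temp_A slot writes the zero, then a barrier so every thread sees it before accumulating):

// sketch: a single thread per slot zeroes the shared accumulator,
// then the block synchronizes before any thread adds into it
if ( threadIdx.y == 0 && threadIdx.z == 0 )
  temp_A[threadIdx.x] = 0.0f;
__syncthreads();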

And actually, you are correct. I missed that the threadblock structure here appears to be multi-dimensional, and so multiple threads in the block could indeed have the same threadIdx.x. So your statements are all correct, mine are wrong.
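So in this kernel the accumulation would indeed need the atomic update suggested above, something like:

// sketch: threads that share threadIdx.x but differ in threadIdx.y/z
// all target the same temp_A slot, so the add must be atomic
atomicAdd(&temp_A[sharedIndex_x], A[(irot+nrot*ik)*npart+ipart]);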

Yes, temp_A[sharedIndex_x] should be initialised to 0 before the sum. I have corrected that:

/* summing the 3D matrix in the nrot and nk dimensions */
extern "C" __global__ void
psum_23D( float *A, float *partial, float *temp, int npart, int nrot, int nk) {

  int ipart = blockIdx.x * blockDim.x + threadIdx.x;
  int  irot = blockIdx.y * blockDim.y + threadIdx.y;
  int    ik = blockIdx.z * blockDim.z + threadIdx.z;

  int sharedIndex_x = threadIdx.x;
  int sharedIndex_y = threadIdx.y;
  int sharedIndex_z = threadIdx.z;
  //allocating the shared memory array
  __shared__ float shared_A[16][16][4];
  __shared__ float temp_A[16];

  temp_A[sharedIndex_x] = 0.0;
  while ( ipart < npart ){
    while ( irot < nrot ){
      while ( ik < nk ){
        temp_A[sharedIndex_x] += A[(irot+nrot*ik)*npart+ipart];
        ik += blockDim.z * gridDim.z;
      }
      irot += blockDim.y * gridDim.y;
    }
    ipart += blockDim.x * gridDim.x;
  }
  shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] = temp_A[sharedIndex_x];
  //synchronizing the threads
  __syncthreads();

  //performing the 1/2 reduction sum at each step
  int j = blockDim.y / 2.0;
  while ( j != 0 ) {
    if ( sharedIndex_y < j ) {
      int k = blockDim.z / 2.0;
      while ( k != 0 ) {
        if ( sharedIndex_z < k ) shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] +=
                                   shared_A[sharedIndex_x][sharedIndex_y + j][sharedIndex_z + k];
        __syncthreads();
        k /= 2;
      }
    }
    __syncthreads();
    j /= 2;
  }

  //partial[blockIdx.x] = 1.0;

  if ( sharedIndex_y == 0 ) {
    if ( sharedIndex_z == 0 ) {
      partial[ipart + npart * blockIdx.y] = shared_A[sharedIndex_x][0][0];
    }
  }

}

The second part of the code, where I do the 1/2 reduction in the 2nd and 3rd dimensions, is the part I am not sure is correct. And also this code:

if ( sharedIndex_y == 0 ) {
  if ( sharedIndex_z == 0 ) {
    partial[ipart + npart * blockIdx.y] = shared_A[sharedIndex_x][0][0];
  }
}
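Would it be better to flatten the (y,z) pair into a single index and do one 1D reduction, so that __syncthreads() is never inside a divergent branch? A sketch of what I mean (assuming blockDim.y * blockDim.z is a power of two, here 16*4 = 64):

// sketch: flatten (y,z) into one index tid; every thread reaches
// the barrier because it sits outside the if
int tid = sharedIndex_y * blockDim.z + sharedIndex_z;
for ( int s = (blockDim.y * blockDim.z) / 2 ; s > 0 ; s /= 2 ) {
  if ( tid < s ) {
    int y0 = tid / blockDim.z,       z0 = tid % blockDim.z;
    int y1 = (tid + s) / blockDim.z, z1 = (tid + s) % blockDim.z;
    shared_A[sharedIndex_x][y0][z0] += shared_A[sharedIndex_x][y1][z1];
  }
  __syncthreads();
}
// afterwards shared_A[sharedIndex_x][0][0] holds the sum over y and z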

I have allocated partial as

int threadsPerBlock_npart = 16;
int blocksPerGrid_npart = npart/16.0 + (npart%16 != 0);  // ceiling of npart/16

int size_p_c_npart = npart * blocksPerGrid_npart * sizeof(float);
// summed value
float *d_partial_c_npart = NULL;
err = cudaMalloc((void**)&d_partial_c_npart, size_p_c_npart);

So the size should be OK, but then again I may have missed something. The final outcome I am trying to get is an array that looks like partial[npart][blocksPerGrid_npart], which I then sum with:

for ( int ipart = 0 ; ipart < npart ; ipart++ ) {
  h_r_vec[ipart] = 0.0;
  for ( int igrid = 0 ; igrid < blocksPerGrid_npart ; igrid++ ) {
    h_r_vec[ipart] += partial_c_npart[ipart + npart * igrid];
  }
}

It is summing, but the result is not correct, probably because the array entries get misplaced.
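One thing worth checking: by the time of the final write, the while loop has already advanced ipart past npart (every thread leaves the loop with ipart >= npart), so partial[ipart + npart * blockIdx.y] lands outside the partial[npart][blocksPerGrid_npart] layout that the host loop reads. A sketch of a write that keeps the thread's original index and matches the host indexing (assuming the x-grid is the one sized by blocksPerGrid_npart):

// sketch: save the first ipart this thread covers, before the
// stride loop modifies it
int ipart0 = blockIdx.x * blockDim.x + threadIdx.x;
// ... accumulation and reduction as before ...
if ( sharedIndex_y == 0 && sharedIndex_z == 0 && ipart0 < npart ) {
  // the host sums partial[ipart + npart * igrid] over the x-grid
  // blocks, so the block index here should be blockIdx.x
  partial[ipart0 + npart * blockIdx.x] = shared_A[sharedIndex_x][0][0];
}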