# Taking the sum over the last 2 dimensions of a 3D array A

Hi,

I am trying to take the sum over the last 2 dimensions of a 3-dimensional array A, that is, the sum over the irot and ik dimensions. I am not sure that I am doing this right. Help will be appreciated.

I have attached the code for the kernel below.

Thanks

```
/* summing the 3D matrix in the nrot and nk dimensions */
extern "C" __global__ void
psum_23D( float *A, float *partial, float *temp, int npart, int nrot, int nk ) {

    int ipart = blockIdx.x * blockDim.x + threadIdx.x;
    int  irot = blockIdx.y * blockDim.y + threadIdx.y;
    int    ik = blockIdx.z * blockDim.z + threadIdx.z;

    // per-thread indices into the shared memory arrays
    int sharedIndex_x = threadIdx.x;
    int sharedIndex_y = threadIdx.y;
    int sharedIndex_z = threadIdx.z;

    // allocating the shared memory arrays
    __shared__ float shared_A[16][16][4];
    __shared__ float temp_A[16];

    temp[ipart] = 0.0;
    while ( ipart < npart ) {
        while ( irot < nrot ) {
            while ( ik < nk ) {
                temp_A[sharedIndex_x] += A[(irot + nrot*ik)*npart + ipart];
                ik += blockDim.z * gridDim.z;
            }
            irot += blockDim.y * gridDim.y;
        }
        ipart += blockDim.x * gridDim.x;
    }
    shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] = temp_A[sharedIndex_x];

    // doing the 1/2 reduction sum at each step
    int j = blockDim.y / 2;
    while ( j != 0 ) {
        if ( sharedIndex_y < j ) {
            int k = blockDim.z / 2;
            while ( k != 0 ) {
                if ( sharedIndex_z < k )
                    shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] +=
                        shared_A[sharedIndex_x][sharedIndex_y + j][sharedIndex_z + k];
                k /= 2;
            }
        }
        j /= 2;
    }

    if ( sharedIndex_y == 0 ) {
        if ( sharedIndex_z == 0 ) {
            partial[ipart * npart + blockIdx.x] = shared_A[sharedIndex_x][0][0];
        }
    }

}
```
1. ```
   temp_A[sharedIndex_x] += A[(irot+nrot*ik)*npart+ipart];
   ```

As this line may be executed by multiple threads at once, a data race may occur.

Since sharedIndex_x is unique for each thread in the block (sharedIndex_x = threadIdx.x) and temp_A is in shared memory, which is unique per threadblock, how could a data race occur there?

oops… you’re right.

Another question: is shared memory guaranteed to be initialized to 0?

NO, shared memory is not default-initialized, and the code appears to have a deficiency there.

And actually, you are correct. I missed that the threadblock structure here appears to be multi-dimensional, and so multiple threads in the block could indeed have the same threadIdx.x. So your statements are all correct, mine are wrong.

Yes, temp_A[sharedIndex_x] should be initialised to 0 before the sum. I corrected that into:

```
/* summing the 3D matrix in the nrot and nk dimensions */
extern "C" __global__ void
psum_23D( float *A, float *partial, float *temp, int npart, int nrot, int nk ) {

    int ipart = blockIdx.x * blockDim.x + threadIdx.x;
    int  irot = blockIdx.y * blockDim.y + threadIdx.y;
    int    ik = blockIdx.z * blockDim.z + threadIdx.z;

    // per-thread indices into the shared memory arrays
    int sharedIndex_x = threadIdx.x;
    int sharedIndex_y = threadIdx.y;
    int sharedIndex_z = threadIdx.z;

    // allocating the shared memory arrays
    __shared__ float shared_A[16][16][4];
    __shared__ float temp_A[16];

    temp_A[sharedIndex_x] = 0.0;
    while ( ipart < npart ) {
        while ( irot < nrot ) {
            while ( ik < nk ) {
                temp_A[sharedIndex_x] += A[(irot + nrot*ik)*npart + ipart];
                ik += blockDim.z * gridDim.z;
            }
            irot += blockDim.y * gridDim.y;
        }
        ipart += blockDim.x * gridDim.x;
    }
    shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] = temp_A[sharedIndex_x];

    // performing the 1/2 reduction sum at each step
    int j = blockDim.y / 2;
    while ( j != 0 ) {
        if ( sharedIndex_y < j ) {
            int k = blockDim.z / 2;
            while ( k != 0 ) {
                if ( sharedIndex_z < k )
                    shared_A[sharedIndex_x][sharedIndex_y][sharedIndex_z] +=
                        shared_A[sharedIndex_x][sharedIndex_y + j][sharedIndex_z + k];
                k /= 2;
            }
        }
        j /= 2;
    }

    //partial[blockIdx.x] = 1.0;

    if ( sharedIndex_y == 0 ) {
        if ( sharedIndex_z == 0 ) {
            partial[ipart + npart * blockIdx.y] = shared_A[sharedIndex_x][0][0];
        }
    }

}
```

The second part of the code, where I do the 1/2 reduction in the 2nd and 3rd dimensions, is where I am not sure that it is correct. And also this code:

```
if ( sharedIndex_y == 0 ) {
    if ( sharedIndex_z == 0 ) {
        partial[ipart + npart * blockIdx.y] = shared_A[sharedIndex_x][0][0];
    }
}
```

I have allocated partial as

```
int threadsPerBlock_npart = 16;
int blocksPerGrid_npart = npart/16.0 + (npart%16 != 0);

int size_p_c_npart = npart * blocksPerGrid_npart * sizeof(float);
// summed value
float *d_partial_c_npart = NULL;
err = cudaMalloc((void**)&d_partial_c_npart, size_p_c_npart);
```

So the size should be ok. But then again I may have missed something. The final outcome I am trying to get is an array that looks like partial[npart][blocksPerGrid_npart], which I then sum on the host:

```
for ( int ipart = 0 ; ipart < npart ; ipart++ ) {
    h_r_vec[ipart] = 0.0;
    for ( int igrid = 0 ; igrid < blocksPerGrid_npart ; igrid++ ) {
        h_r_vec[ipart] += partial_c_npart[ipart + npart * igrid];
    }
}
```

It is summing, but the result is not correct, probably because the array entries get misplaced.