Hi,

This is a sub-problem of a larger problem, but I have been stuck on it for a considerable time and cannot get the right output. Please help me correct my mistakes and suggest the right solution.

Problem

Assume a square matrix as input.

Square Matrix

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

Output Matrix

5

9 10

13 14 15

Output sum = 66

I am trying to print the sum of the elements of a square matrix that lie below the diagonal (5, 9, 10, 13, 14, 15), but I have not been able to do so. First I converted the matrix into a 1D array (row-major order) to write the host function, but when I write the threaded version (the kernel), the result is not correct.

Could you please point out what I am doing wrong?

My solution:

```
void HostFunction(int *h_A, int *h_C, int *h_bC) {
    int sum = 0, index;
    for (int i = 0; i < 4; i++) {
        // Columns 0 .. i-1 of row i lie below the diagonal.
        for (index = 0; index < i; index++) {
            // Element (i, index) of the row-major 1D array
            // (h_C is assumed here, mirroring d_C in the kernel).
            if (h_C[i * 4 + index] == 1) {
                sum++;
            }
        }
    }
    *h_bC = sum;
}
```
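Note that the host loop above counts the entries equal to 1 rather than adding the values, so it would not produce 66 for the example matrix. If the goal is the sum described in the problem, a minimal host-side sketch might look like the following. The function name `sumBelowDiagonalHost` and the size parameter `n` are hypothetical and not part of the original code.

```cuda
// Hypothetical helper (not from the original post): adds every element
// strictly below the main diagonal of an n x n row-major matrix.
int sumBelowDiagonalHost(const int *h_A, int n) {
    int sum = 0;
    for (int row = 0; row < n; row++) {
        // Columns 0 .. row-1 lie strictly below the diagonal in this row.
        for (int col = 0; col < row; col++) {
            sum += h_A[row * n + col];  // row-major flat index
        }
    }
    return sum;
}
```

For the 4x4 example, this visits the flat indices 4, 8, 9, 12, 13 and 14, which hold 5, 9, 10, 13, 14 and 15.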

```
__global__ void Kernel(int *d_A, int *d_C, int *d_bC) {
    int sum = 0;
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    for (int z = 0; z < 4; z++) {
        if ((*(d_C + (i * 4) + z) == 1)) {
            sum++;
        }
    }
    *d_bC = sum;
    printf("\nSum= %d", *d_bC);
}
```
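A few things in the kernel above look like likely causes of the wrong result. Every thread writes its own private `sum` to the same `*d_bC`, so whichever thread writes last wins. The loop runs over all 4 columns of row `i` instead of only the columns left of the diagonal, and `index` is never used. The test also counts entries equal to 1 instead of adding values. One possible fix, as a minimal sketch only, is to let each thread handle one element and have the threads below the diagonal add their value with `atomicAdd`. The names `sumBelowDiagonalKernel`, `sumBelowDiagonal`, `d_sum` and `n`, and the 16x16 block size, are assumptions for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (illustrative names): one thread per matrix element.
__global__ void sumBelowDiagonalKernel(const int *d_A, int *d_sum, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    // Only in-range threads strictly below the diagonal contribute.
    if (row < n && col < row) {
        // atomicAdd avoids the race of many threads updating one location.
        atomicAdd(d_sum, d_A[row * n + col]);
    }
}

// Hypothetical host wrapper: copies the matrix to the device, launches the
// kernel and returns the sum. Error checking is omitted for brevity.
int sumBelowDiagonal(const int *h_A, int n) {
    int *d_A = nullptr;
    int *d_sum = nullptr;
    int h_sum = 0;
    size_t bytes = static_cast<size_t>(n) * n * sizeof(int);

    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(int));  // the accumulator must start at zero

    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    sumBelowDiagonalKernel<<<grid, block>>>(d_A, d_sum, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_sum);
    return h_sum;
}
```

Called as `sumBelowDiagonal(h_A, 4)` on the 4x4 example, this should return 66. For large matrices, a per-block reduction in shared memory followed by a single `atomicAdd` per block would likely scale better than one atomic per element, but the simple version is easier to check against the host result.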