Unable to access the correct matrix elements through threads


This is a sub-problem of a larger problem, but I have been stuck on it for a considerable time and cannot get the right output. Please help me correct my mistakes and suggest the right solution.

Assume a square matrix, as input

Square Matrix

1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

Output Matrix

5
9 10
13 14 15

Output sum= 66

I am trying to print the sum of the elements of a square matrix that lie below the diagonal (5, 9, 10, 13, 14, 15), but I am not able to do so. First I converted the matrix into a 1D array (row-major order) and wrote the function for the host. That part is fine, but when I use threads (the kernel), the result is not correct.
Please point out what I am doing wrong.

My solution:

void HostFunction(int *h_A, int *h_C, int *h_bC) {

	int sum = 0, index;
	for (int i = 0; i < 4; i++){
		for (index = 0; index < i; index++) {
			if (element in row of matrix (or 1D array of row-major matrix)==1) {


	*h_bC = sum;


__global__ void Kernel(int *d_A, int *d_C, int *d_bC) {

int sum = 0;
int i = blockIdx.y * blockDim.y + threadIdx.y;
int index = blockIdx.x * blockDim.x + threadIdx.x;

for (int z = 0; z < 4; z++) {
	if ((*(d_C + (i * 4) + z) == 1)) {

*d_bC = sum;
printf("\nSum= %d", *d_bC);


When you call Kernel<<<x,y>>>(…), CUDA creates x*y threads (x blocks of y threads each), and every thread runs the kernel body independently. So each thread sums only the 4 elements of its own row (the loop over z), writes that into *d_bC, and prints it. No thread ever sees the total across rows.

Also, you can use the CODE tag (last icon in the toolbar above the edit box) to format the code you are posting here.
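To make the "each thread runs independently" point concrete, here is a minimal sketch in which every thread handles one row, keeps a private sum of the entries to the left of the diagonal, and writes it to its own slot. The names `rowPartialSums`, `d_rowSum` and `lowerTriangleSumByRows` are made up for this illustration and are not from the original post; error checking is omitted for brevity, and a single block of N threads is assumed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // matrix dimension, as in the question

// One thread per row: sum the entries left of the diagonal in that row.
__global__ void rowPartialSums(const int *d_A, int *d_rowSum) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N) return;
    int sum = 0;                          // private to this thread
    for (int col = 0; col < row; col++)   // col < row means below the diagonal
        sum += d_A[row * N + col];        // row-major indexing
    d_rowSum[row] = sum;                  // one slot per thread, so no race
}

// Hypothetical host helper: launches the kernel and adds up the partial sums.
int lowerTriangleSumByRows(const int *h_A) {
    int *d_A, *d_rowSum, h_rowSum[N];
    cudaMalloc(&d_A, N * N * sizeof(int));
    cudaMalloc(&d_rowSum, N * sizeof(int));
    cudaMemcpy(d_A, h_A, N * N * sizeof(int), cudaMemcpyHostToDevice);
    rowPartialSums<<<1, N>>>(d_A, d_rowSum);
    cudaMemcpy(h_rowSum, d_rowSum, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_rowSum);
    int total = 0;
    for (int i = 0; i < N; i++) total += h_rowSum[i];
    return total;
}

int main() {
    int h_A[N * N];
    for (int i = 0; i < N * N; i++) h_A[i] = i + 1;   // 1..16, as in the question
    printf("Sum = %d\n", lowerTriangleSumByRows(h_A));
    return 0;
}
```

For the 4x4 matrix in the question the per-row results are 0, 5, 19 and 42, so on a working GPU this should report 66. The final addition happens on the host here; combining the values on the device needs some form of synchronization between threads.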

So, in that case, please suggest a way to sum up elements handled by different threads. In my example, I have to add 0 elements from row 0 (handled by one thread), then the first element from row 1, then the first 2 elements from row 2, and so on.

How would I implement this kind of simple matrix access pattern, which is not tied to a single row or column?

I will show you simpler pseudo-code that sums up all elements of a vector. You can find such an example in any book dedicated to learning CUDA:

int arr[10];
int sum;

__global__ void Kernel(int* arr, int* sum) {
  // thread index, 0..9 (assumes a single block of 10 threads)
  int i = threadIdx.x;

  // first thread inits the sum
  if (i==0) *sum = 0;

  // other threads wait for the first thread to finish initialization
  __syncthreads();

  // then each thread adds its own value to the pool
  atomicAdd(sum, arr[i]);

  // now the first thread waits until all threads have finished the previous statement
  __syncthreads();

  // first thread prints the result
  if (i==0) printf("%d",*sum);
}

As you see, you need atomic operations to update a shared variable, and a thread barrier (__syncthreads) to synchronize execution. Note that __syncthreads only synchronizes the threads of one block. The same technique applies to multi-threaded CPU programming. If you aren’t familiar with it, you can learn it from any CUDA book.
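Applied to the matrix from the question, the same idea might look like the sketch below. It assumes one thread per matrix element in a 2D block, lets only the threads below the diagonal (col < row) contribute, and zeroes the accumulator from the host with cudaMemset so no in-kernel initialization barrier is needed. The names `lowerTriangleSum`, `d_sum` and `lowerTriangleSumAtomic` are illustrative, and error checking is left out.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 4  // matrix dimension, as in the question

// One thread per element; only threads strictly below the diagonal add their value.
__global__ void lowerTriangleSum(const int *d_A, int *d_sum) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < row)
        atomicAdd(d_sum, d_A[row * N + col]);   // safe concurrent update
}

// Hypothetical host helper wrapping allocation, launch and copy-back.
int lowerTriangleSumAtomic(const int *h_A) {
    int *d_A, *d_sum, h_sum = 0;
    cudaMalloc(&d_A, N * N * sizeof(int));
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_A, h_A, N * N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_sum, 0, sizeof(int));          // accumulator starts at zero
    dim3 block(N, N);                           // 4x4 = 16 threads in one block
    lowerTriangleSum<<<1, block>>>(d_A, d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_sum);
    return h_sum;
}

int main() {
    int h_A[N * N];
    for (int i = 0; i < N * N; i++) h_A[i] = i + 1;   // 1..16
    printf("Sum = %d\n", lowerTriangleSumAtomic(h_A));
    return 0;
}
```

With the 1..16 matrix the contributing elements are 5, 9, 10, 13, 14 and 15, so the expected result is 66. For large matrices a single atomic counter becomes a bottleneck, and a shared-memory reduction is the usual next step.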

I have not studied atomics yet, but now that I see how they would help in my program, let me try them and get back.

Please also advise me (or tell me what to read in a CUDA book) on how to implement the following in CUDA:

  1. Nested functions on device side.
  2. Copy an array to another, on device side. I could not achieve it using cudaMemcpyDeviceToDevice.

Without the above, I have to constantly transfer arrays and variables between host and device, which hurts performance.
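A minimal sketch covering both points, under two assumptions: that "nested functions" means calling helper functions from inside a kernel (done with `__device__` functions), and that the copy is issued from host code with both pointers referring to device memory. The helper names (`square`, `squarePlusOne`, `apply`, `copyThenTransform`) are invented for the example, and error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__ functions run on the GPU and can call each other.
__device__ int square(int x) { return x * x; }
__device__ int squarePlusOne(int x) { return square(x) + 1; }

// The kernel calls the device-side helpers for each element.
__global__ void apply(int *d_data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = squarePlusOne(d_data[i]);
}

// Hypothetical host helper: the data goes to the GPU once, is copied
// device-to-device, transformed on the device, and comes back once.
void copyThenTransform(const int *h_in, int *h_out, int n) {
    int *d_src, *d_dst;
    cudaMalloc(&d_src, n * sizeof(int));
    cudaMalloc(&d_dst, n * sizeof(int));
    cudaMemcpy(d_src, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    // Both pointers are device pointers; the call itself is made from the host.
    cudaMemcpy(d_dst, d_src, n * sizeof(int), cudaMemcpyDeviceToDevice);
    apply<<<1, n>>>(d_dst, n);
    cudaMemcpy(h_out, d_dst, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
}

int main() {
    int h_in[4] = {1, 2, 3, 4}, h_out[4];
    copyThenTransform(h_in, h_out, 4);
    for (int i = 0; i < 4; i++) printf("%d ", h_out[i]);
    printf("\n");
    return 0;
}
```

A common reason a device-to-device copy appears to fail is passing a host pointer as source or destination. Both must come from cudaMalloc (or another device allocation) for cudaMemcpyDeviceToDevice to be valid.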