Unable to access the correct matrix elements through threads

nixcoder · May 26, 2017, 10:08pm

Hi,

This is a sub-problem of a larger problem, but I have been stuck on it for a considerable time and cannot get the right output. Please help me correct my mistakes and suggest the right solution.

Problem
Assume a square matrix, as input

Square Matrix

1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16

Output Matrix

5
9 10
13 14 15

Output sum= 66

I am trying to print the sum of the elements of the square matrix that lie below the diagonal (5, 9, 10, 13, 14, 15), but somehow I am not able to do so. First I converted the matrix into a 1D array (row-major order) to write the function for the host, but when I use threads (the kernel), the result is not correct.
Please point out what I am doing wrong.

My solution:

void HostFunction(int *h_A, int *h_C, int *h_bC) {

	int sum = 0, index;
	for (int i = 0; i < 4; i++){
		for (index = 0; index < i; index++) {
			if (h_C[i * 4 + index] == 1) {  // element (i, index) of the row-major 1D array
				sum++;

			}
		}
	}

	*h_bC = sum;

}

__global__ void Kernel(int *d_A, int *d_C, int *d_bC) {

int sum = 0;
int i = blockIdx.y * blockDim.y + threadIdx.y;
int index = blockIdx.x * blockDim.x + threadIdx.x;


for (int z = 0; z < 4; z++) {
	if ((*(d_C + (i * 4) + z) == 1)) {
		sum++;
	}
}

*d_bC = sum;
printf("\nSum= %d", *d_bC);

}

BulatZiganshin · May 27, 2017, 5:31am

when you call Kernel<<<x,y>>>(…), CUDA creates x*y threads, each running independently. so each thread processes only the 4 elements of its own row (it loops only through z) and prints its own result for those 4 elements
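To make the per-thread independence concrete, here is a minimal sketch in which each thread handles one row and stores its own partial result in a separate slot. It sums values rather than counting ones as the posted code does. The names `rowPartialSums` and `belowDiagonalRowSums` are hypothetical, error checking is omitted, and a single block of `n` threads is assumed (so `n` must not exceed the per-block thread limit).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one row and writes only to its own slot of d_partial,
// so no thread depends on another thread's result.
__global__ void rowPartialSums(const int *d_A, int *d_partial, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;

    int sum = 0;
    // Row `row` has `row` elements strictly below the diagonal: columns 0..row-1.
    for (int col = 0; col < row; ++col) {
        sum += d_A[row * n + col];
    }
    d_partial[row] = sum;
}

// Hypothetical host wrapper (error checking omitted for brevity).
// Returns one partial sum per row; adding them up gives the total.
std::vector<int> belowDiagonalRowSums(const std::vector<int> &h_A, int n) {
    size_t matBytes = static_cast<size_t>(n) * n * sizeof(int);
    size_t rowBytes = static_cast<size_t>(n) * sizeof(int);

    int *d_A = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_A, matBytes);
    cudaMalloc(&d_partial, rowBytes);
    cudaMemcpy(d_A, h_A.data(), matBytes, cudaMemcpyHostToDevice);

    // Assumption: one block of n threads, one thread per row.
    rowPartialSums<<<1, n>>>(d_A, d_partial, n);

    std::vector<int> h_partial(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_partial.data(), d_partial, rowBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_partial);
    return h_partial;
}
```

For the 4x4 matrix in the question, the four slots would hold the per-row sums, and the host can add them to get the total. Combining the partial results on the device instead requires some form of coordination between threads.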

also, you can use the CODE tag (last icon in the toolbar above the edit box) to format code you are posting here

nixcoder · May 27, 2017, 12:35pm

So, in that case, please suggest a way to sum up elements that are handled by different threads. In my example, I have to sum 0 elements from row 0 (handled by one thread), then the first element from row 1, then the first 2 elements from row 2, and so on.

How would I implement this kind of simple matrix-like behaviour, which is not specific to one row or column?

BulatZiganshin · May 27, 2017, 2:46pm

i will show you simpler pseudo-code that sums up all elements in a vector. you can find such an example in any book dedicated to learning CUDA:

int arr[10];
int sum;
// pseudo-code: in real code, arr and sum must point to memory
// the GPU can access (e.g. allocated with cudaMalloc)
Kernel<<<1,10>>>(arr,&sum);

__global__ void Kernel(int* arr, int* sum) {
  // thread index, 0..9
  int i = threadIdx.x;

  // first thread inits the sum
  if (i==0) *sum = 0;

  // other threads wait for the first thread to finish initialization
  __syncthreads();

  // then each thread adds its own value to the pool
  atomicAdd(sum, arr[i]);

  // now the first thread waits until all threads have finished the previous statement
  __syncthreads();

  // first thread prints the result
  if (i==0) printf("%d",*sum);
}

as you see, you need to use atomic operations to work on a shared variable, and a thread barrier (__syncthreads, which synchronizes the threads of one block) to synchronize execution. You can use the same technique for multi-threaded CPU programming. If you aren’t familiar with this technique, you can learn it from any CUDA book
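Applied to the original question, the same atomic technique might look like the sketch below: one thread per matrix element, and only the threads strictly below the diagonal add their value to the shared total. The names `belowDiagonalSum` and `sumBelowDiagonal`, the 16x16 block size, and the omission of error checking are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// One thread per matrix element. Threads whose element lies strictly below
// the diagonal (col < row) atomically add it to the shared total *d_sum.
__global__ void belowDiagonalSum(const int *d_A, int *d_sum, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < row) {
        atomicAdd(d_sum, d_A[row * n + col]);
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
int sumBelowDiagonal(const int *h_A, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(int);

    int *d_A = nullptr, *d_sum = nullptr;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_sum, sizeof(int));
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);

    // Zero the accumulator from the host before the launch, so the kernel
    // needs no in-kernel initialization step or barrier.
    cudaMemset(d_sum, 0, sizeof(int));

    // Assumed launch shape: 16x16 threads per block, enough blocks to cover n x n.
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    belowDiagonalSum<<<grid, block>>>(d_A, d_sum, n);

    int h_sum = 0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_sum, d_sum, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_A);
    cudaFree(d_sum);
    return h_sum;
}
```

Calling `sumBelowDiagonal(h_A, 4)` on the row-major array 1..16 from the question should give 66. Initializing the total with `cudaMemset` rather than inside the kernel is a deliberate choice here: `__syncthreads` only synchronizes threads within one block, so an in-kernel "first thread inits the sum" step is only safe for single-block launches.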

nixcoder · May 27, 2017, 9:06pm

I have not studied atomics yet, but now that I see how useful they would be in my program, let me try using them and get back.
Thanks.

nixcoder · May 27, 2017, 10:11pm

Please also advise me (or tell me what to read in a CUDA book) on how to implement the following in CUDA:

Nested functions on the device side.
Copying one array to another on the device side. I could not achieve it using cudaMemcpyDeviceToDevice.

Without the above, I have to constantly transfer arrays and variables between host and device, which lowers performance.
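A minimal sketch of both points, under one reading of the question: "nested functions" is taken to mean `__device__` functions called from a kernel (and from each other), and the array copy uses `cudaMemcpy` with `cudaMemcpyDeviceToDevice`, which is issued from the host but moves data only within device memory. All function names here (`square`, `squarePlusOne`, `transformKernel`, `copyAndTransform`) are hypothetical, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// A __device__ function runs on the GPU and can be called from a kernel
// or from another __device__ function.
__device__ int square(int x) { return x * x; }

__device__ int squarePlusOne(int x) { return square(x) + 1; }

__global__ void transformKernel(const int *d_in, int *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_out[i] = squarePlusOne(d_in[i]);
    }
}

// Hypothetical host wrapper (error checking omitted for brevity).
// Uploads h_in once, duplicates it on the device, transforms the duplicate,
// and downloads the result once.
void copyAndTransform(const int *h_in, int *h_out, int n) {
    size_t bytes = static_cast<size_t>(n) * sizeof(int);

    int *d_src = nullptr, *d_copy = nullptr, *d_out = nullptr;
    cudaMalloc(&d_src, bytes);
    cudaMalloc(&d_copy, bytes);
    cudaMalloc(&d_out, bytes);

    cudaMemcpy(d_src, h_in, bytes, cudaMemcpyHostToDevice);

    // Device-to-device copy: both pointers are device pointers,
    // and the data never passes through host memory.
    cudaMemcpy(d_copy, d_src, bytes, cudaMemcpyDeviceToDevice);

    transformKernel<<<(n + 255) / 256, 256>>>(d_copy, d_out, n);

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_src);
    cudaFree(d_copy);
    cudaFree(d_out);
}
```

In this sketch the arrays stay on the device between the copy and the kernel launch, so the only host/device transfers are the initial upload and the final download.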
