Hi,
This is a sub-problem of a larger problem, but I have been stuck on it for quite a long time and cannot get the right output. Please help me correct my mistakes and suggest the right solution.
Problem
Assume a square matrix as input:
Square Matrix
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Output Matrix
5
9 10
13 14 15
Output sum = 66 (5 + 9 + 10 + 13 + 14 + 15)
I am trying to print the sum of the elements of the square matrix below the diagonal (5, 9, 10, 13, 14, 15), but somehow I am not able to do so. First I converted the matrix into a 1D array in row-major order, so element (i, j) lives at index i*4 + j, and wrote the function for the host. But when I move to threads (the kernel), the result is not correct.
Please indicate what I am doing wrong?
My solution:
void HostFunction(int *h_A, int *h_C, int *h_bC) {
    int sum = 0;
    for (int i = 0; i < 4; i++) {
        // columns 0 .. i-1 of row i lie strictly below the diagonal
        for (int index = 0; index < i; index++) {
            sum += h_A[i * 4 + index];  // row-major: element (i, index)
        }
    }
    *h_bC = sum;
}
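For reference, this is roughly how I drive it on the host (the main below is just a sketch; it is the example matrix above, flattened row-major):

#include <stdio.h>

int main(void) {
    int h_A[16] = { 1,  2,  3,  4,
                    5,  6,  7,  8,
                    9, 10, 11, 12,
                   13, 14, 15, 16 };  // the 4x4 example, row-major
    int h_bC = 0;
    HostFunction(h_A, NULL, &h_bC);   // h_C is not used by this function
    printf("Sum = %d\n", h_bC);       // prints 66
    return 0;
}

The host path gives the expected 66; it is the kernel below that goes wrong.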
__global__ void Kernel(int *d_A, int *d_C, int *d_bC) {
    int sum = 0;
    int i = blockIdx.y * blockDim.y + threadIdx.y;      // row index
    int index = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    for (int z = 0; z < 4; z++) {
        sum += d_C[(i * 4) + z];  // each thread sums all 4 elements of row i
    }
    *d_bC = sum;
    printf("\nSum= %d", *d_bC);
}
when you call Kernel<<<x,y>>>(…), CUDA creates x*y threads, each running independently. So each thread sums 4 elements (looping only through z) and prints the sum of those 4 elements.
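to make it concrete, suppose you launched it with a 4x4 block (a guess, since you didn't post your launch configuration):

dim3 block(4, 4);
Kernel<<<1, block>>>(d_A, d_C, d_bC);  // 16 threads, one per matrix element
// all 16 threads run the z-loop and then execute `*d_bC = sum;`
// those 16 writes race, so *d_bC holds whichever partial row sum landed last

the partial sums are never combined, which is why the printed result is wrong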
also, you can use the CODE tag (last icon in the toolbar above the edit box) to format code you are posting here
So, in that case, please suggest a way to sum up elements across different threads. In my example I have to sum: 0 elements from row 0 (handled by one thread), then the first element of row 1, then the first two elements of row 2, and so on.
How would I implement this kind of triangular access pattern, which is not tied to a whole row or column?
I will show you simpler pseudo-code that sums up all elements in a vector. You can find such an example in any book dedicated to learning CUDA:
int arr[10];
int sum;
Kernel<<<1,10>>>(arr, &sum);  // pseudo-code: in real CUDA, arr and sum must be device pointers
__global__ void Kernel(int* arr, int* sum) {
// thread index, 0..9
int i = threadIdx.x;
// first thread inits the sum
if (i==0) *sum = 0;
// other threads wait the first thread to finish initialization
__syncthreads();
// then each thread adds its own value to the pool
atomicAdd(sum, arr[i]);
// now the first thread waits until all threads finish the previous statement
__syncthreads();
// first thread prints the result
if (i==0) printf("%d",*sum);
}
as you see, you need atomic operations to work on a shared variable, and a thread barrier (__syncthreads) to synchronize execution. You can use the same technique for multi-threaded CPU programming. If you aren't familiar with it, you can learn it from any CUDA book.
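applied to your triangle problem, it could look something like this (an untested sketch, assuming the whole 4x4 matrix is handled by a single 4x4 block):

__global__ void TriangleSum(int *d_A, int *d_bC) {
    int row = threadIdx.y;
    int col = threadIdx.x;
    if (row == 0 && col == 0) *d_bC = 0;  // one thread clears the result
    __syncthreads();                      // everyone waits for the init
    if (col < row)                        // strictly below the diagonal
        atomicAdd(d_bC, d_A[row * 4 + col]);
}

// launch: TriangleSum<<<1, dim3(4, 4)>>>(d_A, d_bC);

the `col < row` test is what gives you 0 elements from row 0, 1 from row 1, 2 from row 2 and so on, without any row- or column-specific kernels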
I have not gone through atomics yet, but now that I see how they would be used in my program, let me try them and get back.
Thanks.
Please also advise me (tell me what to read in a CUDA book) on how to implement the following behaviours in CUDA (I sketch what I mean below):
- Nested functions on device side.
- Copy an array to another, on the device side. I could not achieve it using cudaMemcpyDeviceToDevice.
Without these, I have to constantly transfer arrays and variables between host and device, which is the cause of the lower performance.
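To be clear about what I mean by the two points above, here is a sketch (the names are mine, not my actual code):

// (1) a __device__ helper called from inside a kernel -- a "nested" device function
__device__ int RowMajor(int row, int col) { return row * 4 + col; }

__global__ void UsesHelper(int *d_A, int *d_out) {
    int row = threadIdx.y, col = threadIdx.x;
    if (col < row)
        atomicAdd(d_out, d_A[RowMajor(row, col)]);
}

// (2) host-side copy from one device buffer to another, with no host round trip
void CopyOnDevice(int *d_dst, const int *d_src) {
    cudaMemcpy(d_dst, d_src, 16 * sizeof(int), cudaMemcpyDeviceToDevice);
}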