 # Learning by coding recursive sum using dynamic parallelism

Hello all,

I am trying to do a parallel sum using dynamic parallelism. I know it is very inefficient. I’m doing it to learn.

Basically, I am trying to replicate the behavior of the following sequential C code but in CUDA.

```
double rec_sum(double * v, int n){

    if (n <= 1)
        return v[0];

    double sum_left  = rec_sum(v, n/2);
    double sum_right = rec_sum(&v[n/2], n - (n/2));

    return sum_left + sum_right;
}
```

However, I cannot get it to work. Any help? Maybe it is something to do with syncs, shared memory, and that sort of thing?

Here is my FULL CUDA code, which I am compiling with
`nvcc -arch=sm_35 -dc recursive_sum.cu`
I am working with a Tesla K40.
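As a side note on the build line: `-dc` only compiles to an object file, so a separate link step is still needed. For dynamic parallelism the program has to be built with relocatable device code and linked against the device runtime library. A single-step build might look like this (the output name `recursive_sum` is just my assumption):

```shell
# Compile and link in one step; -rdc=true enables relocatable device code,
# and -lcudadevrt links the device runtime needed for child kernel launches.
nvcc -arch=sm_35 -rdc=true recursive_sum.cu -o recursive_sum -lcudadevrt
```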

The code now compiles and runs, but it produces the wrong output:
3.380729 (correct) vs 4.165090 (wrong)

```
#include <stdio.h>

double rec_sum(double * v, int n){

    if (n <= 1)
        return v[0];

    double sum_left  = rec_sum(v, n/2);
    double sum_right = rec_sum(&v[n/2], n - (n/2));

    return sum_left + sum_right;
}

__global__ void kernel_rec_sum_global_mem(double * v, int n){

    if (n <= 1){
        return; // base case: v[0] already holds its own sum
    }

    if (threadIdx.x == 0){
        kernel_rec_sum_global_mem<<<1,2>>>(v      , n/2       );
    }else{
        kernel_rec_sum_global_mem<<<1,2>>>(&v[n/2], n - (n/2) );
    }

    cudaDeviceSynchronize(); // makes sure that all the sub kernels have finished as far as this thread is concerned

    __syncthreads(); // makes sure global memory is consistent across the block

    if (threadIdx.x == 0){
        v[0] = v[0] + v[n/2];
    }
}

int main(){

    int n = 2<<3;
    double *v = (double *) malloc(n * sizeof(double));

    for (int i = 0; i < n; i++){
        v[i] = 1.0/(1.0 + ((double) i));
    }

    double total;

    // test recursive sum

    total = rec_sum(v, n);
    printf("%f \n", total);

    // test recursive sum with dynamic parallelism

    cudaSetDevice( 0 );

    double *v_dev;
    double *sum_dev;
    cudaMalloc((void**)&v_dev, n*sizeof(double));
    cudaMalloc((void**)&sum_dev, 1*sizeof(double));
    cudaMemcpy(v_dev, v, n*sizeof(double), cudaMemcpyHostToDevice);
    kernel_rec_sum_global_mem<<<1,2>>>(v_dev, n);
    total = 0;
    cudaMemcpy(&total, v_dev, 1*sizeof(double), cudaMemcpyDeviceToHost);

    printf("%f \n", total);

    cudaFree(v_dev);
    cudaFree(sum_dev);

    return 0;
}
```

The first FULL CUDA code of my post had an obvious error. I was passing a pointer to shared memory as a kernel argument, which is not allowed.

```
#include <stdio.h>

double rec_sum(double * v, int n){

    if (n <= 1)
        return v[0];

    double sum_left  = rec_sum(v, n/2);
    double sum_right = rec_sum(&v[n/2], n - (n/2));

    return sum_left + sum_right;
}

__global__ void kernel_rec_sum(double* sum, double * v, int n){

    if (n <= 1){
        *sum = v[0];
        return;
    }

    __shared__ double sum_left;
    __shared__ double sum_right;

    if (threadIdx.x == 0){
        kernel_rec_sum<<<1,2>>>(&sum_left,  v      , n/2       );
    }else{
        kernel_rec_sum<<<1,2>>>(&sum_right, &v[n/2], n - (n/2) );
    }

    cudaDeviceSynchronize(); // makes sure that all the sub kernels have finished as far as this thread is concerned

    __syncthreads(); // makes sure shared memory is coordinated

    if (threadIdx.x == 0){
        *sum = sum_left + sum_right;
    }
}

int main(){

    int n = 2<<3;
    double *v = (double *) malloc(n * sizeof(double));

    for (int i = 0; i < n; i++){
        v[i] = 1.0/(1.0 + ((double) i));
    }

    double total;

    // test recursive sum

    total = rec_sum(v, n);
    printf("%f \n", total);

    // test recursive sum with dynamic parallelism

    cudaSetDevice( 0 );

    double *v_dev;
    double *sum_dev;
    cudaMalloc((void**)&v_dev, n*sizeof(double));
    cudaMalloc((void**)&sum_dev, 1*sizeof(double));
    cudaMemcpy(v_dev, v, n*sizeof(double), cudaMemcpyHostToDevice);
    kernel_rec_sum<<<1,2>>>(sum_dev, v_dev, n);
    total = 0;
    cudaMemcpy(&total, sum_dev, 1*sizeof(double), cudaMemcpyDeviceToHost);

    printf("%f \n", total);

    cudaFree(v_dev);
    cudaFree(sum_dev);

    return 0;
}
```

https://stackoverflow.com/questions/47956167/synchronization-in-cuda-dynamic-parallelism

In particular:

1. any time you are having trouble with a CUDA code, you should do rigorous error checking. This includes, for dynamic parallelism, doing the same kind of error checking in device code (CUDA runtime API return codes, and proper kernel error checking) that you should do in host code.

2. CDP (CUDA Dynamic Parallelism) has, by default, severe limits on the recursion depth supported. You can override this with a runtime API call, but you cannot achieve unlimited launch/recursion depth, while also requiring synchronization.

I suggest you start by incorporating those two concepts into your code. If you’re still having trouble, post the updated code, along with any error spew that comes from running it.
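A minimal host-side sketch of those two ideas might look like the following. The macro name `CUDA_CHECK` and the sync-depth value of 16 are my own illustrative choices, not anything from the original code:

```cuda
#include <stdio.h>

// Illustrative error-checking macro: wrap every runtime API call so that
// failures are reported with file and line instead of passing silently.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            printf("CUDA error: %s at %s:%d\n",                       \
                   cudaGetErrorString(err), __FILE__, __LINE__);      \
        }                                                             \
    } while (0)

int main(){
    // The default device-runtime synchronization depth is only 2 levels,
    // so a recursion that calls cudaDeviceSynchronize() deeper than that
    // will fail. Raising the limit (to 16 here, as an example) must be
    // done on the host before the first launch; n = 16 needs about
    // log2(16) + 1 = 5 levels.
    CUDA_CHECK(cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 16));

    // ... allocate, copy, and launch kernel_rec_sum_global_mem as before ...

    CUDA_CHECK(cudaGetLastError());      // catches launch-time errors
    CUDA_CHECK(cudaDeviceSynchronize()); // catches asynchronous execution errors
    return 0;
}
```

The same `CUDA_CHECK` pattern can be applied to the runtime API calls made inside the kernel, since device code returns the same `cudaError_t` codes.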

By the way, your code as posted at the moment does not compile.