Implementing striding to sum into a single GPU variable in a block

Hello,

I am currently trying to implement a strided dot product function. I want the function to update a GPU variable so that, once the striding is finished, the full dot product sum is stored in that single variable.

Currently, the first thing I do is reset the GPU variable to 0 using a separate kernel. I then wait for the threads to sync before moving on to the dot product calculation.

Below is the code that works when I run the block with a single thread, followed by the code I use with multiple threads, which does not work.

In both cases the kernel has the following inputs:

__device__ void device_dotProduct(double* list1, double* list2, double* _value, double* _target, int arrayLength)

Code for a single thread (working):

int idx = threadIdx.x;
int stride = 1;

double target = 0.0;
for (int i = 0; i < arrayLength; i+=1){
    target += list1[i]*list2[i];
}

*_target = target;
__syncthreads();

Code for multiple threads (not working):

int idx = threadIdx.x;
double target = 0.0;
for (int i = idx; i < arrayLength; i+=128){
    target += list1[i] * list2[i];
}
double sum = *_target + target;
double current_target = *_target;
__syncthreads();
*_target = current_target + target;
__syncthreads();

What you are describing is a parallel reduction, which is a fundamental algorithm. There is plenty of material on how to implement it, and on how to implement it efficiently.

For example: Faster Parallel Reductions on Kepler | NVIDIA Technical Blog
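
To make that concrete for the code in the question: in the multithreaded version, every thread reads *_target and then writes its own current_target + target back, so the writes race and only one thread's partial sum survives. A minimal sketch of a single-block reduction that avoids this is below. It is only an illustration under some assumptions: the block size is a power of two (128 here, to match the stride in the question), the function is written as a standalone __global__ kernel rather than the original __device__ function, and the names dotProductKernel, partial and dotProductOnGpu are made up for this example.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed block size: matches the stride of 128 used in the question.
// The tree reduction below requires it to be a power of two.
constexpr int BLOCK_SIZE = 128;

// Single-block dot product. Each thread accumulates a strided partial sum,
// then the block combines the partial sums through shared memory.
__global__ void dotProductKernel(const double* list1, const double* list2,
                                 double* target, int arrayLength)
{
    __shared__ double partial[BLOCK_SIZE];
    int idx = threadIdx.x;

    // Step 1: strided accumulation into a per-thread register.
    double sum = 0.0;
    for (int i = idx; i < arrayLength; i += blockDim.x) {
        sum += list1[i] * list2[i];
    }
    partial[idx] = sum;
    __syncthreads();

    // Step 2: tree reduction. The number of active threads halves each step.
    for (int offset = blockDim.x / 2; offset > 0; offset /= 2) {
        if (idx < offset) {
            partial[idx] += partial[idx + offset];
        }
        __syncthreads();  // reached by all threads, outside the if
    }

    // Step 3: exactly one thread writes the result, so there is no race.
    if (idx == 0) {
        *target = partial[0];
    }
}

// Hypothetical host helper for trying the kernel (error checking omitted).
double dotProductOnGpu(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(double);

    double *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, sizeof(double));
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    // One block only: __syncthreads() does not synchronize across blocks.
    dotProductKernel<<<1, BLOCK_SIZE>>>(dA, dB, dOut, n);

    double result = 0.0;
    cudaMemcpy(&result, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return result;
}

int main()
{
    // The exact dot product of these inputs is 1000 * 1.0 * 2.0 = 2000.
    std::vector<double> a(1000, 1.0), b(1000, 2.0);
    printf("dot = %f\n", dotProductOnGpu(a, b));
    return 0;
}
```

Because thread 0 overwrites the target rather than adding to it, this version should not need the separate kernel that resets the variable to 0. A simpler alternative is to have each thread call atomicAdd on the target with its partial sum (keeping the reset step); atomicAdd on double is natively available on devices of compute capability 6.0 and higher. The blog post linked above covers faster, shuffle-based variants.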

Thank you.
