Looking for advice on writing a stride for multiple blocks running in parallel

Hello,

Currently, I am working on a program that calculates the dot products of the various neurons in a layer of a neural network.

As part of this, I am calling a dot product function and assigning one block to each neuron in the next layer. As my hardware (Jetson Nano Developer Kit) has only 1 SM, I am aware that 8 blocks can run simultaneously, and as such I have assigned 128 threads per block.

My question is basically: if I am striding through this dot product calculation in steps of 128 so that no thread repeats a calculation, how can I know when all 128 threads have finished so I can pass the final value into a shared variable?

I assume I can’t just assume that the 128th thread will complete last, since which thread finishes first depends on the multiplication times. I also can’t just use a shared variable to accumulate the per-thread partial sums in this function and update it after all threads have run, since there are multiple blocks calling this function in parallel.

I’ll attach the program below. Apologies if this question is simple or I have misunderstood something; I am new to CUDA and GPU programming in general.

__device__ void device_dotProduct(double* list1, double* list2, double* _value, double* _target, int arrayLength){

    int idx = threadIdx.x;
    int stride = 128;   // = blockDim.x

    double target = 0.0;
    for (int i = idx; i < arrayLength; i += stride){
        target += list1[i]*list2[i];
    }

    // Every thread writes its own partial sum here, so these stores race
    *_target = target;
    _value[blockIdx.x] = target;

    __syncthreads();
}

You need to perform a parallel reduction: Faster Parallel Reductions on Kepler | NVIDIA Technical Blog

int idx = threadIdx.x;

double target = 0.0;
for (int i = idx; i < arrayLength; i += 128){
    target += list1[i]*list2[i];
}

// blockreduce() sums `target` across all threads in the block
double blocksum = blockreduce(target);
if(threadIdx.x == 0)
    output[blockIdx.x] = blocksum;
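For reference, one possible shape for that blockreduce() helper. This is a hedged sketch, not taken from the linked blog post: it uses shared memory and __syncthreads() rather than the warp-shuffle approach the article describes, and it assumes blockDim.x is a power of two no larger than 128 (matching the 128 threads per block above).

```cuda
// Sketch of a block-wide sum reduction using shared memory.
// Assumes blockDim.x is a power of two and <= 128.
__device__ double blockreduce(double val)
{
    __shared__ double scratch[128];

    scratch[threadIdx.x] = val;
    __syncthreads();   // make every partial sum visible to the whole block

    // Tree reduction: halve the number of active threads each step.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            scratch[threadIdx.x] += scratch[threadIdx.x + offset];
        __syncthreads();
    }

    // scratch[0] now holds the block-wide sum; every thread returns it.
    return scratch[0];
}
```

The __syncthreads() calls are what answer the original question: they are the barrier that guarantees all 128 threads have finished their strided partial sums before any thread reads the combined result.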

Thank you, I will look into this.