Hello,
Currently, I am writing a program that computes the dot products for the various neurons in a layer of a neural network.
As part of this, I am calling a dot-product function and assigning one block to each neuron in the next layer. Since my hardware (a Jetson Nano Developer Kit) has only 1 SM, my understanding is that 8 blocks can run simultaneously, and so I have assigned 128 threads per block.
My question is basically: if I am striding through this dot-product calculation in steps of 128 so that no thread repeats a calculation, how can I know when all 128 threads have finished, so that I can write the final value into a shared variable?
I assume I can't rely on the 128th thread completing last, since the order in which threads finish will depend on the multiplication times. Nor can I simply use a shared variable to accumulate the per-thread results and update it after all threads have run, because multiple blocks are executing this function in parallel.
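To make the situation concrete, here is a minimal sketch of the pattern I have in mind (this is not my actual code; `dotProduct_sketch` and the `partial` array are just names for illustration). Each thread accumulates a private partial sum over a strided slice, and then some step has to wait for all 128 partial sums before combining them:

```cuda
// Hypothetical sketch, NOT my real code: one block per output neuron,
// 128 threads per block, each thread summing a strided slice.
__global__ void dotProduct_sketch(const double* a, const double* b,
                                  double* out, int n)
{
    __shared__ double partial[128];        // one slot per thread in the block
    double sum = 0.0;
    for (int i = threadIdx.x; i < n; i += 128)
        sum += a[i] * b[i];                // private partial sum
    partial[threadIdx.x] = sum;
    __syncthreads();                       // is this how I wait for all 128 threads?
    if (threadIdx.x == 0) {                // one thread combines after the barrier
        double total = 0.0;
        for (int t = 0; t < 128; ++t)
            total += partial[t];
        out[blockIdx.x] = total;
    }
}
```

Is a `__syncthreads()` barrier like this the right way to know that every thread's partial sum is ready, or am I misunderstanding what it guarantees?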
I'll attach the program below. Apologies if this question is simple or I have misunderstood something; I am new to CUDA and GPU programming in general.
__device__ void device_dotProduct(double* list1, double* list2, double* _value, double* _target, int arrayLength){
    int idx = threadIdx.x;
    int stride = 128;                   // threads per block
    double target = 0.0;
    // each thread accumulates a private partial sum over a strided slice
    for (int i = idx; i < arrayLength; i += stride){
        target += list1[i]*list2[i];
    }
    *_target = target;                  // every thread writes its own partial sum here -- this is the part I'm unsure about
    _value[blockIdx.x] = target;
    __syncthreads();
}
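For reference, this is roughly how I intend to launch it from the host (one block per neuron in the next layer, 128 threads per block, as described above; `dotProduct_kernel` is a placeholder for the `__global__` wrapper that calls the device function, and `numNeurons`, `d_list1`, etc. are placeholders for my actual variables):

```cuda
// Assumed launch configuration, placeholder names: one block per output
// neuron, 128 threads per block.
dotProduct_kernel<<<numNeurons, 128>>>(d_list1, d_list2, d_value, d_target, arrayLength);
cudaDeviceSynchronize();   // wait for the GPU to finish before reading results back
```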