Looking for advice on writing a stride for multiple blocks running in parallel

Hello,

Currently, I am working on a program that calculates the dot products of the various neurons in a layer of a neural network.

As part of this, I am calling a dot product function and assigning one block to each neuron in the next layer. As my hardware (Jetson Nano Developer Kit) has only 1 SM, I am aware that 8 blocks can run simultaneously, and as such I have assigned 128 threads per block.

My question is basically: if I am striding through this dot product calculation in steps of 128 so that no thread repeats a calculation, how can I know when all 128 threads have finished so I can pass the final value into a shared variable?

I assume I can’t just assume that the 128th thread will complete last, since which thread finishes first depends on the multiplication times. I also can’t just use a shared variable to accumulate the per-thread partial sums in this function and update it after all threads have run, since there are multiple blocks calling this function in parallel.

I’ll attach the program below. Apologies if this question is simple or I have misunderstood something; I am new to CUDA and GPU programming in general.

__device__ void device_dotProduct(double* list1, double* list2, double* _value, double* _target, int arrayLength){

    int idx = threadIdx.x;
    int stride = 128;   // = blockDim.x

    double target = 0.0;
    for (int i = idx; i < arrayLength; i += stride){
        target += list1[i]*list2[i];
    }

    // Every thread writes its own partial sum here, so these stores race
    *_target = target;
    _value[blockIdx.x] = target;

    __syncthreads();
}

You need to perform a parallel reduction: Faster Parallel Reductions on Kepler | NVIDIA Technical Blog

int idx = threadIdx.x;

double target = 0.0;
for (int i = idx; i < arrayLength; i += 128){
    target += list1[i]*list2[i];
}

// blockreduce() sums `target` across all threads in the block
double blocksum = blockreduce(target);
if(threadIdx.x == 0)
    output[blockIdx.x] = blocksum;
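For reference, one possible shape for that blockreduce() helper. This is a hedged sketch, not taken from the linked blog post: it uses shared memory and __syncthreads() rather than the warp-shuffle approach the article describes, and it assumes blockDim.x is a power of two no larger than 128 (matching the 128 threads per block above).

```cuda
// Sketch of a block-wide sum reduction using shared memory.
// Assumes blockDim.x is a power of two and <= 128.
__device__ double blockreduce(double val)
{
    __shared__ double scratch[128];

    scratch[threadIdx.x] = val;
    __syncthreads();   // make every partial sum visible to the whole block

    // Tree reduction: halve the number of active threads each step.
    for (int offset = blockDim.x / 2; offset > 0; offset >>= 1) {
        if (threadIdx.x < offset)
            scratch[threadIdx.x] += scratch[threadIdx.x + offset];
        __syncthreads();
    }

    // scratch[0] now holds the block-wide sum; every thread returns it.
    return scratch[0];
}
```

The __syncthreads() calls are what answer the original question: they are the barrier that guarantees all 128 threads have finished their strided partial sums before any thread reads the combined result.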

Thank you, I will look into this.