Sum of N numbers in parallel in pairs without repetition.

local_hero · November 15, 2011, 3:16pm

Hi people, someone can help me with this problem? I have 20 numbers in an array Vect and I must sum two numbers at a time, a sum for each thread if possible, without repetitions.

Example: Vect[0,1,2,…,18,19].
Ris[ThreadID=0]=Vect[0]+Vect[1]
Ris[ThreadID=1]=Vect[0]+Vect[2]
…
Ris[ThreadID=18]=Vect[0]+Vect[19]
Ris[ThreadID=20]=Vect[1]+Vect[2]
Ris[ThreadID=21]=Vect[1]+Vect[3]
…
Ris[ThreadID=189]=Vect[18]+Vect[19]

How I can index a sum for each thread as the example above?
Thanks a lot!!!

CJS_NCSU · November 15, 2011, 4:04pm

If you’re launching multiple threads, each thread will have an effective ID of:

int effectiveLoc blockIdx.x * blockDim.x + threadIdx.x;

Now since you are summing in pairs, the effective location your thread has to calculate the first element from is twice its effective location. Thus, the 0th thread adds 0 and 1, 1st adds 2 and 3, etc.

int firstIndex = effectiveLoc * 2;

If you have an array of size N (assuming N is even for a moment), just check to make sure that either the effectiveLoc or firstIndex is within the array bounds.

The second index of the numbers you want to add is:

int secondIndex = firstIndex + 1;

If N is odd, up to you how you want to handle it.

This is the basis of parallel reduction (a term I did not come up with, not sure if it is an industry standard) but in essence, you could repeat the process above to fully sum an entire array (although it is not very efficient once you have 2 values left).

local_hero · November 15, 2011, 8:22pm

But I don’t have adds 0 and 1, 1st adds 2 and 3, etc. I must adds 0 and 1, 0 and 2, 0 and 3,…, 0 and 19, 1 and 2, 1 and 3,…,1 and 19, 2 and 3, 2 and 4,…,2 and 19,…,17 and 18, 17 and 19, 18 and 19. All the combinations without repetition.

CJS_NCSU · November 16, 2011, 5:01am

Ah, sorry about that! Seems you are adding pairs in a triangular manner, not a linear nor square manner.

Since it’s simple to calculate the number of summations you are requiring, you can check the thread’s effective location (as I mentioned earlier) to make sure it should be summing anything at all. If so, you know the first 19 summations are with 0 and something greater than 0, so you can use the thread’s effective location as the index of the something that you want to add up. Then, you know the next 18 summations are with 1 and something greater than 1, so you can use the thread’s effective location there (after resolving offsets) to calculate the index of that next something you want to add.

Hopefully this is helpful!

local_hero · November 16, 2011, 8:54am

Something like this?

global void sumVect(QVECT *Vett, QVECT *VettRis){

int tid; //Thread ID



//Calcolo del thread ID

tid=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x;



VettRis[tid]=Vett[?]+Vett[?];

}

But I do not know what to put in place of “?” for index:

VetRis[tid=0]=Vett[0]+Vett[1]

VetRis[tid=1]=Vett[0]+Vett[2]

…

VetRis[tid=187]=Vett[17]+Vett[18]

VetRis[tid=188]=Vett[17]+Vett[19]

VetRis[tid=189]=Vett[18]+Vett[19]

I do not know how to do and everything I need for my university degree’s thesis :(.

cricri1 · November 16, 2011, 12:01pm

VettRis[tid]=Vett[?]+Vett[?];

if tit = 0 to 189
firt ? =
not real code cuda but the idea is here:)
v=0
for i= 19 to 0 step -1
v=v+i
if v>tit exit for
next i
?=19-i

second ?
?=20-i+ tit -(v-i)

local_hero · November 16, 2011, 2:01pm

You are a GENIUS and I will be grateful for life. THANK YOU!!! :)

djmj1000 · November 18, 2011, 2:23am

if thats for your thesis you should have a look at triangular numbers!

you can output the sums linear and use a 1D array and so one thread can calculate the element at its index

algorithm:

you have n numbers in your input array which lead to n*(n-1)*0.5 output combinations (see triangular formular with 0-based indices)

example n=4:

nOut = 6

direct:

tIdx: 012345

val1: 000112

val2: 123233

or flipped which i calculated in my thesis:

val1: 122333 (row-index)

val2: 001012 (column-index)

so all you have to do is:

calculate the row and column indices of the linearized output triangular matrix in your kernel

(if you have one indices you easily have the other one)

and look with those indices into your array and output the value at the unique thread index

with that unique thread index this algorithm is 100% scalable across all gpu’s and achieves 100% occupancy, so you can calculate it for endless n

I used this approach for a different pairwise triangular matrix problem in my thesis. cluster-algorithm, see page 6

How to calculate row indices

sequence of row indices is:

1223334444 …

formula

row = floor(0.5f + sqrt(0.25f - 2 * (1 - linIdx))); (for 1 based indices)

because of computer precisions in floating point numbers this formula cannot be applied directly to calculate the row, the calculated row needs to be checked up and downwards to be sure

/*

Example: a 0-based index like 496 can result in wrong results

	

row = floor(0.5f + sqrt(0.25f + 2 * 496));

 = floor(0.5f + sqrt(0.25f + 992));

 = floor(0.5f + sqrt(992.25f));

 = floor(0.5f + 31.5f); or floor(0.5f + 31.49f); <-- if its 31.49 due to precision its rounded down!

 = floor(32.0f); or floor(31.99f) <-- can be 31 instead which is wrong! so we need to check +-1

 = 32; or 31;

*/

(calculating this is equally fast as a lookup into global memory on a G80, fermi not tested)

With the row indices the corresponding column indices can easily be calculated.

I needed my row indices on cpu side so i calculated them with multiple cpu-threads so i only needed to check up and down once per thread and the rest of the numbers one cpu-thread is responsible of were calculated in a loop.

Easy Alternative:

use a 2D-Block layout so you dont need to calculate the row and column indices, but half of your threads wont be active since your matrix is a triangular matrix or they calculate with repetition. I assume that is what you already did from your thread title.

If you need more help to calculate the row indices i can post my code which would be almost equal to your complete necessary kernel^^

local_hero · November 18, 2011, 8:48am

if thats for your thesis you should have a look at triangular numbers!

you can output the sums linear and use a 1D array and so one thread can calculate the element at its index

algorithm:

you have n numbers in your input array which lead to n*(n-1)*0.5 output combinations (see triangular formular with 0-based indices)

example n=4:

nOut = 6

direct:

tIdx: 012345

val1: 000112

val2: 123233

or flipped which i calculated in my thesis:

val1: 122333 (row-index)

val2: 001012 (column-index)

so all you have to do is:

calculate the row and column indices of the linearized output triangular matrix in your kernel

(if you have one indices you easily have the other one)

and look with those indices into your array and output the value at the unique thread index

with that unique thread index this algorithm is 100% scalable across all gpu’s and achieves 100% occupancy, so you can calculate it for endless n

I used this approach for a different pairwise triangular matrix problem in my thesis. cluster-algorithm, see page 6

How to calculate row indices

sequence of row indices is:

1223334444 …

formula

row = floor(0.5f + sqrt(0.25f - 2 * (1 - linIdx))); (for 1 based indices)

because of computer precisions in floating point numbers this formula cannot be applied directly to calculate the row, the calculated row needs to be checked up and downwards to be sure
/*

Example: a 0-based index like 496 can result in wrong results

	

row = floor(0.5f + sqrt(0.25f + 2 * 496));

 = floor(0.5f + sqrt(0.25f + 992));

 = floor(0.5f + sqrt(992.25f));

 = floor(0.5f + 31.5f); or floor(0.5f + 31.49f); <-- if its 31.49 due to precision its rounded down!

 = floor(32.0f); or floor(31.99f) <-- can be 31 instead which is wrong! so we need to check +-1

 = 32; or 31;

*/
(calculating this is equally fast as a lookup into global memory on a G80, fermi not tested)

With the row indices the corresponding column indices can easily be calculated.

I needed my row indices on cpu side so i calculated them with multiple cpu-threads so i only needed to check up and down once per thread and the rest of the numbers one cpu-thread is responsible of were calculated in a loop.

Easy Alternative:

use a 2D-Block layout so you dont need to calculate the row and column indices, but half of your threads wont be active since your matrix is a triangular matrix or they calculate with repetition. I assume that is what you already did from your thread title.

If you need more help to calculate the row indices i can post my code which would be almost equal to your complete necessary kernel^^

Thank you! You have been very thorough. The trouble is that I’m new of CUDA, I have started the last week, then, if isn’t a problem, could you post the code? Thanks in advance. :)

cricri1 · November 18, 2011, 12:20pm

so tit 0 to 189
linIdx=190- tit

row = int(20-0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));

with 0.500005 it s always good

local_hero · November 18, 2011, 2:01pm

Sorry but I don’t understand…is row the index of Vett[?] or of a matrix???

djmj1000 · November 18, 2011, 2:57pm

you 100% sure? with a bruteforce for all n?

i played with different values around too but I have no idea how this equation has been created so i dont change anything from it, i would ask the people from the website who created this equation.

Until 496 its always good even with 0.5!

I had no time to wait for an answer so i just checked it up and downwards.

wrong!

you missed the idea, you dont go from 0-n! you go from 0 - [n * (n-1) * 0.5] = m

since you have m pairs you need to calculate m corresponding rows and columns!

you need pairwise pairs from your input array which leads into a triangular matrix!

this triangular matrix has row and columns corresponding to their linear indices of the input array

just draw it on paper its pretty straightforward!

you have luck i have this image from my thesis, where idx is the thread index!

hint: if you have the row and the corresponding index you also have the column!

Why Cuda

anyway why do you use cuda if you only need 200 elements? which leads to 19900 pairwise elements (threads), i doubt this improves performance significant compared to a multithreaded cpu-version

cricri1 · November 18, 2011, 3:59pm

VetRis[tid=0]=Vett[i]+Vett[j]
VetRis[tid=1]=Vett[i]+Vett[j]

linIdx=190- tit

i = int(20-0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));
j= tit-(20+19-i)/2*i+1+i

r= sqrt(0.25f - 2 * (1 - linIdx))
with 0.500005 it s not always good but since r(0)-r(1)>0.00005 it s good

local_hero · November 21, 2011, 12:14pm

But I have missed something in the code or does’t work??? With N=20, me the error at computation 53, 109, 110, 111, 132, 133, 134, 149, 150, 151, 152, 153, 163, 164, 165, 166, 167, 168, 175, 176, 177, 178, 179, 184, 185, 186, 189. Why?

cricri1 · November 21, 2011, 1:17pm

for me it s good
i must test in cuda on evening if i have time

local_hero · November 21, 2011, 1:30pm

My code:

global void sumVect(QVECT *Vett, QVECT *VettRis, int N){

int tid; //Thread ID



//Calcolo del thread ID

tid=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x;





int linIdx=190 - tid;

int i = int(20 - 0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));

    int j= tid - (20+19-i)/2*i+1+i;



//Somma di due quadrivettori

VettRis[tid].x=Vett[i].x+Vett[j].x;

VettRis[tid].y=Vett[i].y+Vett[j].y;

VettRis[tid].z=Vett[i].z+Vett[j].z;

VettRis[tid].Ene=Vett[i].Ene+Vett[j].Ene;

}

Have I missed anything???

cricri1 · November 21, 2011, 7:07pm

work External Image

#include <stdio.h>
#include <cuda.h>
#include <time.h>
#include <math.h>
#include “cutil_inline.h”
global void square_array(int *a,int b,int N,int N2)
{
int sh= threadIdx.x+32threadIdx.y;
int tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;
if ( tid<N2)
{
int linIdx=N2 - tid;

int i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));

int z =(N+N-1-i)*i;
int j= tid - z/2 + 1 + i;

if (i==j)
{
i=i-1;
j=N-1;
}
a[tid]=i;
b[tid]=j;

}

// main routine that executes on the host
int main(void)
{
int *memoiregraphique1; // Pointer to host & device arrays
int *memoiregraphique2; // Pointer to host & device arrays
int *memoirecpu1; // Pointer to host & device arrays
int *memoirecpu2; // Pointer to host & device arrays

int N,N2;

N=20;
N2=N*(N-1)/2;
size_t size = N2 * sizeof(int);

memoirecpu1 = (int *)malloc(size); // Allocate array on host
memoirecpu2 = (int *)malloc(size); // Allocate array on host

cutilSafeCall( cudaThreadSynchronize() );
cudaMalloc((void **) &memoiregraphique1, size);   // Allocate array on device  
cudaMalloc((void **) &memoiregraphique2, size);   // Allocate array on device  


square_array <<< dim3(2048,32,1),dim3(32,16,1)   >>> (memoiregraphique1, memoiregraphique2,N,N2);

cudaMemcpy(memoirecpu1, memoiregraphique1, size, cudaMemcpyDeviceToHost);
cudaMemcpy(memoirecpu2, memoiregraphique2, size, cudaMemcpyDeviceToHost);

for (int i=0; i<N2*2; i=i+1)
{
printf(“%d %d %d\n”, i, memoirecpu1[i],memoirecpu2[i]);

}

//------------------------------

free(memoirecpu1); cudaFree(memoiregraphique1);
free(memoirecpu2); cudaFree(memoiregraphique2);

}

local_hero · November 21, 2011, 8:08pm

Tomorrow I try, now I haven’t a CUDA system. External Image

local_hero · November 22, 2011, 8:57am

Great! Works!!! You are a great!! :D Thank you very much!!! :)

local_hero · November 22, 2011, 9:26am

Sorry my ignorance, but how did you to find this solution? For me is ingenious! For example, how did you to find the number 1048576 and the other values? It’s just to understand :)

Topic		Replies	Views
Iteration help in CUDA CUDA Programming and Performance	11	6878	April 19, 2012
Sum vectors CUDA Programming and Performance	24	5000	December 21, 2011
Reduction CUDA Programming and Performance	19	3463	May 16, 2012
Timing comparison(ms) in calculation of the sum of matrix rows CUDA Programming and Performance cuda , kernel	1	463	October 26, 2022
Urgent help with threads please! CUDA Programming and Performance	21	10786	March 6, 2008
matrix multiply reduction CUDA Programming and Performance	41	35543	January 15, 2011
How to add pointer array value CUDA Programming and Performance	13	1732	May 2, 2019
Double loop, sum on j for each i. atomicAdd()? CUDA Programming and Performance	20	7020	March 25, 2012
Efficient use of shared memory CUDA Programming and Performance	29	4403	December 2, 2019
Writing a kernel efficiently uneven indexing for loop CUDA Programming and Performance	14	6491	September 8, 2010

Sum of N numbers in parallel in pairs without repetition.

Related topics