Sum of N numbers in parallel in pairs without repetition.

Hi people, someone can help me with this problem? I have 20 numbers in an array Vect and I must sum two numbers at a time, a sum for each thread if possible, without repetitions.

Example: Vect[0,1,2,…,18,19].



How I can index a sum for each thread as the example above?
Thanks a lot!!!

If you’re launching multiple threads, each thread will have an effective ID of:

int effectiveLoc blockIdx.x * blockDim.x + threadIdx.x;

Now since you are summing in pairs, the effective location your thread has to calculate the first element from is twice its effective location. Thus, the 0th thread adds 0 and 1, 1st adds 2 and 3, etc.

int firstIndex = effectiveLoc * 2;

If you have an array of size N (assuming N is even for a moment), just check to make sure that either the effectiveLoc or firstIndex is within the array bounds.

The second index of the numbers you want to add is:

int secondIndex = firstIndex + 1;

If N is odd, up to you how you want to handle it.

This is the basis of parallel reduction (a term I did not come up with, not sure if it is an industry standard) but in essence, you could repeat the process above to fully sum an entire array (although it is not very efficient once you have 2 values left).

But I don’t have adds 0 and 1, 1st adds 2 and 3, etc. I must adds 0 and 1, 0 and 2, 0 and 3,…, 0 and 19, 1 and 2, 1 and 3,…,1 and 19, 2 and 3, 2 and 4,…,2 and 19,…,17 and 18, 17 and 19, 18 and 19. All the combinations without repetition.

Ah, sorry about that! Seems you are adding pairs in a triangular manner, not a linear nor square manner.

Since it’s simple to calculate the number of summations you are requiring, you can check the thread’s effective location (as I mentioned earlier) to make sure it should be summing anything at all. If so, you know the first 19 summations are with 0 and something greater than 0, so you can use the thread’s effective location as the index of the something that you want to add up. Then, you know the next 18 summations are with 1 and something greater than 1, so you can use the thread’s effective location there (after resolving offsets) to calculate the index of that next something you want to add.

Hopefully this is helpful!

Something like this?

global void sumVect(QVECT *Vett, QVECT *VettRis){

int tid; //Thread ID

//Calcolo del thread ID

tid=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x;



But I do not know what to put in place of “?” for index:






I do not know how to do and everything I need for my university degree’s thesis :(.


if tit = 0 to 189
firt ? =
not real code cuda but the idea is here:)
for i= 19 to 0 step -1
if v>tit exit for
next i

second ?
?=20-i+ tit -(v-i)

You are a GENIUS and I will be grateful for life. THANK YOU!!! :)

if thats for your thesis you should have a look at triangular numbers!

you can output the sums linear and use a 1D array and so one thread can calculate the element at its index


you have n numbers in your input array which lead to n*(n-1)*0.5 output combinations (see triangular formular with 0-based indices)

example n=4:

nOut = 6


tIdx: 012345

val1: 000112

val2: 123233

or flipped which i calculated in my thesis:

val1: 122333 (row-index)

val2: 001012 (column-index)

so all you have to do is:

calculate the row and column indices of the linearized output triangular matrix in your kernel

(if you have one indices you easily have the other one)

and look with those indices into your array and output the value at the unique thread index

with that unique thread index this algorithm is 100% scalable across all gpu’s and achieves 100% occupancy, so you can calculate it for endless n

I used this approach for a different pairwise triangular matrix problem in my thesis. cluster-algorithm, see page 6

How to calculate row indices

sequence of row indices is:

1223334444 …


row = floor(0.5f + sqrt(0.25f - 2 * (1 - linIdx))); (for 1 based indices)

because of computer precisions in floating point numbers this formula cannot be applied directly to calculate the row, the calculated row needs to be checked up and downwards to be sure


Example: a 0-based index like 496 can result in wrong results


row = floor(0.5f + sqrt(0.25f + 2 * 496));

 = floor(0.5f + sqrt(0.25f + 992));

 = floor(0.5f + sqrt(992.25f));

 = floor(0.5f + 31.5f); or floor(0.5f + 31.49f); <-- if its 31.49 due to precision its rounded down!

 = floor(32.0f); or floor(31.99f) <-- can be 31 instead which is wrong! so we need to check +-1

 = 32; or 31;


(calculating this is equally fast as a lookup into global memory on a G80, fermi not tested)

With the row indices the corresponding column indices can easily be calculated.

I needed my row indices on cpu side so i calculated them with multiple cpu-threads so i only needed to check up and down once per thread and the rest of the numbers one cpu-thread is responsible of were calculated in a loop.

Easy Alternative:

use a 2D-Block layout so you dont need to calculate the row and column indices, but half of your threads wont be active since your matrix is a triangular matrix or they calculate with repetition. I assume that is what you already did from your thread title.

If you need more help to calculate the row indices i can post my code which would be almost equal to your complete necessary kernel^^

Thank you! You have been very thorough. The trouble is that I’m new of CUDA, I have started the last week, then, if isn’t a problem, could you post the code? Thanks in advance. :)

so tit 0 to 189
linIdx=190- tit

row = int(20-0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));

with 0.500005 it s always good

Sorry but I don’t understand…is row the index of Vett[?] or of a matrix???

you 100% sure? with a bruteforce for all n?

i played with different values around too but I have no idea how this equation has been created so i dont change anything from it, i would ask the people from the website who created this equation.

Until 496 its always good even with 0.5!

I had no time to wait for an answer so i just checked it up and downwards.


you missed the idea, you dont go from 0-n! you go from 0 - [n * (n-1) * 0.5] = m

since you have m pairs you need to calculate m corresponding rows and columns!

you need pairwise pairs from your input array which leads into a triangular matrix!

this triangular matrix has row and columns corresponding to their linear indices of the input array

just draw it on paper its pretty straightforward!

you have luck i have this image from my thesis, where idx is the thread index!

hint: if you have the row and the corresponding index you also have the column!

Why Cuda

anyway why do you use cuda if you only need 200 elements? which leads to 19900 pairwise elements (threads), i doubt this improves performance significant compared to a multithreaded cpu-version


linIdx=190- tit

i = int(20-0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));
j= tit-(20+19-i)/2*i+1+i

r= sqrt(0.25f - 2 * (1 - linIdx))
with 0.500005 it s not always good but since r(0)-r(1)>0.00005 it s good

But I have missed something in the code or does’t work??? With N=20, me the error at computation 53, 109, 110, 111, 132, 133, 134, 149, 150, 151, 152, 153, 163, 164, 165, 166, 167, 168, 175, 176, 177, 178, 179, 184, 185, 186, 189. Why?

for me it s good
i must test in cuda on evening if i have time

My code:

global void sumVect(QVECT *Vett, QVECT *VettRis, int N){

int tid; //Thread ID

//Calcolo del thread ID

tid=threadIdx.x + threadIdx.y * blockDim.x + threadIdx.z * blockDim.y * blockDim.x;

int linIdx=190 - tid;

int i = int(20 - 0.500005f - sqrt(0.25f - 2 * (1 - linIdx)));

    int j= tid - (20+19-i)/2*i+1+i;

//Somma di due quadrivettori






Have I missed anything???

work External Image

#include <stdio.h>
#include <cuda.h>
#include <time.h>
#include <math.h>
#include “cutil_inline.h”
global void square_array(int *a,int b,int N,int N2)
int sh= threadIdx.x+32
int tid = 512 * blockIdx.x + 1048576 * blockIdx.y + sh;
if ( tid<N2)
int linIdx=N2 - tid;

int i = int(N - 0.5 - sqrt(0.25 - 2 * (1 - linIdx)));

int z =(N+N-1-i)*i;
int j= tid - z/2 + 1 + i;

if (i==j)



// main routine that executes on the host
int main(void)
int *memoiregraphique1; // Pointer to host & device arrays
int *memoiregraphique2; // Pointer to host & device arrays
int *memoirecpu1; // Pointer to host & device arrays
int *memoirecpu2; // Pointer to host & device arrays

int N,N2;

size_t size = N2 * sizeof(int);

memoirecpu1 = (int *)malloc(size); // Allocate array on host
memoirecpu2 = (int *)malloc(size); // Allocate array on host

cutilSafeCall( cudaThreadSynchronize() );
cudaMalloc((void **) &memoiregraphique1, size);   // Allocate array on device  
cudaMalloc((void **) &memoiregraphique2, size);   // Allocate array on device  

square_array <<< dim3(2048,32,1),dim3(32,16,1)   >>> (memoiregraphique1, memoiregraphique2,N,N2);  

cudaMemcpy(memoirecpu1, memoiregraphique1, size, cudaMemcpyDeviceToHost);
cudaMemcpy(memoirecpu2, memoiregraphique2, size, cudaMemcpyDeviceToHost);

for (int i=0; i<N2*2; i=i+1)
printf(“%d %d %d\n”, i, memoirecpu1[i],memoirecpu2[i]);



free(memoirecpu1); cudaFree(memoiregraphique1);
free(memoirecpu2); cudaFree(memoiregraphique2);


Tomorrow I try, now I haven’t a CUDA system. External Image

Great! Works!!! You are a great!! :D Thank you very much!!! :)

Sorry my ignorance, but how did you to find this solution? For me is ingenious! For example, how did you to find the number 1048576 and the other values? It’s just to understand :)