Visiting every combination of three elements in CUDA?


Here is some pseudocode that does a calculation on every combination of three elements from an array. I am wondering how to do this in CUDA?

int N = myArray.size();
for (int i = 0; i < N; i++)
    for (int j = i + 1; j < N; j++)
        for (int k = j + 1; k < N; k++)
            doCalc(myArray[i], myArray[j], myArray[k]);

Assuming N (myArray size) varies from 21-60 and I launch the same kernel for each new myArray I throw at it. Note that doCalc is not a heavy calculation. I am using a Titan (Kepler) card. What is the CUDA code to do this?

Thanks for any help,

P.S. I couldn’t find an example code for 3 nested loops. I’m not sure whether to use 2D grids, 3D grids, 3D blocks, etc.

That depends on how complicated doCalc is (how long it takes to run). If it’s quick, then I would have each thread contain the loop on k.

The total number of threads you need is then approximately N*N/2 (i.e. the i and j loops).
Pick a block size that is a multiple of 32, say 128 (128 threads per block).
The number of blocks in the grid is then (N*N/2)/128, rounded up.

Each thread will need to calculate its i and j from its block number and thread number within its block.


Because of the nature of your loops you will need to launch N^3 threads. This is not a big issue, since the threads with j<i and k<j do nothing. In order to launch N^3 threads you need number of blocks x number of threads per block >= N^3. For N < 1024 I use a very simple grid: <<< dim3(N,N), N >>>. This launches N*N blocks with N threads per block.
The indices are obtained in the kernel as i=threadIdx.x, j=blockIdx.x, k=blockIdx.y, and you can have an if statement. You must have N <= 1024.
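A sketch of that launch (doCalc stands for the user's calculation from the question; the kernel name and signature are mine):

```cuda
__global__ void tripleKernel(const float *arr, int N)
{
    int i = threadIdx.x;   // 0 .. N-1
    int j = blockIdx.x;    // 0 .. N-1
    int k = blockIdx.y;    // 0 .. N-1
    if (i < j && j < k)    // keep only unique combinations i < j < k
        doCalc(arr[i], arr[j], arr[k]);
}

// host side, requires N <= 1024:
// tripleKernel<<<dim3(N, N), N>>>(d_arr, N);
```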

If N > 1024 then you can use <<< dim3((N+tbp-1)/tbp, N, N), tbp >>> with the only change being i=threadIdx.x+blockIdx.x*blockDim.x.

This is a good case because the blocks which do nothing are discarded very quickly.
There are other possible combinations. The most general one would be <<< dim3(b,b,b), dim3(t,t,t) >>> with b*t >= N. In this case you can have: i=threadIdx.x+blockIdx.x*blockDim.x; and similar for j and k.

These examples are not optimal because of the large amount of memory transfers. For the 2D case there is an N-body problem example in which shared memory is used to reduce the memory transfers. In that case they only have pairs (i,j), but maybe you can extend it to your case. Here is the link

Thanks everyone.

Pasoleatis, your explanation made it all click in my mind, and thanks for the link. I’m visualising a whole queue of blocks ready for processing where each block has a multiple of 32 threads. Actually what I am doing is more like…

doCalc(i, j, k), where for each kernel launch there will be multiple myArrays, all of size N (each array holds a different type of data, but they belong together); so I’m using a structure of arrays (as is recommended).

This is what I’m thinking, correct me if I’m wrong.

  1. Load the myArray(s) into constant memory.
  2. In the kernel, load all the myArrays into shared memory and call __syncthreads().
  3. Do the calculations, reading the values from the arrays in shared memory.
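A minimal sketch of steps 2 and 3, assuming a single float array, N <= 1024, and one block handling the whole array (doCalc is the user's function from the question; the kernel name is mine):

```cuda
__global__ void tripleShared(const float *g_arr, int N)
{
    extern __shared__ float s_arr[];   // sized at launch: N * sizeof(float)
    int i = threadIdx.x;
    if (i < N)
        s_arr[i] = g_arr[i];           // step 2: stage the array in shared memory
    __syncthreads();                   // wait until the whole array is loaded
    // step 3: thread i loops over j and k, reading only shared memory
    for (int j = i + 1; j < N; ++j)
        for (int k = j + 1; k < N; ++k)
            doCalc(s_arr[i], s_arr[j], s_arr[k]);
}

// launch: tripleShared<<<1, N, N * sizeof(float)>>>(d_arr, N);
```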

I won’t have N > 1024. But if I do, I can discard data beyond that size anyway, so the <<< dim3(N,N), N >>> launch is fine.

The other factor is that the number of kernel launches per second varies from 10,000+ to in extreme cases 2,000,000. I’m hoping kernel launches can be interleaved (or at least queued) as each run operates on separate independent data? Titan card doesn’t have HyperQ so not sure if that will be a problem? I’m also assuming the overhead for launching kernels is relatively small?

So if I do multiple kernel launches in a short space of time, then each multiprocessor can have blocks queued up from different kernel launches, is that correct?

Thanks for any help,


Titan cards do have HyperQ. It supports up to 8 queues (or MPI jobs). There is also the possibility of using streams.
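An illustrative sketch of queuing independent batches on a few streams so copies and kernels from different launches can overlap (all names here are mine, and the host buffers would need to be pinned with cudaMallocHost for the async copies to actually overlap):

```cuda
const int NSTREAMS = 8;
cudaStream_t streams[NSTREAMS];
for (int s = 0; s < NSTREAMS; ++s)
    cudaStreamCreate(&streams[s]);

for (int b = 0; b < numBatches; ++b) {
    int s = b % NSTREAMS;
    // d_arr[s] is a per-stream device buffer so batches don't overwrite each other
    cudaMemcpyAsync(d_arr[s], h_arr[b], N * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    myKernel<<<dim3(N, N), N, 0, streams[s]>>>(d_arr[s], N);
}
cudaDeviceSynchronize();  // wait for all queued work to finish
```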

Constant memory increases performance in some cases. In other cases it might be faster to use global memory and let the compiler optimize it using the L1 and L2 caches. If all the data fits in shared memory, you could load all of it into shared memory and do all the operations within one block.

In fact, the data arrays come in handy batches of 8-20, so I can launch a number of blocks at once, and in each block copy its corresponding arrays to shared memory and run the j and k loops within that block. That might be optimal. I’ll try it and benchmark to see.


I wrote a very fast way of visiting every combination of three array elements exactly once (without any repeats, etc.):

In my example it visits all unique combinations of three array elements, then evaluates each combination, scans, reduces, and returns the optimal result along with a combination which gave that result (if there is more than one, it will return a valid combination, but not necessarily the first it encountered).

In this case the running time for the entire algorithm is c*(N choose 3)*N, where N is the number of elements. The last factor of N is due to the evaluation step for each combination: in my example it takes each three-point combination (which forms a triangle) and counts how many points that triangle contains. The constant c is the time it takes to derive the three-element combination from the offset number. The CPU version does not have to do this step because it has the nested loops, so in fact the CUDA version is doing more work, and it still absolutely kills the serial version. The CPU implementation is in the code as well, though be prepared to wait a while when it runs with a larger N.

This method is much faster than the pure brute-force method, and is at least 1000 times faster than a serial implementation on a 3.9 GHz CPU which does the same thing. Another advantage is that it can handle arrays with N greater than the max number of threads per block (1024).

Keep in mind, though, that the number of combinations gets large for N > 500, so I used the 64-bit long long type. GPUs with poor 64-bit performance may not see the same results.

If anybody with a 780, 780 Ti, Titan, or Quadro K6000 has a free moment, I would love to benchmark the above code on those GPUs.

If on Linux, you only need to change the timing function. You can comment out the CPU part to save time.

Gave it a try; changed the timing code to use gettimeofday() on Linux. The time result is in seconds.
Running a GTX Titan on Linux Mint 16.

Expected iterations CPU= 10354250000

Expected iterations GPU= 113896750000

Running CPU implementation..
CPU solution timing: 126.01
CPU best value= 254 , point indexes ( 369 , 301 , 185 ).
CUDA timing: 0.630161
GPU best value= 254 , point indexes ( 369 , 301 , 185 ).

Success. GPU results match CPU results! GPU was 199.966x faster than the 3.9 GHz CPU.

And with Double Precision mode turned on:

Expected iterations CPU= 10354250000

Expected iterations GPU= 113896750000

Running CPU implementation..
CPU solution timing: 125.935
CPU best value= 240 , point indexes ( 300 , 255 , 92 ).
CUDA timing: 0.671099
GPU best value= 240 , point indexes ( 300 , 255 , 92 ).

Success. GPU results match CPU results! GPU was 187.655x faster than the 3.9 GHz CPU.

Wow, changing some optimization flags changed things quite a bit.
Compiled with

nvcc -O3 --use_fast_math -m64 -v -Xcompiler -march=native -Xcompiler -ffast-math -gencode arch=compute_35,code=sm_35 -odir "src" -M -o "EXP3.d" ""

I get (single precision mode):

Expected iterations CPU= 10354250000
Expected iterations GPU= 113896750000

Running CPU implementation..
CPU solution timing: 83.5187
CPU best value= 254 , point indexes ( 343 , 229 , 194 ).
CUDA timing: 0.313176
GPU best value= 254 , point indexes ( 343 , 229 , 194 ).

Success. GPU results match CPU results! GPU was 266.683x faster than the 3.9 GHz CPU.

Thanks for the test. I’m kind of surprised that my Tesla K20c times in Windows for this were better than the Titan’s. I wonder why this is the case; this was for 500 points, right?

I got 240 ms round trip for 500 points.

Yes, I didn’t change N from 500.

One other thing that caused compilation problems was:

adj=(long long(ii)<<8LL);

changed those occurrences to

adj=((long long)ii<<8LL);

Not sure if that’s a possible cause for slowdown.


BTW what CPU setup do you have? It is really really fast, and I am curious about that since mine runs at 3.9 GHz.

Not sure why that code runs slower on the Titan than on the K20, but it’s probably not from that long long cast.


Inspired by Gogar, I also made some optimizations, improved the running time of this algorithm, and updated my repository.

By using constant memory for the float2 array, I reduced the CUDA implementation running time by about 10-15%.

The CPU times are also more in line with Gogar’s now, so the overall difference between the implementations is about 450x, which seems more realistic.