Multiple GPUs of different types: how to evenly split work among them?

I have some problems which are easy to split and execute concurrently, but have different GPUs in the PC.

As you can imagine in such a situation the running time will only be as fast as the slowest GPU.

If it were a simple 32-bit compute-bound application, or a memory-bound application, determining how much work to give each GPU would be easier, but in this case it is a bit more complicated.

The most frequent math done in the kernels is 64-bit integer subtraction. I am not sure how to derive that 64-bit integer capability from the published 32-bit GIOPS figures, or even from the 64-bit GFLOPS.

For example, the 1.0 GHz Titan X is about 25% faster than the 1.3 GHz GTX 980 for this task, so I have used that ratio to break up the work. I would like a more accurate approach if possible.

I can use trial-and-error to get this right, but wondering how others have dealt with similar situations.

Would it be possible to construct a semi-representative, yet short-running, test kernel that you can run on all attached GPUs at application startup? That would allow your app to adjust the load splitting fairly accurately without knowing anything about the GPUs per se, and for arbitrary mixes of GPUs. It would be future-proof as well. In effect, a self-tuning load-balancing mechanism.

njuffa seems to argue for estimating execution time (per function/kernel) per device

an alternative approach i generally favour is to sub-divide the problem sufficiently into sub-problems, to allow more dynamic work distribution
sub-problems have a fraction of the calculation time of problems; hence, device idle time due to imperfect work distribution can only be a fraction of calculation time
work is forward issued and distributed on the fly, knowing that the ‘distribution error’ would only be a fraction of calculation time of the problem

I launch one massive kernel on both GPUs, then a ‘clean-up’ kernel for each GPU after…

Also figured out that I should change the launch configuration for each GPU as well, as it makes a difference.

Thanks for the suggestions!

Got the permutation generation/evaluation nicely split so both GPUs finish within 100 ms of each other.

Now am working on this simpler “Magic Square” brute force problem for the same 2 GPU configuration. For some reason this one is a bit trickier to get right.
It is an easier problem than the permutation problem, but has a much larger problem space (over 33 trillion arrangements of a game board).

This is the current incarnation for a single GPU:

Again it is a bunch of 64 bit integer multiplication and subtraction, with lots of thread-local register usage.

I must be overlooking the 64-bit multiplications. The bulk of the work seems to be in GPU_step0(), and that seems to be busy computing many divisions with remainder, where all quotients are in [0,6]?

Since the divisions are by constants, they could actually be coded as such, letting the compiler convert them to multiplications with the inverse under the hood, but due to the lengthy emulation sequence for 64-bit multiplication on Maxwell, that is unlikely to be a win? Hard to tell.

By the way, I don’t see the need to have all the loops use long long data. int should suffice for most of them and be faster. Starting at this loop ‘a’ fits into an ‘int’:


Since integer divisions with compile-time constant divisor are usually somewhat faster for ‘unsigned int’ vs ‘int’, an experiment seems worthwhile that switches ‘a’ and ‘pos’ from ‘long long int’ to ‘unsigned int’ once the magnitude of the data allows it, then computes:

unsigned int a, pos;
uint2 t2;
Arr[10] = t2.x = pos / 282475249;
pos -= t2.x * 282475249;

This should resolve into something like a 32-bit mul.wide, a regular 32-bit multiply, a right shift, and a subtraction. Even with 32-bit integer multiplies being emulated on Maxwell, I would think this should be a win on average, compared to the loop solution.

“Now am working on this simpler “Magic Square” brute force problem for the same 2 GPU configuration. For some reason this one is a bit trickier to get right.”

i did not look at the code, only the functional description (and a well-written description i may add)
based on that, i would again suggest the method of sub-dividing the problem sufficiently into sub-problems

you could easily divide the problem space up into (unions of) a number of sub-problem-spaces, and push the latter for processing
combinatorics-based problems tend to be easy to break up, because a combination (or permutation) is generally a union of sub-combinations

i am not going to consider your case of a 4x4 matrix; rather, consider the elementary case of a 1x6 combination
if each element or location is allowed a value between 0 - 6, note all possible sub-combinations for the first 1 or first 2 (n) locations, count these, and push them into a work stack
use a host atomic as pointer into the work stack
the host thread(s) minding the devices draw sub-combinations as work from the work stack via the (atomic) pointer (they know the quantity of work on the stack, thus when to stop)
the devices then complete the sub-combinations
pushing the first location into the work stack would amount to 7 sub-tasks, and a work stack of [0, 1, 2, 3, 4, 5, 6]; pushing the first 2 locations into the work stack would amount to 7 x 7 sub-tasks, and a work stack of [0,0; 0,1;…1,5; 1,6]
while loops too can be used to break up the work, preventing the work stack from growing too big, whilst at the same time having the problem space sufficiently broken up
from this it should be clear that, as mentioned, combinatorics-based problems are relatively easy to break up

i also would question whether a brute force approach is the optimal implementation, given that you have a well formulated description of what constitutes a valid or ‘successful’ or ‘winning’ combination
again the elementary case of 1x4, not 4x4:
if the sum of the 4 locations must be 10, and the values of the 1st and 2nd locations are already selected as 2 and 4 respectively, you already know a lot about the valid values for the 3rd and 4th locations
hence, neither the 3rd nor the 4th location can hold a value greater than 4; etc, etc
considering any sub-combination with a value greater than 4 for the 3rd location is redundant
applying such knowledge would allow you to jump combinations, rather than merely step combinations

I abuse an evil hack to get efficient fine grained work distribution for multiGPU. It’d work well for your solver too.

It works best when the work can be broken into small tasks which can be run in any order and do not depend on each other, much like a grid of blocks except you want to run them on many (perhaps heterogeneous) GPUs with load balancing.

Create a zero-copy host integer, initialized to 0. Start the kernel on all GPUs. You don’t need many blocks, just enough to fill all SMMs to capacity.

Each block, in a loop, atomically increments the host integer by 1 and uses the result as its “block number”. It does its computation as usual, reports/saves its answers, then loops back for another index until the index is past the grid size. That’s it! It does not use blockIdx for work allocation, just the atomic.

Most experienced CUDA programmers are currently glaring at this post and preparing a scathing reply about how atomic consistency is not guaranteed over PCIE, this is going to cause problems and you are playing with fire and will get burned. My friends, please calm yourselves. I agree with you!

The atomic updates WILL fail with multiple GPUs. They’ll fail by having some values skipped over. They’ll fail by having some blocks receive the same duplicate value as other blocks.

BUT… what is the behavior of the failure? The failures are common, with errors every time you launch, but not frequent in an absolute sense. As long as your kernel isn’t a tiny microsecond loop hammering the atomic address, the failures are rare. In my 1 ms compute “blocks”, the failures are about one in 50,000 increments. This is manageable. Your work needs to be resilient to double-distribution. The magic square search is: it’s A-OK if two blocks happen to check the same configuration redundantly… it’s just a tiny lost efficiency.

You also need to be resilient to (sparse!) holes in the work. There may be some (rare!) indices which NO block gets assigned. I handle this by recording final results in an array; after the whole (multi-GPU) compute is finished I scan the array for holes and launch the whole kernel all over again (with the atomic reset!), and the blocks CHECK the array to see whether their index still needs to be run. This fills in the holes, and a third pass could be done if there are still holes (very, very unlikely).

It’s all an ugly hack, and with ugly workarounds to fill in holes, and with ugly inefficiency because some work is double-computed.

But it works. It’s just 4 lines of code in the kernel, and 7 lines of code on the host. It works with any number of GPUs, and it load balances any mix of GPU speeds. It works, unchanged, for 1 GPU too. And it works very efficiently. The atomic failures are so rare that there’s very little wasted work. The load balancing works really well, with a multi-hour compute on 5 disparate GPUs (a dual-GPU GTX 590, 2 K20s, and 1 GTX 750 Ti) all finishing within a few milliseconds of each other.

Volta, with UVM, may even make this hack legitimate, since atomic consistency over PCIe is likely part of its design.

All really good suggestions, thanks.

For some reason I thought one 64-bit integer division would not be faster than up to seven 64-bit subtractions.

I suspect you are indeed correct, and I am going to test both suggestions tomorrow.
In this type of application these optimizations can make a big difference (as in the ‘nth unset bit’ problem).

@SPWorley -> Wow that is something I have to try. Would have never thought of that work distribution scheme.

The important thing regarding these divisions is that they are all by compile-time constants. This makes them much faster than divisions with variable divisors. It is likely that a 64-bit division by constant on Maxwell is still slower than the six-iteration loop, but it is likely that a 32-bit unsigned division by constant is faster than the loop. My observation is that the majority of the divisions in the code can in fact be 32-bit unsigned integer division, i.e. the use of 64-bit integer operations (which must be emulated on the GPU) seems largely unnecessary.

Why don’t you use asynchronous streams?
You can launch kernels with the same computation load in separate streams.
If a kernel on a different GPU returns at a different time, you do not have to wait.

Please see cudaStreamCreate().

Switching to 32-bit registers instead of 64-bit from the Arr[10] point onward, and using division rather than repeated subtraction, brought the running time down from 4.29 minutes to about 3.34 minutes (single Titan X):

GPU timing= 200788 ms

Optimal score = 10
1 0 4 5
4 4 1 1
1 0 5 4
4 6 0 0

number = 646027679200

As usual a golden recommendation. This change can also be applied to the permutation problem as well, so that is my next step.

This class of problem is not as glamorous as other CUDA applications, but the degree to which the GPU out-performs the CPU is far, far above the often-referenced 10:1 ratio.

Still need to try the 2 GPU implementation with the GTX 980 and the Titan X, which now could crack the 2 minute mark since it splits very well.


I wondered if SPWorley’s suggestion might be applied to a single GPU, where the “host integer” used to allocate tasks to blocks would be moved to global memory and read/updated without atomics. I guess the questions are:

  1. How much does an atomic increment on global memory cost?
  2. Will moving the task-allocation counter from the host to the GPU mean either more holes or more duplicated work?

Just a thought

ps: There was a scheme a bit like this proposed at UKMAC 2014 by Aidan Chalk.
They also had a presentation at UKMAC 2012 in Bristol.

Those optimizations made an even larger difference in the 2-GPU permutation problem, but I still need to split the workload better, as the updated version assumes equal capabilities.

For generating and evaluating (against a linear N-step test function) all permutations of a 14-element array, which is 87,178,291,200 distinct arrangements, it takes 14 seconds.

With both GPUs running concurrently the GTX 980 finishes the first half in 14 seconds while the Titan X simultaneously finishes the second half in 11.13 seconds.

Starting GPU testing:

Multi-GPU implementation

GPU #1=GeForce GTX 980

rem_start0= 43589042176, rem_start1= 87178187776
GPU timing: 14.026 seconds.

ans0= 8776.32, permutation number 51789820077
ans1= 8738.38, permutation number 28318741677
GPU optimal answer is 8738.38

Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8604.89
Using idx # 8 ,input value= -1111.2, current working return value= -8657.8
Using idx # 1 ,input value= -333.145, current working return value= -8683.43
Using idx # 6 ,input value= -27.79, current working return value= -8685.07
Using idx # 12 ,input value= -42.0099, current working return value= -8686.98
Using idx # 11 ,input value= -1.57, current working return value= -8687.05
Using idx # 9 ,input value= 0.90003, current working return value= -8687
Using idx # 13 ,input value= 3.12354, current working return value= -8686.84
Using idx # 5 ,input value= 2.47, current working return value= -8686.62
Using idx # 10 ,input value= 10.1235, current working return value= -8685.95
Using idx # 7 ,input value= 8.888, current working return value= -8685.14
Using idx # 2 ,input value= 7.1119, current working return value= -8683.71
Using idx # 3 ,input value= 127.001, current working return value= -8658.31
Using idx # 0 ,input value= 31.4234, current working return value= -8626.89

Absolute difference(-8626.89-111.493)= 8738.38

==6756== Profiling application: ConsoleApplication1.exe
==6756== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
259.83ms  13.312us                    -               -         -         -         -  1.3302MB  99.927GB/s  GeForce GTX TIT         1         7  [CUDA memset]
364.32ms  14.240us                    -               -         -         -         -  1.3302MB  93.415GB/s  GeForce GTX 980         2        14  [CUDA memset]
364.33ms  14.0135s         (332558 1 1)       (256 1 1)        27      156B        0B         -           -  GeForce GTX 980         2        14  void _gpu_perm_14<int=131072>(float*, int2*, D_denoms_14_local, float, float, int) [196]
364.44ms  11.1385s         (332558 1 1)       (256 1 1)        28      156B        0B         -           -  GeForce GTX TIT         1         7  void _gpu_perm_14_split<int=131072>(float*, int2*, D_denoms_14_local, float, float, int, __int64) [206]
11.5030s  2.3400ms              (1 1 1)       (256 1 1)        29       96B        0B         -           -  GeForce GTX TIT         1         7  _gpu_perm_last_step_14(float*, int2*, D_denoms_14_local, float, float, __int64, int, __int64, int) [217]
11.5053s  2.2080us                    -               -         -         -         -        4B  1.8116MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]
11.5054s  2.1760us                    -               -         -         -         -        8B  3.6765MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]
14.3779s  2.0065ms              (1 1 1)       (256 1 1)        29       96B        0B         -           -  GeForce GTX 980         2        14  _gpu_perm_last_step_14(float*, int2*, D_denoms_14_local, float, float, __int64, int, __int64, int) [231]
14.3800s  2.4000us                    -               -         -         -         -        4B  1.6667MB/s  GeForce GTX 980         2        14  [CUDA memcpy DtoH]
14.3801s  1.9200us                    -               -         -         -         -        8B  4.1667MB/s  GeForce GTX 980         2        14  [CUDA memcpy DtoH]