Those optimizations made an even larger difference in the 2-GPU permutation problem, but I still need to split the workload better, since the updated version assumes both GPUs have equal capabilities.

Generating and evaluating (against a linear N-step test function) all permutations of a 14-element array (14! = 87,178,291,200 distinct arrangements) takes 14 seconds.
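For readers who want to see how a linear permutation number can map to an actual arrangement, here is a minimal host-side sketch using the factorial number system. This is an assumption about the general technique, not the kernel code from this post; `factorial` and `nth_permutation` are hypothetical names chosen for illustration. Note that 14! does not fit in 32 bits, so the index has to be a 64-bit integer.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int N = 14;  // array length used in the post

// n! as a 64-bit value; 14! = 87,178,291,200 overflows a 32-bit integer.
std::uint64_t factorial(int n) {
    std::uint64_t f = 1;
    for (int i = 2; i <= n; ++i) f *= static_cast<std::uint64_t>(i);
    return f;
}

// Decode a linear index in [0, N!) into the index-th permutation of
// 0..N-1 in lexicographic order (factorial number system).
// Hypothetical helper: the real kernels may order permutations differently.
std::array<int, N> nth_permutation(std::uint64_t index) {
    assert(index < factorial(N));

    std::array<int, N> pool{};
    for (int i = 0; i < N; ++i) pool[i] = i;

    std::array<int, N> perm{};
    std::uint64_t denom = factorial(N - 1);  // weight of the leading digit
    for (int pos = 0; pos < N; ++pos) {
        const int remaining = N - pos;       // elements still in the pool
        const int digit = static_cast<int>(index / denom);
        index %= denom;
        perm[pos] = pool[digit];
        // Remove the chosen element by shifting the tail of the pool left.
        for (int j = digit; j + 1 < remaining; ++j) pool[j] = pool[j + 1];
        // Step the weight down from (remaining-1)! to (remaining-2)!.
        if (remaining > 1) denom /= static_cast<std::uint64_t>(remaining - 1);
    }
    return perm;
}
```

With a mapping like this, each thread can work on its own contiguous range of indices without any communication, which is what makes the problem easy to cut into per-GPU halves.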

With both GPUs running concurrently, the GTX 980 finishes the first half in 14 seconds while the Titan X finishes the second half in 11.14 seconds.

```
Starting GPU testing:
Multi-GPU implementation
GPU #0=GeForce GTX TITAN X
GPU #1=GeForce GTX 980
rem_start0= 43589042176, rem_start1= 87178187776
GPU timing: 14.026 seconds.
ans0= 8776.32, permutation number 51789820077
ans1= 8738.38, permutation number 28318741677
GPU optimal answer is 8738.38
Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8604.89
Using idx # 8 ,input value= -1111.2, current working return value= -8657.8
Using idx # 1 ,input value= -333.145, current working return value= -8683.43
Using idx # 6 ,input value= -27.79, current working return value= -8685.07
Using idx # 12 ,input value= -42.0099, current working return value= -8686.98
Using idx # 11 ,input value= -1.57, current working return value= -8687.05
Using idx # 9 ,input value= 0.90003, current working return value= -8687
Using idx # 13 ,input value= 3.12354, current working return value= -8686.84
Using idx # 5 ,input value= 2.47, current working return value= -8686.62
Using idx # 10 ,input value= 10.1235, current working return value= -8685.95
Using idx # 7 ,input value= 8.888, current working return value= -8685.14
Using idx # 2 ,input value= 7.1119, current working return value= -8683.71
Using idx # 3 ,input value= 127.001, current working return value= -8658.31
Using idx # 0 ,input value= 31.4234, current working return value= -8626.89
Absolute difference(-8626.89-111.493)= 8738.38
==6756== Profiling application: ConsoleApplication1.exe
==6756== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
259.83ms 13.312us - - - - - 1.3302MB 99.927GB/s GeForce GTX TIT 1 7 [CUDA memset]
364.32ms 14.240us - - - - - 1.3302MB 93.415GB/s GeForce GTX 980 2 14 [CUDA memset]
364.33ms 14.0135s (332558 1 1) (256 1 1) 27 156B 0B - - GeForce GTX 980 2 14 void _gpu_perm_14<int=131072>(float*, int2*, D_denoms_14_local, float, float, int) [196]
364.44ms 11.1385s (332558 1 1) (256 1 1) 28 156B 0B - - GeForce GTX TIT 1 7 void _gpu_perm_14_split<int=131072>(float*, int2*, D_denoms_14_local, float, float, int, __int64) [206]
11.5030s 2.3400ms (1 1 1) (256 1 1) 29 96B 0B - - GeForce GTX TIT 1 7 _gpu_perm_last_step_14(float*, int2*, D_denoms_14_local, float, float, __int64, int, __int64, int) [217]
11.5053s 2.2080us - - - - - 4B 1.8116MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
11.5054s 2.1760us - - - - - 8B 3.6765MB/s GeForce GTX TIT 1 7 [CUDA memcpy DtoH]
14.3779s 2.0065ms (1 1 1) (256 1 1) 29 96B 0B - - GeForce GTX 980 2 14 _gpu_perm_last_step_14(float*, int2*, D_denoms_14_local, float, float, __int64, int, __int64, int) [231]
14.3800s 2.4000us - - - - - 4B 1.6667MB/s GeForce GTX 980 2 14 [CUDA memcpy DtoH]
14.3801s 1.9200us - - - - - 8B 4.1667MB/s GeForce GTX 980 2 14 [CUDA memcpy DtoH]
```
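One possible way to fix the uneven split is to divide the range in proportion to the kernel durations in the profile above (11.1385 s on the Titan X, 14.0135 s on the GTX 980, each for an equal half). The sketch below is an illustration under assumptions, not the code used here: `Split` and `split_by_throughput` are hypothetical names, it assumes run time scales linearly with the number of permutations, and it rounds to a multiple of 131072 on the guess that this template parameter is the number of permutations per block (332558 blocks × 131072 = 43,589,042,176, which matches `rem_start0`).

```cpp
#include <cassert>
#include <cstdint>

// Number of work items (permutations) handed to each device.
struct Split {
    std::uint64_t count0;  // items for GPU 0
    std::uint64_t count1;  // items for GPU 1
};

// Divide `total` items in proportion to measured throughput.
// seconds0 / seconds1: time each GPU needed for an equal-sized share in a
// calibration run. granularity: items per thread block (assumed value),
// so GPU 0's share is a whole number of blocks.
Split split_by_throughput(std::uint64_t total, double seconds0,
                          double seconds1, std::uint64_t granularity) {
    assert(seconds0 > 0.0 && seconds1 > 0.0 && granularity > 0);
    // Throughput is proportional to 1/seconds, so
    // share0 = (1/s0) / (1/s0 + 1/s1) = s1 / (s0 + s1).
    const double share0 = seconds1 / (seconds0 + seconds1);
    std::uint64_t count0 =
        static_cast<std::uint64_t>(share0 * static_cast<double>(total));
    count0 -= count0 % granularity;  // round down to whole blocks
    return {count0, total - count0}; // GPU 1 takes everything that is left
}
```

Calling `split_by_throughput(87178291200ULL, 11.1385, 14.0135, 131072)` gives the Titan X (GPU 0 in the log) roughly 55.7% of the permutations. If the linear-scaling assumption holds, both cards would finish in about 12.4 seconds instead of the slower one taking 14, though that figure is an estimate and not a measurement.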