CUDA 7.5 is slower than CUDA 6.5 for the majority of my reference applications.
Hardware: GTX Titan X (reference clocks, no overclocking), using the WDDM driver
OS: Windows 7 64-bit
Compile flags: --use_fast_math
Example #1: permutation of 13 elements of an array against a test function, complexity N!*N plus constant factors.
CUDA 6.5:
Testing 13! version.
GPU timing: <b>1.383 seconds.</b>
GPU answer is: 8783.86
Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8645.24
Using idx # 8 ,input value= -1111.2, current working return value= -8700.8
Using idx # 1 ,input value= -333.145, current working return value= -8728.56
Using idx # 6 ,input value= -27.79, current working return value= -8730.29
Using idx # 12 ,input value= -42.0099, current working return value= -8732.29
Using idx # 11 ,input value= -1.57, current working return value= -8732.38
Using idx # 9 ,input value= 0.90003, current working return value= -8732.32
Using idx # 5 ,input value= 2.47, current working return value= -8732.1
Using idx # 10 ,input value= 10.1235, current working return value= -8731.42
Using idx # 7 ,input value= 8.888, current working return value= -8730.61
Using idx # 2 ,input value= 7.1119, current working return value= -8729.19
Using idx # 3 ,input value= 127.001, current working return value= -8703.79
Using idx # 0 ,input value= 31.4234, current working return value= -8672.37
Absolute difference(-8672.37-111.493)= 8783.86
CUDA 7.5:
Testing 13! version.
GPU timing: <b>1.519 seconds.</b>
GPU answer is: 8783.86
Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8645.24
Using idx # 8 ,input value= -1111.2, current working return value= -8700.8
Using idx # 1 ,input value= -333.145, current working return value= -8728.56
Using idx # 6 ,input value= -27.79, current working return value= -8730.29
Using idx # 12 ,input value= -42.0099, current working return value= -8732.29
Using idx # 11 ,input value= -1.57, current working return value= -8732.38
Using idx # 9 ,input value= 0.90003, current working return value= -8732.32
Using idx # 5 ,input value= 2.47, current working return value= -8732.1
Using idx # 10 ,input value= 10.1235, current working return value= -8731.42
Using idx # 7 ,input value= 8.888, current working return value= -8730.61
Using idx # 2 ,input value= 7.1119, current working return value= -8729.19
Using idx # 3 ,input value= 127.001, current working return value= -8703.79
Using idx # 0 ,input value= 31.4234, current working return value= -8672.37
Absolute difference(-8672.37-111.493)= 8783.86
Link to the exact source code used (the multi-GPU version is commented out but can be enabled; it will generate some warnings about unused variables related to the CPU reference implementation, which is toggled off to save time):
https://sites.google.com/site/cudapermutations/
Example #2: brute force over all N choose 3 combinations of triangles to determine which triangle of 2D points encloses the highest number of internal points (using the same random seed for point generation).
CUDA 6.5:
CPU solution timing: 90754
CPU best value= 246 , point indexes ( 222 , 153 , 48 ).
CUDA timing: <b>175</b>
GPU best value= 246 , point indexes ( 233 , 222 , 48 ).
Note: If there is more than one triangle which has the same optimal value, the GPU may return a different triangle than the CPU.
Success. GPU value matches CPU results! GPU was 518.594x faster than the 3.9 GHz CPU.
CUDA 7.5:
CPU solution timing: 90711
CPU best value= 246 , point indexes ( 222 , 153 , 48 ).
CUDA timing: <b>152</b>
GPU best value= 246 , point indexes ( 233 , 222 , 48 ).
Note: If there is more than one triangle which has the same optimal value, the GPU may return a different triangle than the CPU.
Success. GPU value matches CPU results! GPU was 596.783x faster than the 3.9 GHz CPU.
Note: times are in milliseconds, measured with the host Windows timer.
Link to source:
https://github.com/OlegKonings/CUDA_brute_triangle/blob/master/EXP3/EXP3/EXP3.cu
Example #3: a double-precision dynamic programming probability problem which I converted to a naive CUDA implementation.
CUDA 6.5:
num= 2000
CPU solution timing: 5983
CPU answer= 71.3739
CUDA timing(including all memory transfers and ops): <b>717</b> , answer= 71.3739
CUDA 7.5:
num= 2000
CPU solution timing: 5971
CPU answer= 71.3739
CUDA timing(including all memory transfers and ops): <b>724</b> , answer= 71.3739
Link to source:
https://github.com/OlegKonings/DP13_suka/blob/master/EXP3/EXP3/EXP3.cu
Example #4: Monte Carlo simulation, using cuRAND, of 10,000,000 optical photons through a 3D volume of simple shapes, returning the partial path (ppath), start, and exit data for each photon, with the phase/scattering function dependent on the medium (called via a MATLAB MEX file).
CUDA 6.5:
Using single GPU GeForce GTX TITAN X with compute capability 5.2
Device bytes allocated=1400000560
Time rand gen= 0.011000
Running simulation saving exit data!
Time MC kernel =<b> 1.872000</b>
Time debug kernel = 0.000000
Tot detected= 6908937, Tot timed out= 0
CUDA 7.5:
Using single GPU GeForce GTX TITAN X with compute capability 5.2
Device bytes allocated=1400000560
Time rand gen= 0.009000
Running simulation saving exit data!
Time MC kernel = <b>2.025000 </b>
Time debug kernel = 0.000000
Tot detected= 6908932, Tot timed out= 0
Time in seconds.
Conclusion:
At least for my applications, CUDA 6.5 is faster than CUDA 7.5 in 3 out of 4 cases. This does not account for any of the new features in CUDA 7.5, but one would still expect performance to be at least the same.
Granted, this is a small sample set and the differences are not huge, so maybe it is noise…