CUDA 7.5 gives a 30% performance loss vs CUDA 6.5

CUDA 7.5 is slower than 6.5 for the majority of my reference applications.

Hardware: GTX Titan X (reference design, no overclocking), using the WDDM driver

OS: Windows 7 64-bit

compile flags: --use_fast_math

Example #1: permutation of 13 elements of an array against a test function, complexity N!*N plus constant factors.
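For context, a common way to brute-force permutations on a GPU (which may or may not match the linked source) is to map each linear index in [0, N!) to a unique permutation via the factorial number system (Lehmer code). A minimal, hypothetical sketch of that decode step:

// Hypothetical sketch only: decode a unique permutation of N elements from a
// linear index in [0, N!) using the factorial number system (Lehmer code).
// Not taken from the linked source.
__device__ void decode_permutation(unsigned long long idx, int N, int *perm)
{
    int pool[16];                                  // assumes N <= 16
    for (int i = 0; i < N; ++i) pool[i] = i;

    unsigned long long fact = 1;                   // fact = (N-1)!
    for (int i = 2; i < N; ++i) fact *= i;

    for (int i = 0; i < N; ++i) {
        int digit = (int)(idx / fact);             // which remaining element to pick
        idx %= fact;
        perm[i] = pool[digit];
        for (int j = digit; j < N - 1 - i; ++j)    // remove the chosen element
            pool[j] = pool[j + 1];
        if (N - 1 - i > 1) fact /= (N - 1 - i);    // (N-1-i)! -> (N-2-i)!
    }
}

Each thread would then evaluate its permutation against the test function and take part in a reduction for the best value.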

CUDA 6.5:

Testing 13! version.
GPU timing: 1.383 seconds.
GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8645.24
Using idx # 8 ,input value= -1111.2, current working return value= -8700.8
Using idx # 1 ,input value= -333.145, current working return value= -8728.56
Using idx # 6 ,input value= -27.79, current working return value= -8730.29
Using idx # 12 ,input value= -42.0099, current working return value= -8732.29
Using idx # 11 ,input value= -1.57, current working return value= -8732.38
Using idx # 9 ,input value= 0.90003, current working return value= -8732.32
Using idx # 5 ,input value= 2.47, current working return value= -8732.1
Using idx # 10 ,input value= 10.1235, current working return value= -8731.42
Using idx # 7 ,input value= 8.888, current working return value= -8730.61
Using idx # 2 ,input value= 7.1119, current working return value= -8729.19
Using idx # 3 ,input value= 127.001, current working return value= -8703.79
Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86

CUDA 7.5:

Testing 13! version.
GPU timing: 1.519 seconds.
GPU answer is: 8783.86

Permutation as determined by OK CUDA implementation is as follows:
Start value= -7919.02
Using idx # 4 ,input value= -12345.7, current working return value= -8645.24
Using idx # 8 ,input value= -1111.2, current working return value= -8700.8
Using idx # 1 ,input value= -333.145, current working return value= -8728.56
Using idx # 6 ,input value= -27.79, current working return value= -8730.29
Using idx # 12 ,input value= -42.0099, current working return value= -8732.29
Using idx # 11 ,input value= -1.57, current working return value= -8732.38
Using idx # 9 ,input value= 0.90003, current working return value= -8732.32
Using idx # 5 ,input value= 2.47, current working return value= -8732.1
Using idx # 10 ,input value= 10.1235, current working return value= -8731.42
Using idx # 7 ,input value= 8.888, current working return value= -8730.61
Using idx # 2 ,input value= 7.1119, current working return value= -8729.19
Using idx # 3 ,input value= 127.001, current working return value= -8703.79
Using idx # 0 ,input value= 31.4234, current working return value= -8672.37

Absolute difference(-8672.37-111.493)= 8783.86

Link to the exact source code used (the multi-GPU version is commented out but can be enabled; the build will generate some unused-variable warnings related to the CPU reference implementation, which is toggled off to save time):

https://sites.google.com/site/cudapermutations/

Example #2: brute-force all N-choose-3 combinations of triangles over a set of 2D points to determine which triangle encloses the highest number of interior points (using the same random seed for point generation).
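For reference, the inner test in this kind of brute force is usually an orientation (signed-area) point-in-triangle check; a minimal sketch of that test, which may differ from the kernel in the repository:

// Hypothetical sketch of the inner test: is point p strictly inside
// triangle (a, b, c)? Uses 2D cross products (signed areas); points on an
// edge count as outside here. Illustrative only.
__host__ __device__ inline float cross2(float2 o, float2 a, float2 b)
{
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

__host__ __device__ inline bool point_in_triangle(float2 p, float2 a, float2 b, float2 c)
{
    const float d0 = cross2(a, b, p);
    const float d1 = cross2(b, c, p);
    const float d2 = cross2(c, a, p);
    // inside if p lies on the same side of all three directed edges
    return (d0 > 0.0f && d1 > 0.0f && d2 > 0.0f) ||
           (d0 < 0.0f && d1 < 0.0f && d2 < 0.0f);
}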

CUDA 6.5:

CPU solution timing: 90754
CPU best value= 246 , point indexes ( 222 , 153 , 48 ).
CUDA timing: 175
GPU best value= 246 , point indexes ( 233 , 222 , 48 ).

Note: If there is more than one triangle which has the same optimal value, the GP

Success. GPU value matches CPU results!. GPU was 518.594 faster that 3.9 ghz CPU.

CUDA 7.5:

CPU solution timing: 90711
CPU best value= 246 , point indexes ( 222 , 153 , 48 ).
CUDA timing: 152
GPU best value= 246 , point indexes ( 233 , 222 , 48 ).

Note: If there is more than one triangle which has the same optimal value, the GPU

Success. GPU value matches CPU results!. GPU was 596.783 faster that 3.9 ghz CPU.

Note: times are in milliseconds, measured with the host Windows timer.
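A host-side millisecond timer around GPU work has to bracket the launch with a device synchronization, since kernel launches are asynchronous; a minimal sketch of the idea, not necessarily the timer used here:

// Hypothetical sketch of host-side millisecond timing around GPU work.
// cudaDeviceSynchronize() is needed because kernel launches return immediately.
#include <chrono>
#include <cuda_runtime.h>

template <typename F>
float time_ms(F launch_work)
{
    cudaDeviceSynchronize();                               // drain prior work
    const auto t0 = std::chrono::high_resolution_clock::now();
    launch_work();                                         // enqueue kernel(s)
    cudaDeviceSynchronize();                               // wait for completion
    const auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<float, std::milli>(t1 - t0).count();
}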

Link to source:

https://github.com/OlegKonings/CUDA_brute_triangle/blob/master/EXP3/EXP3/EXP3.cu

Example #3: a double-precision dynamic-programming probability problem which I converted to a naive CUDA implementation.
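Without the exact recurrence, the usual structure of such a naive port is one kernel launch per DP stage, each thread computing one state of the current stage from the previous one while ping-ponging two device buffers; a purely illustrative sketch with a placeholder recurrence:

// Hypothetical sketch of a naive DP port: the recurrence below is a
// placeholder, not the one in the linked source; only the launch structure
// is the point.
__global__ void dp_stage(const double *prev, double *curr, int n, double p)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const double stay  = prev[i] * (1.0 - p);
    const double enter = (i > 0) ? prev[i - 1] * p : 0.0;
    curr[i] = stay + enter;
}

// Host loop (error checking omitted):
//   for (int s = 0; s < num_stages; ++s) {
//       dp_stage<<<(n + 255) / 256, 256>>>(d_prev, d_curr, n, p);
//       std::swap(d_prev, d_curr);   // reuse buffers for the next stage
//   }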

CUDA 6.5:

num= 2000
CPU solution timing: 5983
CPU answer= 71.3739

CUDA timing (including all memory transfers and ops): 717, answer= 71.3739

CUDA 7.5:

num= 2000
CPU solution timing: 5971
CPU answer= 71.3739

CUDA timing (including all memory transfers and ops): 724, answer= 71.3739

Link to source:

https://github.com/OlegKonings/DP13_suka/blob/master/EXP3/EXP3/EXP3.cu

Example #4: Monte Carlo simulation using cuRAND of 10,000,000 optical photons through a 3D volume of simple shapes, returning ppath, start, and exit data for each photon, with the phase/scattering function dependent on the medium (called via a MATLAB MEX file).
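For context, the usual cuRAND pattern for this kind of simulation is one Philox state per photon thread, seeded once and then drawn from inside the propagation loop; a minimal sketch, not the actual simulation kernel:

// Hypothetical sketch: per-thread curandStatePhilox4_32_10_t state with the
// same seed but a unique subsequence per thread. Not the actual kernel.
#include <curand_kernel.h>

__global__ void photon_kernel(unsigned long long seed, int n_photons)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_photons) return;

    curandStatePhilox4_32_10_t state;
    curand_init(seed, tid, 0, &state);   // seed, subsequence, offset

    // ... propagate the photon; draw step lengths / scattering angles:
    float u = curand_uniform(&state);    // uniform in (0, 1]
    (void)u;                             // placeholder for the real physics
}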

CUDA 6.5:

Using single GPU GeForce GTX TITAN X with compute capability 5.2 

Device bytes allocated=1400000560 

Time rand gen= 0.011000 

Running simulation saving exit data!

Time MC kernel = 1.872000 

Time debug kernel = 0.000000 

Tot detected= 6908937, Tot timed out= 0

CUDA 7.5:

Using single GPU GeForce GTX TITAN X with compute capability 5.2 

Device bytes allocated=1400000560 

Time rand gen= 0.009000 

Running simulation saving exit data!

Time MC kernel = 2.025000 

Time debug kernel = 0.000000 

Tot detected= 6908932, Tot timed out= 0

Times are in seconds.

Conclusion:

At least for my code, CUDA 6.5 is faster than CUDA 7.5 for 3 out of 4 applications. This does not account for any of the new features in CUDA 7.5, but one would still expect performance to be at least the same.
Granted, this is a small sample set and the difference is not huge, so maybe it is noise…

Have you filed, or were you planning to file, bugs for the 3 cases that were slower? (It seems like you’ve done most of the work already…) Case 1 appears to be about a 10% difference, Case 3 about a 1% difference, and Case 4 about 10%?

As a rule of thumb, performance changes of less than 5% should be considered noise. The compiler has tens of phases, a fair number of them controlled by heuristics, so considering the myriad combinations this gives rise to, that kind of noise level is unavoidable across releases, and is not unusual for compilers in general, CPU or GPU.

I would suggest filing bugs for any performance regression > 20%, and for regressions > 5% in mission-critical code (e.g. frame-rate requirements in medical imaging). Providing an easy-to-run benchmark will definitely be a plus when filing a bug report. You may also want to consider giving NVIDIA permission to use such benchmarks for their internal compiler performance tracking.

I will file a bug as you suggest.

Case #4 will be difficult to document as a reproducible case, but I will try, as it is the one most important to my work. What is additionally interesting is that the two versions return slightly different results (the number of photons which hit detectors) while using the same input parameters and the same random seed. Assuming that is not a bug in my code, it would mean either that the curandStatePhilox4_32_10_t uniform random number generation has changed, or that some floating-point calculations produce a different result.
I will look into that now.
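One way to separate the two possibilities is to dump the first few Philox draws for a fixed seed from a binary built with each toolkit and diff them; if the streams match bit-for-bit, the discrepancy is in the floating-point math (e.g. different FMA contraction under --use_fast_math). A small sketch of such a probe, with a placeholder seed:

// Hypothetical probe: dump the first few Philox uniforms for a fixed seed so
// builds from CUDA 6.5 and 7.5 can be compared. Illustrative only.
#include <cstdio>
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void dump_philox(unsigned long long seed, float *out, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        curandStatePhilox4_32_10_t st;
        curand_init(seed, 0, 0, &st);
        for (int i = 0; i < n; ++i) out[i] = curand_uniform(&st);
    }
}

int main()
{
    const int n = 16;
    float h_out[16];
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));
    dump_philox<<<1, 1>>>(12345ULL, d_out, n);   // placeholder seed
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("%2d: %.9g\n", i, h_out[i]);
    cudaFree(d_out);
    return 0;
}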

I’ll chime in too… I’m not pleased with 7.5 so far.

I’m still seeing massive spills in a few spots where there were none before with 7.0.18 RC (7.0 GA regressed).

I’ve found that retuning the launch_bounds settings, or even removing any --maxrregcount flags for kernels, helps a lot for performance on CUDA 7.5. Have you tried this retuning?
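To illustrate what that retuning typically looks like: a per-kernel __launch_bounds__ instead of a blanket --maxrregcount, with numbers that have to be re-measured for each toolkit version; the values below are placeholders, not a recommendation:

// Hypothetical example of per-kernel tuning: cap threads per block and hint a
// minimum number of resident blocks per SM, instead of a global --maxrregcount.
// The numbers are placeholders and need re-measuring per toolkit version.
__global__ void __launch_bounds__(256, 2)
scale_kernel(const float *in, float *out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}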

Thanks for the tip. Removing the launch_bounds and --maxrregcount did improve things a lot. I still need to retune the threads per block…

With some tuning, the x11 algo is now doing 2860 KHASH vs 2940 KHASH on a low-profile 750 Ti (38 W TDP)
(ccminer -a x11 --benchmark)

2.7% slower than CUDA 6.5.

The lyra2v2 algo showed an increase of 50khash from 4400 to 4450 (+1.1%)
(ccminer -a lyra2v2 --benchmark)

After two weeks of working with CUDA 7.5, I can confirm that most of the spill problems in my kernels are due to CUDA 7.x having issues with uninitialized variables.

A workaround and two examples are described in #5.

Note that initializing variables is cheap but not free.
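As a hypothetical illustration of the pattern, not one of the two examples referenced above: a local that is only provably written on some paths, versus the same local given an explicit default:

// Hypothetical illustration only. "best" is left uninitialized when n == 0,
// which is the kind of case the compiler cannot prove away; the commented-out
// initialization is the cheap-but-not-free workaround.
#include <cfloat>

__device__ float max_of(const float *v, int n)
{
    float best;                    // uninitialized local
    // float best = -FLT_MAX;      // workaround: explicit default value
    for (int i = 0; i < n; ++i)
        if (i == 0 || v[i] > best) best = v[i];
    return best;
}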

I’ll leave others to hypothesize if this is a live variable analysis issue in ptxas.

Whatever the case, something changed in ptxas 7.x.

Onward!

My array permutation code did have uninitialized variables, which I changed to default values; I then ran profiling comparing CUDA 6.5 and CUDA 7.5 with the exact same code.

This is with the TCC driver

CUDA 6.5:

==2288== Profiling application: ConsoleApplication1.exe
==2288== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
200.03ms  5.2480us                    -               -         -         -         -  247.44KB  44.965GB/s  GeForce GTX TIT         1         7  [CUDA memset]
200.08ms  1.38543s          (63344 1 1)       (256 1 1)        25      172B        0B         -           -  GeForce GTX TIT         1         7  void _gpu_perm_13<int=98304>(float*, int2*, D_denoms_14_local,
1.58557s  889.66us              (1 1 1)       (256 1 1)        27       96B        0B         -           -  GeForce GTX TIT         1         7  _gpu_perm_last_step_13(float*, int2*, D_denoms_14_local, float
int) [200]
1.58649s  2.1440us                    -               -         -         -         -        4B  1.7792MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]
1.58650s  1.8880us                    -               -         -         -         -        8B  4.0410MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

CUDA 7.5:

==4696== Profiling application: ConsoleApplication1.exe
==4696== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
208.21ms  5.9520us                    -               -         -         -         -  247.44KB  39.646GB/s  GeForce GTX TIT         1         7  [CUDA memset]
208.26ms  1.49405s          (63344 1 1)       (256 1 1)        28      172B        0B         -           -  GeForce GTX TIT         1         7  void _gpu_perm_13<int=98304>(float*, int2*, D_denoms_14_local, floa
1.70236s  884.82us              (1 1 1)       (256 1 1)        30       96B        0B         -           -  GeForce GTX TIT         1         7  _gpu_perm_last_step_13(float*, int2*, D_denoms_14_local, float, flo
int) [200]
1.70327s  2.0800us                    -               -         -         -         -        4B  1.8340MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]
1.70328s  1.8560us                    -               -         -         -         -        8B  4.1107MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

So CUDA 6.5 is still faster and uses fewer registers.

On the positive side, my CT filtered back-projection code, using the RabbitCT benchmark data set, is about 10% faster with CUDA 7.5 than with CUDA 6.5.

Odd stuff…

Interesting!

Did you have spills before? Or was it just a general performance difference?

None that appeared in the build output. For this application I just needed the registers used by the main kernel to be <= 32, which is the case for both compiled outputs.

Maybe if some SASS ninja looked at the generated assembly they could see what changed in compilation between the two CUDA versions.
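For what it is worth, the machine code from each toolkit can be dumped and diffed with cuobjdump, e.g. running cuobjdump --dump-sass on the executable built with each CUDA version and comparing the two text outputs (the binary names here are whatever your builds produce); nvdisasm works on extracted cubin files as well.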

Hi there! Can you provide the NV bug that you filed? :-)

It’s old, but it’s here:

NVIDIA Incident Report Update (1679834) - Cuda 7.5 give a 30% performance loss vs cuda 6.5

Anyway, I managed to recode the kernels to reduce the losses (CUDA 7.5 tuning).

Check out my GitHub.

(Commits · sp-hash/ccminer · GitHub)

Here is my bitcointalk thread (700,000 views): CCminer (SP-MOD) Modded NVIDIA Maxwell kernels.

I have 6 private miners for the people who donate (donators will get updates).

  1. 0.1BTC: Pentablake +100-120% (3 releases)
  2. 0.1BTC: Cryptonight +10% (one release)
  3. 0.1BTC: Spreadcoin +10-20% (with full sourcecode / linux compatible) (9 releases)
  4. 0.1BTC: All nicehash algos optimized. 0-10% (6 releases)(x11,x13,x15,nist5,quark,lyra2v2,neoscrypt)
  5. 0.1BTC: decred +18-25% (9 releases) (Full sourcecode(linux) 0.4BTC)
  6. 0.2BTC: Vcash(+13%+decred(+18-25%) (0.1 btc discount for the decred buyers) (6 releases)

Thanks for filing the bug. This situation should be improved in the CUDA 8 RC, which should be available soon.