Performance regression on CUDA 7.0 final

Hello, all. After recompiling our program with CUDA 7.0, we noticed a strong slowdown (approximately a factor of two) compared with CUDA 6.5.
The program is widely known: it is the BT test from NASA NPB version 3.3, written in Fortran. The high-level programming language Fortran-DVMH is used to parallelize the test.
The source of the program has been optimized and extended with directives of the FDVMH language.
Our compiler (Fortran DVMH) generates the following output code for this test:

  • bt.DVMH.f - the base serial code of the program, extended with RTS-DVMH calls
  • bt.DVMH_cuda.cu - CUDA handlers and CUDA kernels for each parallel loop
  • bt.DVMH_cuda_info.c - special CUDA information for RTS-DVMH

Consider the compilation of bt.DVMH_cuda.cu. The commands for the two toolkits are the following:

  • /opt/cuda/cuda-6.5/bin/nvcc -arch=sm_35 -O3 -Xptxas -v -I/home/DVM/dvm_current/dvm_sys/include -c bt.DVMH_cuda.cu
  • /opt/cuda/cuda-7.0/bin/nvcc -arch=sm_35 -O3 -Xptxas -v -I/home/DVM/dvm_current/dvm_sys/include -c bt.DVMH_cuda.cu
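
The two commands above differ only in the toolkit root, so the comparison can be scripted. The following is a minimal Python sketch, not part of the DVM system: the helper names `nvcc_command` and `ptxas_report` are hypothetical, and the paths are simply the ones quoted in this post.

```python
import subprocess

def nvcc_command(cuda_root,
                 source="bt.DVMH_cuda.cu",
                 include_dir="/home/DVM/dvm_current/dvm_sys/include"):
    """Build the nvcc argument list quoted in the post for a given toolkit root."""
    return [f"{cuda_root}/bin/nvcc", "-arch=sm_35", "-O3",
            "-Xptxas", "-v", f"-I{include_dir}", "-c", source]

def ptxas_report(cuda_root):
    """Compile once and return the text printed by the toolchain.

    The "-Xptxas -v" report typically arrives on stderr; both streams are
    returned so that nothing is lost if that assumption does not hold.
    """
    result = subprocess.run(nvcc_command(cuda_root),
                            capture_output=True, text=True, check=True)
    return result.stdout + result.stderr

if __name__ == "__main__":
    # Only print the command lines here; call ptxas_report() on a machine
    # where both toolkits are installed.
    for root in ("/opt/cuda/cuda-6.5", "/opt/cuda/cuda-7.0"):
        print(" ".join(nvcc_command(root)))
```

Keeping every flag identical and changing only the toolkit root makes it easier to attribute any difference in the reports to the compiler version.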

Our DVMH compiler also processes the CUDA ptxas information and converts it to a readable form. Below is the output of both compilation variants.
CUDA 6.5 ptxas output:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 120 registers, 832 bytes cmem[0]
ptxas info    : Compiling entry function '_Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 77 registers, 480 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 223 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 252 registers, 524 bytes cmem[0], 32 bytes cmem[2]
ptxas info    : Compiling entry function '_Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 544 bytes cmem[0], 32 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 248 registers, 536 bytes cmem[0], 32 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 247 registers, 524 bytes cmem[0], 36 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 436 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23loop_bt_282_cuda_kernelPdiiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_282_cuda_kernelPdiiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 352 bytes cmem[0]

Or, as CUDA 6.5 DVMH PTX info:

Information of CUDA Ptx assembler for compiled module 'bt':
Compiled all kernels for sm_35 architecture
Used 0 bytes of global memory 
Loop on line 282:
  Used 16 registers
  Used 352 bytes of constant memory in bank 0

Loop on line 294:
  Used 223 registers
  Used 408 bytes of constant memory in bank 0

Loop on line 811:
  Used 25 registers
  Used 544 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2

Loop on line 834:
  Used 120 registers
  Used 832 bytes of constant memory in bank 0

Loop on line 1053:
  Used 247 registers
  Used 524 bytes of constant memory in bank 0, 36 bytes of constant memory in bank 2

Loop on line 1677:
  Used 252 registers
  Used 524 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2

Loop on line 2300:
  Used 248 registers
  Used 536 bytes of constant memory in bank 0, 32 bytes of constant memory in bank 2

Loop on line 3177:
  Used 77 registers
  Used 480 bytes of constant memory in bank 0

Loop on line 3238:
  Used 25 registers
  Used 436 bytes of constant memory in bank 0
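
For readers without the DVM system, the conversion from raw ptxas lines to the per-loop summary above can be approximated as follows. This is a rough Python sketch, not the actual DVMH implementation: the regular expressions are inferred from the log lines in this post, and the assumption that the loop's source line is the number embedded in the kernel name (`loop_bt_<line>_cuda_kernel`) is taken from how the two listings correspond.

```python
import re

ENTRY_RE = re.compile(r"Compiling entry function '(?P<name>\w+)' for '(?P<arch>sm_\d+)'")
STACK_RE = re.compile(r"(\d+) bytes stack frame, (\d+) bytes spill stores, (\d+) bytes spill loads")
USED_RE = re.compile(r"Used (\d+) registers(?P<rest>.*)")
CMEM_RE = re.compile(r"(\d+) bytes cmem\[(\d+)\]")
LOOP_RE = re.compile(r"loop_\w+?_(\d+)_cuda_kernel")  # assumed naming scheme

def parse_ptxas(text):
    """Return {mangled kernel name: {line, stack, registers, cmem}}."""
    kernels, current = {}, None
    for line in text.splitlines():
        m = ENTRY_RE.search(line)
        if m:
            loop = LOOP_RE.search(m.group("name"))
            current = {"line": int(loop.group(1)) if loop else None,
                       "stack": 0, "registers": None, "cmem": {}}
            kernels[m.group("name")] = current
            continue
        if current is None:
            continue  # lines before the first kernel, e.g. "0 bytes gmem"
        m = STACK_RE.search(line)
        if m:
            current["stack"] = int(m.group(1))
            continue
        m = USED_RE.search(line)
        if m:
            current["registers"] = int(m.group(1))
            current["cmem"] = {int(bank): int(size)
                               for size, bank in CMEM_RE.findall(m.group("rest"))}
    return kernels

def summarize(kernels):
    """Format parsed kernels in the style of the DVMH listing, sorted by loop line."""
    out = []
    for k in sorted(kernels.values(), key=lambda k: k["line"] or 0):
        out.append(f"Loop on line {k['line']}:")
        out.append(f"  Used {k['registers']} registers")
        if k["stack"]:
            out.append(f"  Used {k['stack']} bytes stack frames")
        if k["cmem"]:
            out.append("  Used " + ", ".join(
                f"{size} bytes of constant memory in bank {bank}"
                for bank, size in sorted(k["cmem"].items())))
    return "\n".join(out)

# Two kernels copied from the CUDA 6.5 log in this post.
SAMPLE = """\
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 25 registers, 544 bytes cmem[0], 32 bytes cmem[2]
ptxas info    : Compiling entry function '_Z23loop_bt_282_cuda_kernelPdiiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_282_cuda_kernelPdiiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 352 bytes cmem[0]
"""

if __name__ == "__main__":
    print(summarize(parse_ptxas(SAMPLE)))
```

The sketch ignores the spill-store and spill-load counts because they are zero for every kernel in both logs.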

And CUDA 7.0 ptxas output:

ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_834_cuda_kernelPdiiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiS_iiiiS_iiiPiddddddddddddddiddddddddddddddddddddd
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 86 registers, 832 bytes cmem[0]
ptxas info    : Compiling entry function '_Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_3177_cuda_kernelPdiiiiS_iidS_dS_dS_dS_dS_Piiddd
    40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 42 registers, 480 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_294_cuda_kernelPiiPdiiiiS0_iiS_ddd
    280 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 49 registers, 408 bytes cmem[0]
ptxas info    : Compiling entry function '_Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_1677_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
    800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 198 registers, 524 bytes cmem[0], 20 bytes cmem[2]
ptxas info    : Compiling entry function '_Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_811_cuda_kernelPdiiiiS_iiiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiS_iiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 23 registers, 544 bytes cmem[0], 20 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_2300_cuda_kernelPdiiiiiS_iiiiS_iiiiPiiddddddddddddddd
    840 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 214 registers, 536 bytes cmem[0], 20 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_1053_cuda_kernelPdiiiiiS_iiiiS_iiiiPiidddddddddddddi
    800 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 198 registers, 524 bytes cmem[0], 24 bytes cmem[2]
ptxas info    : Compiling entry function '_Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii' for 'sm_35'
ptxas info    : Function properties for _Z24loop_bt_3238_cuda_kernelPdiiiidS_dS_dS_dS_dS_Pii
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 24 registers, 436 bytes cmem[0]
ptxas info    : Compiling entry function '_Z23loop_bt_282_cuda_kernelPdiiiiPi' for 'sm_35'
ptxas info    : Function properties for _Z23loop_bt_282_cuda_kernelPdiiiiPi
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 16 registers, 352 bytes cmem[0]

Or, as CUDA 7.0 DVMH PTX info:

Information of CUDA Ptx assembler for compiled module 'bt':
Compiled all kernels for sm_35 architecture
Used 0 bytes of global memory 
Loop on line 282:
  Used 16 registers
  Used 352 bytes of constant memory in bank 0

Loop on line 294:
  Used 49 registers
  Used 280 bytes stack frames
  Used 408 bytes of constant memory in bank 0

Loop on line 811:
  Used 23 registers
  Used 544 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2

Loop on line 834:
  Used 86 registers
  Used 40 bytes stack frames
  Used 832 bytes of constant memory in bank 0

Loop on line 1053:
  Used 198 registers
  Used 800 bytes stack frames
  Used 524 bytes of constant memory in bank 0, 24 bytes of constant memory in bank 2

Loop on line 1677:
  Used 198 registers
  Used 800 bytes stack frames
  Used 524 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2

Loop on line 2300:
  Used 214 registers
  Used 840 bytes stack frames
  Used 536 bytes of constant memory in bank 0, 20 bytes of constant memory in bank 2

Loop on line 3177:
  Used 42 registers
  Used 40 bytes stack frames
  Used 480 bytes of constant memory in bank 0

Loop on line 3238:
  Used 24 registers
  Used 436 bytes of constant memory in bank 0
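
To make the difference between the two toolkits easier to see, the register counts and stack frame sizes from the listings above can be put side by side. The Python sketch below only tabulates numbers copied from this post. A non-zero stack frame means that per-thread data is placed in local memory, which is usually slower than registers; whether that alone explains the slowdown is not established here.

```python
# (registers, stack-frame bytes) per loop line, copied from the listings above.
CUDA_65 = {282: (16, 0), 294: (223, 0), 811: (25, 0), 834: (120, 0),
           1053: (247, 0), 1677: (252, 0), 2300: (248, 0),
           3177: (77, 0), 3238: (25, 0)}
CUDA_70 = {282: (16, 0), 294: (49, 280), 811: (23, 0), 834: (86, 40),
           1053: (198, 800), 1677: (198, 800), 2300: (214, 840),
           3177: (42, 40), 3238: (24, 0)}

def compare(old, new):
    """Return rows (line, regs_old, regs_new, regs_delta, stack_delta)."""
    rows = []
    for line in sorted(old):
        (r0, s0), (r1, s1) = old[line], new[line]
        rows.append((line, r0, r1, r1 - r0, s1 - s0))
    return rows

if __name__ == "__main__":
    print(" line  6.5  7.0  dreg  dstack")
    for line, r0, r1, dreg, dstack in compare(CUDA_65, CUDA_70):
        print(f"{line:>5} {r0:>4} {r1:>4} {dreg:>+5} {dstack:>+7}")
```

In this data, every kernel whose register count dropped by more than two registers also gained a stack frame under CUDA 7.0.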

I ran this test with CUDA 6.5 and CUDA 7.0 on a GTX Titan with driver version 346.47.
CUDA 6.5:

NAS Parallel Benchmarks 3.3.1 - DVMH version - BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size: 162x162x162
 Iterations: 200    dt:   0.000100
 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Verification being performed for class C
 accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1 0.6239811655176E+04 0.6239811655176E+04 0.7287837774866E-15
           2 0.5079323919042E+03 0.5079323919042E+03 0.1119113877493E-15
           3 0.1542353009301E+04 0.1542353009301E+04 0.4422599899090E-15
           4 0.1330238792929E+04 0.1330238792929E+04 0.1709269618747E-15
           5 0.1160408742844E+05 0.1160408742844E+05 0.1097279377060E-14
 Comparison of RMS-norms of solution error
           1 0.1646200836909E+03 0.1646200836909E+03 0.1035901894587E-14
           2 0.1149710790382E+02 0.1149710790382E+02 0.3090093359582E-15
           3 0.4120744620746E+02 0.4120744620746E+02 0.6897226604947E-15
           4 0.3708765105969E+02 0.3708765105969E+02 0.1915847230703E-15
           5 0.3621105305184E+03 0.3621105305184E+03 0.1412802795364E-14
 Verification Successful


 BT Benchmark Completed.
 Class           =                        C
 Size            =              162x162x162
 Iterations      =                      200
 Time in seconds =                    27.70
 Mop/s total     =                103489.44
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1

CUDA 7.0:

NAS Parallel Benchmarks 3.3.1 - DVMH version - BT Benchmark

 No input file inputbt.data. Using compiled defaults
 Size: 162x162x162
 Iterations: 200    dt:   0.000100
 Time step    1
 Time step   20
 Time step   40
 Time step   60
 Time step   80
 Time step  100
 Time step  120
 Time step  140
 Time step  160
 Time step  180
 Time step  200
 Verification being performed for class C
 accuracy setting for epsilon =  0.1000000000000E-07
 Comparison of RMS-norms of residual
           1 0.6239811655176E+04 0.6239811655176E+04 0.8745405329840E-15
           2 0.5079323919042E+03 0.5079323919042E+03 0.2238227754985E-15
           3 0.1542353009301E+04 0.1542353009301E+04 0.2948399932726E-15
           4 0.1330238792929E+04 0.1330238792929E+04 0.1709269618747E-15
           5 0.1160408742844E+05 0.1160408742844E+05 0.6270167868913E-15
 Comparison of RMS-norms of solution error
           1 0.1646200836909E+03 0.1646200836909E+03 0.1035901894587E-14
           2 0.1149710790382E+02 0.1149710790382E+02 0.3090093359582E-15
           3 0.4120744620746E+02 0.4120744620746E+02 0.6897226604947E-15
           4 0.3708765105969E+02 0.3708765105969E+02 0.1915847230703E-15
           5 0.3621105305184E+03 0.3621105305184E+03 0.1412802795364E-14
 Verification Successful


 BT Benchmark Completed.
 Class           =                        C
 Size            =              162x162x162
 Iterations      =                      200
 Time in seconds =                    57.77
 Mop/s total     =                 49615.84
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
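
The size of the regression follows directly from the two reports above. A small Python sketch of the arithmetic, using only the "Time in seconds" and "Mop/s total" values printed by the benchmark:

```python
def ratio(a, b):
    """Return a / b."""
    return a / b

# Values copied from the two benchmark reports above.
TIME_65, TIME_70 = 27.70, 57.77          # seconds
MOPS_65, MOPS_70 = 103489.44, 49615.84   # Mop/s total

time_ratio = ratio(TIME_70, TIME_65)     # how many times longer the CUDA 7.0 run takes
mops_ratio = ratio(MOPS_65, MOPS_70)     # how many times higher the CUDA 6.5 throughput is

if __name__ == "__main__":
    print(f"run time ratio (7.0 / 6.5): {time_ratio:.2f}")
    print(f"Mop/s ratio    (6.5 / 7.0): {mops_ratio:.2f}")
```

Both ratios come out slightly above 2, consistent with the "approximately two times" slowdown stated at the top of the post.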

These figures need little comment. It should be emphasized, however, that several of the kernels use nearly the maximum number of registers, and that may be why the compiler does not handle them well.
All of the code, both the original and the converted versions, is available for download at the following link:
https://drive.google.com/file/d/0BwkVJGSs_ksSUURyTVJtTTNmSVk/view?usp=sharing

If you want to compile bt.fdv (Fortran 77 with FDVMH directives), you need to install the DVM system on your PC. If you have questions about this process, I am ready to help.

Note that there already is a thread dedicated to CUDA 7.0 performance regressions:

https://devtalk.nvidia.com/default/topic/820603/cuda-programming-and-performance/who-else-is-seeing-a-performance-regression-on-cuda-7-0-final-/

Since these forums are not designed as a bug reporting channel, my recommendation would be to file a bug report with NVIDIA at your earliest convenience. The bug reporting form is linked from the CUDA registered developer website (log in at https://developer.nvidia.com). In case you are not yet a registered CUDA developer, sign up is straightforward and confirmation usually occurs within one business day (relative to the Western United States).

Thank you for your response.