Sudden drop in CUDA/Thrust performance

Hi All,

I have developed, and regularly use, a CUDA/Thrust library for finite element simulations, for which performance is critical. The library has a testing framework that monitors the execution time of most operations on a known dataset, ensuring that performance stays consistent. Over the last 2 years, through several CUDA and driver updates, the library's performance has been consistent (i.e. the timing of operations has never increased).

About 3 weeks ago I upgraded to CUDA 8.0 and the testing framework passed all the tests, confirming that the library's performance was at least as good as before.

Suddenly, and without the binary having been recompiled, I am experiencing significant drops in performance. Certain operations based on a mix of custom kernels and Thrust that used to run in 2 ms now run in 50 ms, and others that used to run in 60 ms now run in 170 ms.

I cannot explain what has caused this degradation: I have not recompiled the code, the configuration of the PC has not changed, and no other application is running that could interfere. I may have upgraded the CUDA driver, but I cannot tell whether the upgrade happened before or after I last ran the tests. I have therefore tried using an older driver version than the latest (i.e. I reinstalled ver 372.70 of August 30th), but to no avail; performance remains poor.

Has anyone ever experienced anything similar? Does anyone have a hint as to what might be the cause?

The configuration I am using is as follows:

GPU: GTX Titan Black
OS: Windows 10 64bit Professional
CUDA: 8.0 64bit

Any comment is highly appreciated.

Thank you and Best Regards,


With only the scantest of information (we don't know what this application does, only that its performance was still fine directly after the switch to CUDA 8), combined with your assertion that the hardware and software configuration have not changed since the switch to CUDA 8, I really don't see how one can offer more than the wildest speculation. A few random ideas:

(1) There were other apps running on the machine when you did the test run
(2) A CUDA environment variable has been set that impacts performance, e.g. CUDA_PROFILE=1 (not sure that particular one still exists)
(3) The slowdown is actually due to the host portion of your application, maybe caused by an automatic update of Windows 10
(4) In the app’s configuration, GPU acceleration was inadvertently disabled, or a required license expired with the same effect

You may want to look at the output of nvidia-smi to see whether you spot anything “odd”, and run the app with the NVIDIA profiler to check if there is anything “unusual”.

Do you make any SGEMM calls? (Or perhaps Thrust does?) I noticed in my own code that the cuBLAS bundled with the final release of CUDA 8 in some cases performs significantly more slowly than the CUDA 8 release candidate on my SM3.5 hardware.

The specific issue for me seems to be that cuBLAS no longer uses a specialized "largek" kernel.
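A minimal standalone repro along these lines (the sizes are illustrative, not taken from my application) can be profiled under nvprof to see which sgemm_* kernel cuBLAS dispatches for a large-k shape:

```cpp
// Sketch of a cuBLAS SGEMM repro with a large-k shape (illustrative sizes;
// compile with: nvcc repro.cu -lcublas). Running it under nvprof shows
// which sgemm_* kernel cuBLAS picks for this shape. Error checking omitted
// for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    // Small m and n with a very large k -- the shape the "largek" kernel targets.
    const int m = 64, n = 64, k = 1 << 20;
    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * (size_t)m * k);
    cudaMalloc(&B, sizeof(float) * (size_t)k * n);
    cudaMalloc(&C, sizeof(float) * (size_t)m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = A^T * B; the T/N op combination matches the sgemm_..._tn kernel
    // seen in the release-build profile below.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                &alpha, A, /*lda=*/k, B, /*ldb=*/k, &beta, C, /*ldc=*/m);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```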

CUDA 8.0.27-20733550 RC:

nvprof --unified-memory-profiling off ./modelTrainer

==10137== Profiling application: ./modelTrainer
==10137== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 30.05%  981.18ms       192  5.1103ms  4.4455ms  6.7025ms  void sgemm_largek_lds64<bool=1, bool=0, int=5, int=5, int=4, int=4, int=4, int=34>(float*, float const *, float const *, int, int, int, int, int, int, float const *, float const *, float, float, int, int, int*, int*)
 23.95%  782.08ms       130  6.0160ms  4.2798ms  6.9480ms  sgemm_sm35_ldg_nt_64x16x64x16x16
 15.43%  504.01ms        64  7.8751ms  6.9755ms  8.6578ms  decorGrad_kernel3(float const *, float*, unsigned int, unsigned int)
 13.51%  441.04ms        64  6.8912ms  6.2011ms  7.3529ms  sgemm_sm35_ldg_nn_64x16x64x16x16

CUDA 8.0.44-21122537 Release:

nvprof --unified-memory-profiling off ./modelTrainer

==6506== Profiling application: ./modelTrainer
==6506== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 89.93%  20.7032s       192  107.83ms  103.66ms  127.36ms  sgemm_sm35_ldg_tn_32x16x64x8x16
  3.45%  794.09ms       130  6.1084ms  4.2786ms  7.8776ms  sgemm_sm35_ldg_nt_64x16x64x16x16
  2.22%  510.89ms        64  7.9827ms  6.9781ms  9.3995ms  decorGrad_kernel3(float const *, float*, unsigned int, unsigned int)
  1.94%  447.06ms        64  6.9854ms  6.2167ms  8.0100ms  sgemm_sm35_ldg_nn_64x16x64x16x16

Average sgemm call time went from 5.1103ms to 107.83ms.

OS: Ubuntu 16.04 (4.4.0-38-generic, x86_64 GNU/Linux)
NVIDIA Driver Version: 370.28

@Gogar: Interesting. Have you filed a bug with NVIDIA for this?


Thanks so much for your comments and advice. I confirm nothing else is going on on the computer used for testing (e.g. other processes, licensing issues, disabled GPU acceleration). I will nonetheless follow up with nvidia-smi and the profiler and see what I can make out.

Gogar: thanks for the SGEMM tip, but I am not making any SGEMM calls, nor is the underlying Thrust code (I am using Thrust for some sorting). I also confirm that initially, after migrating to CUDA 8.0, performance was consistent with expectations, and only later did I experience the drop.
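Since the Thrust usage here is sorting, one way to isolate the regression is a Thrust-only micro-benchmark (a hypothetical sketch, not the actual test suite; sizes are made up). If this alone shows the slowdown, the problem is below the application code:

```cpp
// Hypothetical micro-benchmark to check whether thrust::sort alone has
// regressed (compile with: nvcc sort_bench.cu). Times the device sort
// with CUDA events.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/random.h>
#include <cstdio>

int main() {
    const int n = 1 << 24;  // ~16M floats, illustrative size
    thrust::host_vector<float> h(n);
    thrust::default_random_engine rng(1234);
    thrust::uniform_real_distribution<float> dist(0.f, 1.f);
    for (int i = 0; i < n; ++i) h[i] = dist(rng);
    thrust::device_vector<float> d = h;  // copy to device

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    thrust::sort(d.begin(), d.end());
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("thrust::sort of %d floats: %.2f ms\n", n, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Running the same binary of a benchmark like this before and after a driver change would also help pin the regression on the driver rather than the toolkit.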

I will investigate further and post back anything I can find.

Thanks for your tips and your time,

Best Regards,