Sudden drop in CUDA/Thrust performance

andrea_nes · October 20, 2016, 3:05pm

Hi All,

I have developed, and regularly use, a CUDA/Thrust library for finite element simulations, for which performance is critical. The library has a testing framework that monitors the execution time of most operations on a known dataset, ensuring that performance is consistent. Over the last 2 years, through several CUDA and driver updates, the performance of the library has been consistent (i.e. the timing of operations never increased).

About 3 weeks ago I upgraded to CUDA 8.0 and the testing framework passed all the tests, confirming that the performance of the library was at least as good as before.

Suddenly, and without my having recompiled the binary, I am experiencing significant drops in performance. Certain operations based on a mix of custom kernels and Thrust that used to run in 2 ms now take 50 ms, and others that used to run in 60 ms now take 170 ms.
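A timing check of the kind described above could look roughly like the following. This is only a minimal host-side sketch, not the actual framework from this post: the names `median_ms` and `is_regression` and the 10% tolerance used below are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Median wall-clock time of `op` in milliseconds over `runs` repetitions
// (the upper middle sample when `runs` is even).
// For GPU work, `op` has to end with a device synchronization such as
// cudaDeviceSynchronize(); otherwise only the asynchronous launch is timed.
double median_ms(const std::function<void()>& op, int runs) {
    assert(runs > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        op();
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// True when `measured_ms` exceeds the stored baseline by more than
// `tolerance` (0.10 means that up to 10% slower is still accepted).
bool is_regression(double measured_ms, double baseline_ms, double tolerance) {
    return measured_ms > baseline_ms * (1.0 + tolerance);
}
```

With the figures quoted above, `is_regression(50.0, 2.0, 0.10)` and `is_regression(170.0, 60.0, 0.10)` would both flag a regression. Using the median rather than the mean is a design choice that makes the check less sensitive to a single slow outlier run.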

I cannot explain what has caused this degradation, as I have not recompiled the code, the configuration of the PC has not changed, and no other application is running that could interfere. I might have upgraded the CUDA driver, but I cannot tell whether the upgrade was before or after I last ran the tests. I have therefore tried an older driver version than the latest one (i.e. I have reinstalled ver 372.70 of August 30th), but to no avail: performance continues to be poor.

Has anyone ever experienced anything similar? Does anyone have a hint as to what might be the cause?

The configuration I am using is as follows:

GPU: GTX Titan Black
OS: Windows 10 64bit Professional
CUDA: 8.0 64bit

Any comment is highly appreciated.

Thank you and Best Regards,

Andrea

njuffa · October 20, 2016, 4:49pm

With only the scantiest of information (we don’t know what this application does, only that its performance was still fine directly after the switch to CUDA 8) combined with your assertion that hardware and software configuration have not changed since the switch to CUDA 8, I really don’t see how one can offer more than the wildest speculation. A few random ideas:

(1) There were other apps running on the machine when you did the test run
(2) A CUDA environment variable has been set that impacts performance, e.g. CUDA_PROFILE=1 (not sure that particular one still exists)
(3) The slowdown is actually due to the host portion of your application, maybe caused by an automatic update of Windows 10
(4) In the app’s configuration, GPU acceleration was inadvertently disabled, or a required license expired with the same effect

You may want to look at the output of nvidia-smi to see whether you spot anything “odd”, and run the app with the NVIDIA profiler to check if there is anything “unusual”.
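For idea (2), a quick way to see whether such a variable is set in the process environment is sketched below. The list of names is an assumption chosen for illustration: CUDA_PROFILE is the one mentioned above, while CUDA_LAUNCH_BLOCKING and CUDA_VISIBLE_DEVICES are added here as other commonly set CUDA variables. The function name `cuda_env_overrides` is hypothetical.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Returns "NAME=value" for each listed variable that is set in the current
// environment. The list is illustrative, not exhaustive.
std::vector<std::string> cuda_env_overrides() {
    const char* names[] = {"CUDA_PROFILE", "CUDA_LAUNCH_BLOCKING",
                           "CUDA_VISIBLE_DEVICES"};
    std::vector<std::string> found;
    for (const char* name : names) {
        // std::getenv returns a null pointer when the variable is not set.
        if (const char* value = std::getenv(name)) {
            found.push_back(std::string(name) + "=" + value);
        }
    }
    return found;
}
```

Calling this at application start-up and logging the result alongside the timing data would show whether the environment differed between a fast run and a slow one.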

Gogar · October 20, 2016, 6:06pm

Do you make any SGEMM calls? (Or perhaps Thrust does?) I noticed in my own code that the cuBLAS bundled with the final release of CUDA 8 in some cases performs significantly slower than the one in the release candidate of CUDA 8 on my SM3.5 hardware.

The specific issue for me seems to lie in cuBLAS no longer utilizing a specialized “largek” kernel.

CUDA 8.0.27-20733550 RC:

nvprof --unified-memory-profiling off ./modelTrainer

==10137== Profiling application: ./modelTrainer
==10137== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 30.05%  981.18ms       192  5.1103ms  4.4455ms  6.7025ms  void sgemm_largek_lds64<bool=1, bool=0, int=5, int=5, int=4, int=4, int=4, int=34>(float*, float const *, float const *, int, int, int, int, int, int, float const *, float const *, float, float, int, int, int*, int*)
 23.95%  782.08ms       130  6.0160ms  4.2798ms  6.9480ms  sgemm_sm35_ldg_nt_64x16x64x16x16
 15.43%  504.01ms        64  7.8751ms  6.9755ms  8.6578ms  decorGrad_kernel3(float const *, float*, unsigned int, unsigned int)
 13.51%  441.04ms        64  6.8912ms  6.2011ms  7.3529ms  sgemm_sm35_ldg_nn_64x16x64x16x16

CUDA 8.0.44-21122537 Release:

nvprof --unified-memory-profiling off ./modelTrainer

==6506== Profiling application: ./modelTrainer
==6506== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 89.93%  20.7032s       192  107.83ms  103.66ms  127.36ms  sgemm_sm35_ldg_tn_32x16x64x8x16
  3.45%  794.09ms       130  6.1084ms  4.2786ms  7.8776ms  sgemm_sm35_ldg_nt_64x16x64x16x16
  2.22%  510.89ms        64  7.9827ms  6.9781ms  9.3995ms  decorGrad_kernel3(float const *, float*, unsigned int, unsigned int)
  1.94%  447.06ms        64  6.9854ms  6.2167ms  8.0100ms  sgemm_sm35_ldg_nn_64x16x64x16x16

The average time of the affected sgemm call (192 calls in both runs) went from 5.1103ms to 107.83ms.
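To put those two profiler rows in proportion, the arithmetic can be sketched as follows. The helper names `slowdown` and `total_ms` are hypothetical; the inputs are the figures from the nvprof output above.

```cpp
#include <cassert>

// Ratio of the new average call time to the old one.
double slowdown(double old_avg_ms, double new_avg_ms) {
    assert(old_avg_ms > 0.0);
    return new_avg_ms / old_avg_ms;
}

// Total time implied by an nvprof row: number of calls times average time.
double total_ms(int calls, double avg_ms) {
    return calls * avg_ms;
}
```

Here `slowdown(5.1103, 107.83)` comes out at roughly 21, and `total_ms(192, 107.83)` at roughly 20703 ms, which agrees with the 20.7032s total that nvprof reports for the release build.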

OS: Ubuntu 16.04 (4.4.0-38-generic, x86_64 GNU/Linux)
GPU: GTX TITAN (GK110)
NVIDIA Driver Version: 370.28

njuffa · October 20, 2016, 6:19pm

@Gogar: Interesting. Have you filed a bug with NVIDIA for this?

andrea_nes · October 21, 2016, 8:37am

Hi,

Thanks so much for your comments and advice. I confirm that none of the suggested causes applies to the computer used for testing (no other processes running, no licensing issues, GPU not disabled). I will, though, follow up using nvidia-smi and the profiler, and see what I can make out.

Gogar: thanks for the SGEMM tip, but I am not making any such calls, and neither is the underlying Thrust code (I am using Thrust for some sorting). I also confirm that initially, after migrating to CUDA 8.0, the performance was consistent with what was expected, and only later did I experience a drop in performance.

I will investigate further and post back anything I can find.

Thanks for your tips and your time,

Best Regards,

Andrea
