Maxwell suddenly becomes 10x slower

I am developing a GPU Monte Carlo photon simulator on a testing machine. The machine runs Ubuntu 14.04.3 with three NVIDIA graphics cards: a GTX 980Ti (Maxwell), a GTX 590 (Fermi) and a GTX 730 (Kepler). The driver is 352.63, and I installed CUDA 7.5.6. The Linux kernel is 3.13.0-57-generic.

My code has been running 4x faster on the Maxwell than on one GPU of the 590, and 16x faster than on the 730 (13000 vs. 3000 vs. 800 photon/ms). However, on a couple of occasions the 980Ti’s simulation speed dropped 10-fold for no apparent reason. The speeds of the 590/730 are not affected. When this happens and I run the NVIDIA Visual Profiler (nvvp) on the 980Ti, some of the tests take a long time to complete, and some simply return an error:

“Insufficient Kernel Bounds Data: The data needed to calculate … could not be collected”

All nvvp tests pass nicely when the card runs at full speed.

Previously, I was able to get the full speed back by rebooting my computer. However, the most recent occurrence of this issue could not be solved by rebooting. I even removed the 730 to make sure the remaining cards have access to more power, but nothing changed.

Here are my questions:

  • Is there a way I can “reset” the 980Ti in case it is stuck in a strange state?

  • How do I know the 980Ti is not malfunctioning? I ran nvidia-smi and attached the output below; do you see anything wrong?

My code is open source and can be found on GitHub (fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator), and it can be checked out with:

svn checkout https://svn.code.sf.net/p/mcx/svn/mcextreme_cuda/trunk/ mcx

To test it, simply go to mcx/src and type “make”, then cd to mcx/example/quicktest and run the script run_qtest.sh; the speed is printed near the end of the log.
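For reference, the full sequence (assuming the checkout above landed in ./mcx) looks like this:

cd mcx/src && make            # build the mcx binary
cd ../example/quicktest
./run_qtest.sh                # the speed (photon/ms) is printed near the end of the log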

If anyone has a 980Ti, can you let me know what speed you are getting?

~$ nvidia-smi
Tue Feb 16 18:39:24 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   70C    P0    N/A /  N/A |    170MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 46%   50C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
| 21%   68C    P2   140W / 250W |    170MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2      8535    C   /home/fangq/space/git/Project/mcx/bin/mcx      148MiB |
+-----------------------------------------------------------------------------+

CUDA 7.5.6 is an RC version of CUDA. There is no reason that I can think of to use that at this point. Use CUDA 7.5.18 (nvcc will report 7.5.17). I would also suggest that you install the 352.79 driver, or else the 361.28 driver.

Also, your GTX 980 Ti is running in the P2 state with a compute process running. You can try changing that state by adjusting the application clocks with nvidia-smi. I don’t think this should account for a 10x perf issue, however.
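As a sketch (the GPU index and the exact clock values are assumptions; check what your board actually reports as supported):

nvidia-smi -i 2 -q -d SUPPORTED_CLOCKS        # list the memory,graphics clock pairs the board supports
sudo nvidia-smi -i 2 -ac <memMHz>,<gfxMHz>    # request one of those pairs as the application clocks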

https://devtalk.nvidia.com/default/topic/892842/cuda-programming-and-performance/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/

The full output from nvidia-smi -a may be more useful for analysis of performance than just nvidia-smi.

I’ve had my GPUs go into a low-clock state but only after repeatedly subjecting them to crashing kernels and many many Nsight debugging sessions.

If you have a utility like GPU-Z, you can see the GPU/MEM clocks are very low and do not increase.
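On Linux, roughly the same information can be watched from the command line (the query field names are my assumption of what this driver generation exposes):

nvidia-smi -i 2 --query-gpu=pstate,clocks.sm,clocks.mem --format=csv -l 1    # print the P-state and SM/memory clocks once per second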

Once a card is in that state it seems to stay there, so I would also be mildly interested in a command-line reset incantation.

I’m going to guess that the driver puts the card into a low-clock state by design.

A reboot always solves the problem. :)

I don’t think I’ve witnessed that, but if I did, I would try either:

sudo rmmod nvidia

or

nvidia-smi -i x -r

where x is the GPU ID as listed by nvidia-smi

(I was thinking Linux here. I think OP’s situation is Linux.)
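For completeness, the rmmod route usually amounts to something like the following (my assumption of the typical sequence; it requires that X and every process using any NVIDIA GPU be stopped first):

sudo service lightdm stop         # stop the display manager / X server (lightdm assumed on Ubuntu 14.04)
sudo rmmod nvidia_uvm nvidia      # unload the driver modules
sudo modprobe nvidia              # reload the driver; nvidia_uvm comes back when the next CUDA app starts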

Cool, I will try next time.

I’ll repeat that it’s rare and I’ve only seen it after I’ve thoroughly abused the card and driver including surviving WDDM TDRs.

My experience is that the CUDA driver can recover from WDDM TDRs, but not an unlimited number of times. I suspect that each WDDM TDR recovery leaks some hardware resource somewhere until the resource pool is used up, at which point really weird stuff happens and it is time to power cycle the system.

At least that has been my observation in the past; I have no deeper insight into the TDR recovery process. As I recall, different driver generations showed different amounts of resilience to repeated WDDM TDR events.

Thank you all for your prompt and helpful comments. Over the past few days, I’ve been trying different things, following your suggestions.

First, I upgraded my CUDA toolkit and NVIDIA driver to the latest versions (7.5.18 and 352.79). Unfortunately, this did not change the low-performance issue on the Maxwell.

I also tried the “nvidia-smi -a” command to investigate the P2 state issue. I found that the Maxwell card was indeed running at a lower clock (1303 MHz versus a max of 1493 MHz; should I look at the Application Clocks or just Clocks?). My nvidia-smi logs are documented here:

Maxwell GPU may get locked in P2 state when running mcx · Issue #18 · fangq/mcx · GitHub
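For comparing those two sections side by side, one can dump just the clock portion of the report, e.g.:

nvidia-smi -i 2 -q -d CLOCK    # my understanding: “Clocks” = the current clocks, “Applications Clocks” = the requested clocks for compute work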

I tried the following command:

sudo nvidia-smi -i 2 -ac 3505,1493

and reran my simulation. nvidia-smi did show that the Maxwell recovered to the P0 state. However, the simulation speed did not change :( and after rebooting my computer, the Maxwell returned to the P2 state.
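If the setting simply does not survive a reboot, I suppose I could re-apply it after each boot with something like the following (my assumption; persistence mode keeps the driver state loaded between runs):

sudo nvidia-smi -i 2 -pm 1           # enable persistence mode on the 980Ti
sudo nvidia-smi -i 2 -ac 3505,1493   # re-request the maximum application clocks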

I documented some of my notes here

Maxwell GPU may get locked in P2 state when running mcx · Issue #18 · fangq/mcx · GitHub

Is anyone willing to take a look at the nvidia-smi -a log and let me know if anything else looks suspicious? Also, when I run my simulation on the 590 GPUs, nvidia-smi -a could not print any of the clock values (the clocks are all N/A).

Also, would anyone be able to run my test script (detailed in the original post) and let me know the simulation speed (photon/ms) on your Maxwell card?

thanks

I also tried this command, but I received the following error:

fangq@wazu$ sudo nvidia-smi -i 2 -r

Unable to reset this GPU because it’s being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.

The 980Ti is not connected to a monitor, and I do not have anything running on it.

Just to make another comment: I was able to run the “nvidia-smi -i 2 -r” command without X running (in recovery mode). nvidia-smi did confirm that the GPU device was successfully reset. However, when I booted my machine again, the running speed of my simulator remained low, and nvidia-smi still shows the P2 state for the Maxwell.

Are there any other ways to find out what is wrong with this card?

I don’t think I have abused the card with my simulations. I can only think of an occasional “Ctrl+C” after launching something that I did not want to run. I didn’t know whether that could hurt the card.

You certainly did not abuse the GPU by using ctrl-c.

I am familiar with your code, and I think that since you wrote MCX on Fermi/Kepler GPUs, your implementation maps better to those architectures than to Maxwell.
I also wrote a GPU-based Monte Carlo simulation for optical photons in turbid media (with fluorescence and complex shapes) specifically targeting Maxwell, and was able to get a nice speedup over MCX. In general it helps not to use C++ classes (break things down into primitive CUDA aligned types like int4, etc.). Also, MCX has more branch divergence, which I worked around in our implementation. I also used the 16-bit ‘half’ type for some values (CUDA 7.5) and cuRAND for random number generation, which is slightly slower than your approach in MCX.

In general MCX is a nice simulation, but some updating will be needed to fully utilize the capabilities of the Maxwell generation.

Here is another update. I borrowed a GTX 980 (not a 980Ti) from my colleague and was able to run the latest mcx. The benchmark reported a speed of 14306 photon/ms. In comparison, my currently broken 980Ti only returns 1250 photon/ms in the P2 state. It looks like my 980Ti has mysteriously become 11 times slower than a 980!

I also opened NVIDIA X Server Settings, clicked on the PowerMizer page for the 980Ti, and watched it during the execution of my code. A screen capture of the PowerMizer page is attached. When the code is running, the performance level moves to 2 and stays there until the end of the simulation. It looks like the card is indeed not running at its maximum speed.

I changed the Preferred mode from Auto to Maximum Performance, but the simulation speed remains slow.
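If it helps anyone reproduce this, the command-line equivalent of that PowerMizer switch should be something along these lines (the attribute name and value are my assumptions):

nvidia-settings -a "[gpu:2]/GPUPowerMizerMode=1"    # 1 = Prefer Maximum Performance (assumed)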

Glad that you chimed in. I am also looking to learn more about the optimizations you’ve made over my code.

Regarding your comment above, though, I am not sure it remains true for the latest version of mcx.

Over the past year, significant changes have been made to mcx. A new precise ray-grid ray-tracing algorithm was implemented, and the simulation results are now significantly more accurate for heterogeneous media. During this process, I made an interesting observation: the new ray-tracer slows down the code by 20-30% on Fermi/Kepler, but somehow, combined with all the other changes, it made the code 20% faster on a Maxwell!

With the old mcx, my 980Ti was only about 2.5x faster than a Fermi (one GPU of the 590), but with the new code it became 5-6x faster (before it broke, of course)!

I am very curious: would it be possible for you to check out the latest mcx and run the benchmark under mcx/example/quicktest/? In an earlier post, I reported 14300 photon/ms on a 980, and I expect >=16000 photon/ms for a functional 980Ti. With your modified mcx, were you able to exceed this mark for the same problem?

The latest mcx code can be checked out with the command below (the GitHub version is not new enough):

svn checkout https://svn.code.sf.net/p/mcx/svn/mcextreme_cuda/trunk/ mcx

Many of these sound interesting, and I am looking forward to collaborating. I will follow up with PMs.

Thanks! I have not tried this latest version, and I hope to get it going over the weekend (deep into sparse matrices ATM).
The simulation implementation I developed is a bit more specialized than MCX, targeting specific types of simulations used by our physicists. It is not as versatile, but it is very fast for these particular simulations.

It is very positive that you are actively updating the code. I am not sure why you are having issues with the GTX 980 Ti.
We have been using both the Kepler Titan and a number of Maxwell GTX 980 GPUs for our simulations, so I cannot duplicate your GTX 980 Ti tests, but I will test on a GTX Titan X with a slower clock of 1.2 GHz.

The slowdown may be due to UVM.
Try setting CUDA_MANAGED_FORCE_DEVICE_ALLOC=1, or set CUDA_VISIBLE_DEVICES so that only the 980Ti is visible.
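For example (the device index and the mcx path are assumptions; note that CUDA’s device numbering can differ from nvidia-smi’s unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set):

CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 ./bin/mcx <your usual arguments>    # force managed allocations to be device-resident
CUDA_VISIBLE_DEVICES=2 ./bin/mcx <your usual arguments>               # expose only the 980Ti to the CUDA runtime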

I tried both; the settings seemed to be effective (mcx -L only lists the selected GPU), but the slow speed remained the same.

Perhaps it is time to call NVIDIA customer support and exercise the warranty …

One more thing to try: move the card to a different slot. Right now it is running at x8 Gen2; if you are moving a lot of data, that may slow down your computations.
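A quick way to confirm the negotiated link without opening the case (assuming these query fields are supported by your driver) is:

nvidia-smi -i 2 --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv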