Maxwell suddenly becomes 10x slower

I am developing a GPU Monte Carlo photon simulator on a testing machine. The machine runs Ubuntu 14.04.3 with three NVIDIA graphics cards: a GTX 980Ti (Maxwell), a GTX 590 (Fermi) and a GTX 730 (Kepler). The driver is 352.63, and I installed CUDA 7.5.6. The Linux kernel is 3.13.0-57-generic.

My code has been running 4x faster on the Maxwell than on one GPU of the 590, and 16x faster than on the 730 (13000 vs. 3000 vs. 800 photon/ms). However, on a couple of occasions the 980Ti’s simulation speed dropped 10-fold for no apparent reason. The speeds of the 590/730 are not affected. When this happens and I run the NVIDIA Visual Profiler (nvvp) on the 980Ti, some of the tests take a long time to complete, and some simply return an error:

“Insufficient Kernel Bounds Data: The data needed to calculate … could not be collected”

All nvvp tests pass nicely when the card runs at full speed.

Previously, I was able to get the full speed back by rebooting my computer. However, the most recent occurrence of this issue could not be solved by rebooting. I even removed the 730 to make sure the remaining cards have access to more power, but nothing changed.

Here are my questions:

  • Is there a way I can “reset” the 980Ti in case it is stuck in a strange state?

  • How do I know the 980Ti is not malfunctioning? I ran nvidia-smi and attached the output below; do you see anything wrong?

My code is open source and can be found on GitHub (fangq/mcx: Monte Carlo eXtreme (MCX) - GPU-accelerated photon transport simulator), and it can be checked out with:

svn checkout https://svn.code.sf.net/p/mcx/svn/mcextreme_cuda/trunk/ mcx

To test it, simply go to mcx/src and type “make”, then cd to mcx/example/quicktest and run the script run_qtest.sh; the speed is printed near the end of the log.
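For reference, the full sequence (assuming the checkout above landed in ./mcx) looks like this:

cd mcx/src && make            # build the mcx binary
cd ../example/quicktest
./run_qtest.sh                # the speed (photon/ms) is printed near the end of the log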

If anyone has a 980Ti, can you let me know what speed you are getting?

~$ nvidia-smi
Tue Feb 16 18:39:24 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.63     Driver Version: 352.63         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   70C    P0    N/A /  N/A |    170MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 46%   50C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
| 21%   68C    P2   140W / 250W |    170MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2      8535    C   /home/fangq/space/git/Project/mcx/bin/mcx      148MiB |
+-----------------------------------------------------------------------------+

CUDA 7.5.6 is an RC version of CUDA. There is no reason that I can think of to use that at this point. Use CUDA 7.5.18 (nvcc will report 7.5.17). I would also suggest that you install the 352.79 driver, or else the 361.28 driver.

Also, your GTX 980 Ti is running in the P2 state with a compute process running. You can try changing that state by adjusting the application clocks with nvidia-smi. I don’t think this should account for a 10x perf issue, however.
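As a sketch (the GPU index and the exact clock values are assumptions; check what your board actually reports as supported):

nvidia-smi -i 2 -q -d SUPPORTED_CLOCKS        # list the memory,graphics clock pairs the board supports
sudo nvidia-smi -i 2 -ac <memMHz>,<gfxMHz>    # request one of those pairs as the application clocks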

https://devtalk.nvidia.com/default/topic/892842/cuda-programming-and-performance/one-weird-trick-to-get-a-maxwell-v2-gpu-to-reach-its-max-memory-clock-/

The full output from nvidia-smi -a may be more useful for analysis of performance than just nvidia-smi.

I’ve had my GPUs go into a low-clock state but only after repeatedly subjecting them to crashing kernels and many many Nsight debugging sessions.

If you have a utility like GPU-Z, you can see the GPU/MEM clocks are very low and do not increase.
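On Linux, roughly the same information can be watched from the command line (the query field names are my assumption of what this driver generation exposes):

nvidia-smi -i 2 --query-gpu=pstate,clocks.sm,clocks.mem --format=csv -l 1    # print the P-state and SM/memory clocks once per second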

Once a card is in that state it seems to stay there, so I would also be mildly interested in a command-line reset incantation.

I’m going to guess that the driver puts the card into a low-clock state by design.

A reboot always solves the problem. :)

I don’t think I’ve witnessed that, but if I did, I would try either:

sudo rmmod nvidia

or

nvidia-smi -i x -r

where x is the GPU ID as listed by nvidia-smi

(I was thinking Linux here. I think OP’s situation is Linux.)
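For completeness, the rmmod route usually amounts to something like the following (my assumption of the typical sequence; it requires that X and every process using any NVIDIA GPU be stopped first):

sudo service lightdm stop         # stop the display manager / X server (lightdm assumed on Ubuntu 14.04)
sudo rmmod nvidia_uvm nvidia      # unload the driver modules
sudo modprobe nvidia              # reload the driver; nvidia_uvm comes back when the next CUDA app starts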

Cool, I will try next time.

I’ll repeat that it’s rare and I’ve only seen it after I’ve thoroughly abused the card and driver including surviving WDDM TDRs.

My experience is that the CUDA driver can recover from WDDM TDRs, but not an unlimited number of times. I suspect that each WDDM TDR recovery leaks some hardware resource somewhere until the resource pool is used up, at which point really weird stuff happens and it is time to power cycle the system.

At least that has been my observation in the past; I have no deeper insight into the TDR recovery process. As I recall, different driver generations showed different amounts of resilience to repeated WDDM TDR events.

Thank you all for your prompt and helpful comments. Over the past few days, I’ve been trying different things, following your suggestions.

First, I upgraded my CUDA toolkit and NVIDIA driver to the latest versions (7.5.18 and 352.79). Unfortunately, this did not change the low-performance issue on the Maxwell.

I also tried the “nvidia-smi -a” command to investigate the P2 state issue. I found that the Maxwell card was indeed running at a lower clock (1303 MHz versus a max of 1493 MHz; should I look at the Application Clocks or just Clocks?). My nvidia-smi logs are documented here:

Maxwell GPU may get locked in P2 state when running mcx · Issue #18 · fangq/mcx · GitHub
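For comparing those two sections side by side, one can dump just the clock portion of the report, e.g.:

nvidia-smi -i 2 -q -d CLOCK    # my understanding: “Clocks” = the current clocks, “Applications Clocks” = the requested clocks for compute work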

I tried the following command:

sudo nvidia-smi -i 2 -ac 3505,1493

and reran my simulation. nvidia-smi did show that the Maxwell recovered to the P0 state. However, the simulation speed did not change :( and after rebooting my computer, the Maxwell returned to the P2 state.
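If the setting simply does not survive a reboot, I suppose I could re-apply it after each boot with something like the following (my assumption; persistence mode keeps the driver state loaded between runs):

sudo nvidia-smi -i 2 -pm 1           # enable persistence mode on the 980Ti
sudo nvidia-smi -i 2 -ac 3505,1493   # re-request the maximum application clocks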

I documented some of my notes here

Maxwell GPU may get locked in P2 state when running mcx · Issue #18 · fangq/mcx · GitHub

Is anyone willing to take a look at the nvidia-smi -a log and let me know if anything else looks suspicious? Also, when I run my simulation on the 590 GPUs, nvidia-smi -a could not print any of the clock values (the clocks are all N/A).

Also, would anyone be able to run my test script (detailed in the original post) and let me know the simulation speed (photon/ms) on your Maxwell card?

thanks

I also tried this command, but I received the following error:

fangq@wazu$ sudo nvidia-smi -i 2 -r

Unable to reset this GPU because it’s being used by some other process (e.g. CUDA application, graphics application like X server, monitoring application like other instance of nvidia-smi). Please first kill all processes using this GPU and all compute applications running in the system (even when they are running on other GPUs) and then try to reset the GPU again.
Terminating early due to previous errors.

The 980Ti is not connected to a monitor, and I do not have anything running on it.

Just to make another comment: I was able to run the “nvidia-smi -i 2 -r” command without X running (in recovery mode). nvidia-smi did confirm that the GPU device was successfully reset. However, when I booted my machine again, the running speed of my simulator remained low, and nvidia-smi still shows the P2 state for the Maxwell.

Are there any other ways to find out what is wrong with this card?

I don’t think I have abused the card with my simulations. I can only think of an occasional “Ctrl+C” after launching something that I did not want to run. I didn’t know whether that could hurt the card.

You certainly did not abuse the GPU by using ctrl-c.

I am familiar with your code, and I think that since you wrote MCX on Fermi/Kepler GPUs, your implementation maps better to those architectures than to Maxwell.
I also wrote a GPU-based Monte Carlo simulation for optical photons in turbid media (with fluorescence and complex shapes) specifically targeting Maxwell, and was able to get a nice speedup over MCX. In general it helps not to use C++ classes (break things down into primitive CUDA aligned types like int4, etc.). Also, MCX has more branch divergence, which I worked around in our implementation. I also used the 16-bit ‘half’ type for some values (CUDA 7.5) and cuRAND for random number generation, which is slightly slower than your approach in MCX.

In general MCX is a nice simulation, but some updating will be needed to fully utilize the capabilities of the Maxwell generation.

Here is another update. I borrowed a GTX 980 (not a 980Ti) from my colleague and was able to run the latest mcx. The benchmark reported a speed of 14306 photon/ms. In comparison, my currently broken 980Ti only returns 1250 photon/ms in the P2 state. It looks like my 980Ti has mysteriously become 11 times slower than a 980!

I also opened NVIDIA X Server Settings, clicked on the PowerMizer page for the 980Ti, and watched it during the execution of my code. A screen capture of the PowerMizer page is attached. When the code is running, the performance level moves to 2 and stays there until the end of the simulation. It looks like the card is indeed not running at its maximum speed.

I changed the Preferred mode from Auto to Maximum Performance, but the simulation speed remains slow.
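If it helps anyone reproduce this, the command-line equivalent of that PowerMizer switch should be something along these lines (the attribute name and value are my assumptions):

nvidia-settings -a "[gpu:2]/GPUPowerMizerMode=1"    # 1 = Prefer Maximum Performance (assumed)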

Glad that you chimed in. I am also looking to learn more about the optimizations you’ve made over my code.

Regarding your comment above, though, I am not sure it remains true for the latest version of mcx.

Over the past year, significant changes have been made to mcx. A new precise ray-grid ray-tracing algorithm was implemented, and the simulation results are now significantly more accurate for heterogeneous media. During this process, I made an interesting observation: the new ray-tracer slows down the code by 20-30% on Fermi/Kepler, but somehow, combined with all the other changes, it made the code 20% faster on a Maxwell!

With the old mcx, my 980Ti was only about 2.5x faster than a Fermi (one GPU of the 590), but with the new code it became 5-6x faster (before it broke, of course)!

I am very curious: would it be possible for you to check out the latest mcx and run the benchmark under mcx/example/quicktest/? In an earlier post, I reported 14300 photon/ms on a 980, and I expect >=16000 photon/ms for a functional 980Ti. With your modified mcx, were you able to exceed this mark for the same problem?

The latest mcx code can be checked out with the command below (the GitHub version is not new enough):

svn checkout https://svn.code.sf.net/p/mcx/svn/mcextreme_cuda/trunk/ mcx

Many of these sound interesting, and I am looking forward to collaborating. I will follow up with PMs.

Thanks! I have not tried this latest version, and I hope to get it going over the weekend (deep into sparse matrices ATM).
The simulation implementation I developed is a bit more specialized than MCX, targeting specific types of simulations used by our physicists. It is not as versatile, but it is very fast for these particular simulations.

It is very positive that you are actively updating the code. I am not sure why you are having issues with the GTX 980 Ti.
We have been using both the Kepler Titan and a number of Maxwell GTX 980 GPUs for our simulations, so I cannot duplicate your GTX 980 Ti tests, but I will test on a GTX Titan X with a slower clock of 1.2 GHz.

The slowdown may be due to UVM.
Try setting CUDA_MANAGED_FORCE_DEVICE_ALLOC=1, or set CUDA_VISIBLE_DEVICES so that only the 980Ti is visible.
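For example (the device index and the mcx path are assumptions; note that CUDA’s device numbering can differ from nvidia-smi’s unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set):

CUDA_MANAGED_FORCE_DEVICE_ALLOC=1 ./bin/mcx <your usual arguments>    # force managed allocations to be device-resident
CUDA_VISIBLE_DEVICES=2 ./bin/mcx <your usual arguments>               # expose only the 980Ti to the CUDA runtime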

I tried both; the settings seemed to be effective (mcx -L only lists the selected GPU), but the slow speed remained the same.

Perhaps it is time to call NVIDIA customer support and exercise the warranty …

One more thing to try: move the card to a different slot. Right now it is running at x8 Gen2; if you are moving a lot of data, that may slow down your computations.
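A quick way to confirm the negotiated link without opening the case (assuming these query fields are supported by your driver) is:

nvidia-smi -i 2 --query-gpu=pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv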