I am developing a GPU Monte Carlo photon simulator on a testing machine. The machine runs Ubuntu 14.04.3 with three NVIDIA graphics cards: a GTX 980Ti (Maxwell), a GTX 590 (Fermi) and a GT 730 (Kepler). The driver is 352.63, and I installed CUDA 7.5.6. The Linux kernel is 3.13.0-57-generic.
My code has been running about 4x faster on the 980Ti than on one GPU (core) of the dual-GPU 590, and about 16x faster than on the 730 (13000 vs. 3000 vs. 800 photon/ms). However, a couple of times the 980Ti's simulation speed has dropped about 10-fold for no apparent reason. The speeds of the 590 and 730 are not affected. When this happens and I run the NVIDIA Visual Profiler (nvvp) on the 980Ti, some of the tests take a long time to complete, and some simply return an error:
“Insufficient Kernel Bounds Data: The data needed to calculate … could not be collected”
All nvvp tests pass without problems when the card runs at full speed.
Previously, I was able to get the full speed back by rebooting my computer. However, the most recent occurrence of this issue was not solved by rebooting. I even removed the 730 to make sure the remaining cards have access to more power, but nothing changed.
Here are my questions:
- Is there a way I can “reset” the 980Ti in case it got stuck in a strange state?
- How can I tell whether the 980Ti is malfunctioning? I ran nvidia-smi and the output is attached below. Does anything look wrong?
My code is open-source and can be found on GitHub (fangq/mcx: Monte Carlo eXtreme (MCX), a GPU-accelerated photon transport simulator). It can also be checked out with
svn checkout https://svn.code.sf.net/p/mcx/svn/mcextreme_cuda/trunk/ mcx
To build it, go to mcx/src and type “make”; then cd to mcx/example/quicktest and run the script run_qtest.sh. The speed is printed near the end of the log.
If anyone has a 980Ti, can you let me know what speed you are getting?
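For anyone comparing numbers, here is a minimal Python sketch of how the speed could be pulled out of the log and compared against my full-speed figure. It assumes only that the log reports the speed as a number followed by `photon/ms`; the sample log text, the 1300 value, and the helper names `extract_speed` and `slowdown` are hypothetical and used for illustration only.

```python
import re

# Assumption: the log reports speed as "<number> photon/ms"; the exact
# wording of that line is not relied on, only the unit is matched.
SPEED_RE = re.compile(r"([0-9]+(?:\.[0-9]+)?)\s*photon/ms")

def extract_speed(log_text):
    """Return the last speed (photon/ms) found in the log, or None."""
    matches = SPEED_RE.findall(log_text)
    return float(matches[-1]) if matches else None

def slowdown(reference, measured):
    """Return how many times slower `measured` is than `reference`."""
    return reference / measured

# Hypothetical log tail, for illustration only (not real MCX output).
sample_log = "simulation done\nspeed: 1300.00 photon/ms\n"

speed = extract_speed(sample_log)
# 13000 photon/ms is the full-speed figure I see on the 980Ti.
print(speed, slowdown(13000.0, speed))
```

In practice the text passed to `extract_speed` would be the captured output of run_qtest.sh rather than the sample string.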
~$ nvidia-smi
Tue Feb 16 18:39:24 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 590     Off  | 0000:03:00.0     N/A |                  N/A |
|  0%   70C    P0    N/A /  N/A |    170MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 590     Off  | 0000:04:00.0     N/A |                  N/A |
| 46%   50C   P12    N/A /  N/A |      5MiB /  1535MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 980 Ti  Off  | 0000:05:00.0     Off |                  N/A |
| 21%   68C    P2   140W / 250W |    170MiB /  6143MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
|    1                  Not Supported                                         |
|    2      8535    C   /home/fangq/space/git/Project/mcx/bin/mcx      148MiB |
+-----------------------------------------------------------------------------+
+-----------------------------------------------------------------------------+
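To help tell a throttled card from a malfunctioning one, I could also log the performance state and SM clock while the simulation runs. Below is a minimal Python sketch of that idea, not a verified diagnostic. It assumes `nvidia-smi --query-gpu` on this driver reports these fields for the 980 Ti (GeForce cards often return `[Not Supported]` for some of them). The helper names are hypothetical, and the clock values in the sample text are invented for illustration.

```python
import subprocess

# Standard --query-gpu properties; GeForce cards may report some of
# them as "[Not Supported]" (assumption: they are available here).
FIELDS = ["index", "name", "pstate", "clocks.sm", "clocks.max.sm",
          "temperature.gpu", "power.draw"]

def query_gpus():
    """Run nvidia-smi and return its CSV output as text (not called below)."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd).decode()

def parse_rows(csv_text):
    """Turn each CSV line into a dict keyed by field name."""
    rows = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows

def clock_fraction(row):
    """Current SM clock as a fraction of the maximum, or None if unavailable."""
    try:
        return float(row["clocks.sm"]) / float(row["clocks.max.sm"])
    except ValueError:
        return None  # field was "[Not Supported]" or otherwise non-numeric

# Hypothetical sample output: the clock values are invented for illustration.
sample = """0, GeForce GTX 590, P0, [Not Supported], [Not Supported], 70, [Not Supported]
2, GeForce GTX 980 Ti, P2, 600, 1200, 68, 140.00
"""

for row in parse_rows(sample):
    print(row["index"], row["name"], row["pstate"], clock_fraction(row))
```

On the real machine, `parse_rows(query_gpus())` would replace the sample text. A clock fraction well below 1 during a slow run would point toward throttling rather than a hardware fault, though that interpretation is itself an assumption.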