This is the first time I’ve actually had to post on a forum, so please excuse me if this isn’t the right place for this issue.
I’ve been using TensorFlow for deep learning without a problem for the last few weeks. A few days ago Python quit unexpectedly while training a net. Since that failure I haven’t been able to use the GPU, except for extremely small programs that require very little memory.
I suspect something on the GPU may have been damaged. Here’s the most basic demonstration of how it’s failing:
- Run cuda-8.0/samples/0_Simple/vectorAdd without any problems
- Run cuda-8.0/samples/0_Simple/matrixMul and get the following error:
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "TITAN X (Pascal)" with compute capability 6.1
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Failed to synchronize on the stop event (error code an illegal memory access was encountered)!
This is what nvidia-smi shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 310             Off  | 0000:02:00.0     N/A |                  N/A |
| 30%   49C    P0    N/A /  N/A |    173MiB /   956MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:0B:00.0     Off |                  N/A |
| 24%   44C    P8    11W / 250W |      1MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+
This is what ps aux | grep nvidia* shows:
root 2889 0.1 0.0 0 0 ? S 15:42 0:00 [irq/44-nvidia]
root 2890 0.0 0.0 0 0 ? S 15:42 0:00 [nvidia]
root 2891 0.1 0.0 0 0 ? S 15:42 0:00 [irq/45-nvidia]
root 2892 0.0 0.0 0 0 ? S 15:42 0:00 [nvidia]
manuel 3699 0.0 0.0 11768 2148 pts/7 S+ 15:43 0:00 grep --color=auto nvidia*
And finally, this is what nvidia-smi -q --id=1 shows:
==============NVSMI LOG==============
Timestamp : Fri Apr 21 15:45:36 2017
Driver Version : 375.26
Attached GPUs : 2
GPU 0000:0B:00.0
Product Name : TITAN X (Pascal)
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324416080938
GPU UUID : GPU-88604023-7765-cc1f-a071-5791997681d5
Minor Number : 1
VBIOS Version : 86.02.15.00.01
MultiGPU Board : No
Board ID : 0xb00
GPU Part Number : 900-1G611-2500-000
Inforom Version
Image Version : G001.0000.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x0B
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0010DE
Bus Id : 0000:0B:00.0
Sub System Id : 0x119A10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 23 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 12189 MiB
Used : 1 MiB
Free : 12188 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 39 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
Power Readings
Power Management : Supported
Power Draw : 10.46 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Default Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5005 MHz
Video : 1708 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes
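In case it helps anyone compare against their own machine, here is a minimal Python sketch that pulls the PCIe link and memory numbers out of nvidia-smi in CSV form instead of the long log above. The helper names (parse_gpu_csv, degraded_links, query_gpus) are my own and purely illustrative. I am only assuming that the --query-gpu fields listed in FIELDS are available on this driver version. It flags GPUs whose current link width is below the maximum, as with the 4x versus 16x reported above. I don't know whether that is related to the failure: the PCIe generation normally drops at idle to save power, and the width may simply reflect the slot the card sits in.

```python
import subprocess

# Fields passed to --query-gpu; their order fixes the CSV column order.
FIELDS = [
    "index",
    "name",
    "pcie.link.gen.current",
    "pcie.link.gen.max",
    "pcie.link.width.current",
    "pcie.link.width.max",
    "memory.used",
    "memory.total",
]


def parse_gpu_csv(text):
    """Turn headerless nvidia-smi CSV output into a list of dicts keyed by FIELDS."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(FIELDS):
            continue  # skip lines that do not match the expected column count
        rows.append(dict(zip(FIELDS, values)))
    return rows


def degraded_links(rows):
    """Return the GPUs whose current PCIe link width is below the maximum."""
    flagged = []
    for row in rows:
        try:
            current = int(row["pcie.link.width.current"])
            maximum = int(row["pcie.link.width.max"])
        except ValueError:
            continue  # non-numeric value, e.g. a field the card does not report
        if current < maximum:
            flagged.append(row)
    return flagged


def query_gpus():
    """Run nvidia-smi and parse its output (requires the NVIDIA driver)."""
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(FIELDS),
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_csv(out)
```

Calling degraded_links(query_gpus()) on the affected machine should list the TITAN X if the link is still negotiated at 4x. Running it again while the GPU is under load would show whether the link generation comes back up from 1.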
Possibly related: when I created /etc/X11/xorg.conf to ensure that only the first GPU was used as a graphics card, the screen became slow and almost unresponsive. I take this to mean that the 1MiB the second GPU is using is for graphics, so perhaps there is some sort of memory leak?
Any help would be greatly appreciated!