Did TensorFlow cause a GPU memory crash?

This is my first time posting on a forum, so please excuse me if this isn’t the right place for this issue.

I’ve been using TensorFlow for deep learning without a problem for the last few weeks. A few days ago Python quit unexpectedly while training a net. Since that failure I haven’t been able to use the GPU, except for extremely small programs that require very little memory.

I suspect something on the GPU was damaged. Here’s the most basic demonstration of the failure:

  1. Run cuda-8.0/samples/0_Simple/vectorAdd without any problems
  2. Run cuda-8.0/samples/0_Simple/matrixMul and get the following errors:

[Matrix Multiply Using CUDA] - Starting…
GPU Device 0: “TITAN X (Pascal)” with compute capability 6.1

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Failed to synchronize on the stop event (error code an illegal memory access was encountered)!

This is what nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVS 310             Off  | 0000:02:00.0     N/A |                  N/A |
| 30%   49C    P0    N/A /  N/A |    173MiB /   956MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN X (Pascal)    Off  | 0000:0B:00.0     Off |                  N/A |
| 24%   44C    P8    11W / 250W |      1MiB / 12189MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0                  Not Supported                                         |
+-----------------------------------------------------------------------------+

This is what ps aux | grep nvidia* shows:

root 2889 0.1 0.0 0 0 ? S 15:42 0:00 [irq/44-nvidia]
root 2890 0.0 0.0 0 0 ? S 15:42 0:00 [nvidia]
root 2891 0.1 0.0 0 0 ? S 15:42 0:00 [irq/45-nvidia]
root 2892 0.0 0.0 0 0 ? S 15:42 0:00 [nvidia]
manuel 3699 0.0 0.0 11768 2148 pts/7 S+ 15:43 0:00 grep --color=auto nvidia*

and finally this is what nvidia-smi -q --id=1 shows:

==============NVSMI LOG==============

Timestamp : Fri Apr 21 15:45:36 2017
Driver Version : 375.26

Attached GPUs : 2
GPU 0000:0B:00.0
Product Name : TITAN X (Pascal)
Product Brand : GeForce
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0324416080938
GPU UUID : GPU-88604023-7765-cc1f-a071-5791997681d5
Minor Number : 1
VBIOS Version : 86.02.15.00.01
MultiGPU Board : No
Board ID : 0xb00
GPU Part Number : 900-1G611-2500-000
Inforom Version
Image Version : G001.0000.01.03
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU Virtualization Mode
Virtualization mode : None
PCI
Bus : 0x0B
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0010DE
Bus Id : 0000:0B:00.0
Sub System Id : 0x119A10DE
GPU Link Info
PCIe Generation
Max : 2
Current : 1
Link Width
Max : 16x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 23 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Sync Boost : Not Active
Unknown : Not Active
FB Memory Usage
Total : 12189 MiB
Used : 1 MiB
Free : 12188 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending : N/A
Temperature
GPU Current Temp : 39 C
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
Power Readings
Power Management : Supported
Power Draw : 10.46 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Default Applications Clocks
Graphics : 1417 MHz
Memory : 5005 MHz
Max Clocks
Graphics : 1911 MHz
SM : 1911 MHz
Memory : 5005 MHz
Video : 1708 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Processes

Possibly related: when I created /etc/X11/xorg.conf to ensure only the first GPU was used for graphics, the screen became slow and almost unresponsive. This suggests the 1 MiB in use on the second GPU is for graphics, so perhaps there is some sort of memory leak?
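In case it helps, this is roughly the xorg.conf Device section I mean (a minimal sketch; the BusID "PCI:2:0:0" matches the NVS 310’s 0000:02:00.0 address reported by nvidia-smi above, and other details of a full xorg.conf are omitted):

```
Section "Device"
    Identifier "NVS310"
    Driver     "nvidia"
    # Pin X to the NVS 310; bus address taken from nvidia-smi (0000:02:00.0)
    BusID      "PCI:2:0:0"
EndSection
```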

Any help would be greatly appreciated!

You don’t mention, did you try a reboot of the computer? Try that first.

After a reboot, try running gpu-burn and see if that reports any errors. If so, chances are that indeed the GPU might have some issue.

I’ve tried everything I can think of, including rebooting, re-installing the CUDA drivers, upgrading to the latest CUDA driver,…

This is the latest test with gpu_burn (DEVICE=0 is the 12 GB TITAN X, and DEVICE=1 is the graphics GPU):

manuel@Manuel:~$ export CUDA_VISIBLE_DEVICES=0
manuel@Manuel:~$ cd software/gpu_burn-0.6/
manuel@Manuel:~/software/gpu_burn-0.6$ ./gpu_burn
Run length not specified in the command line. Burning for 10 secs
GPU 0: NVS 310 (UUID: GPU-31c40fb3-9f72-c669-a624-4c7a27bd1d38)
GPU 1: TITAN X (Pascal) (UUID: GPU-88604023-7765-cc1f-a071-5791997681d5)
Initialized device 0 with 12189 MB of memory (12003 MB available, using 10802 MB of it), using FLOATS
Failure during compute: Error in “SGEMM”: CUBLAS_STATUS_INTERNAL_ERROR
0.0% proc’d: -1 errors: -1 (DIED!) temps: 40 C

No clients are alive! Aborting

This fails with a CUBLAS internal error, but that isn’t new: if I try to run one of the samples provided by NVIDIA (namely matrixMulCUBLAS) I get:

manuel@Manuel:/usr/local/cuda-8.0/samples/0_Simple/matrixMulCUBLAS$ ./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting…
GPU Device 0: “TITAN X (Pascal)” with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS…done.
CUDA error at matrixMulCUBLAS.cpp:303 code=77(cudaErrorIllegalAddress) “cudaEventSynchronize(stop)”

UPDATE: It turns out the GPU fails on all samples. For example, running the NVIDIA sample vectorAdd I get:

Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Failed to copy vector C from device to host (error code an illegal memory access was encountered)!

Illegal memory access seems to be the trend. Perhaps there is something wrong with the hardware :(

Is this happening ONLY on the Titan X or also on the NVS 310? If it happens on the NVS 310 as well, it would sound like perhaps some software/driver issue.

Another thread suggested that deleting files related to JIT-caching resolved some errors when running CUDA apps:
https://devtalk.nvidia.com/default/topic/1003878/cuda-setup-and-installation/problem-with-cuda-8-with-381-09-drivers-on-ubuntu-16-04-gtx-1080ti/

(the link below explains more on that concept in case you are curious):
https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/
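In short, the idea from those links is that the driver’s JIT cache lives under ~/.nv/ComputeCache by default (overridable via CUDA_CACHE_PATH), and clearing it forces the driver to re-JIT everything. A hedged sketch of what that looks like (the destructive step is commented out; adjust paths for your setup):

```shell
# Locate the CUDA JIT cache: CUDA_CACHE_PATH if set,
# otherwise the default ~/.nv/ComputeCache.
CACHE_DIR="${CUDA_CACHE_PATH:-$HOME/.nv/ComputeCache}"
echo "JIT cache directory: $CACHE_DIR"

# Remove the cached binaries so the driver re-JITs on the next run:
# rm -rf "$CACHE_DIR"

# Or disable the cache entirely for a single test run:
# CUDA_CACHE_DISABLE=1 ./matrixMulCUBLAS
```

If the failures disappear after clearing the cache, a stale JIT-compiled binary was the culprit rather than the hardware.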

My other suggestion is to test on Windows with the same hardware. Load any version of Windows… 7, 8.1, or 10 and test CUDA with NVIDIA drivers on a fresh install.

The plot thickens…

I tried the options you suggested, but so far nothing seems to work. However…

It seems that sometimes the gpu_burn test is OK for both GPUs, but if I run it again the test fails on the TITAN X…

Also, I have some software that uses the GPU, and after re-installing the driver it appears to run correctly and does not fail. (I haven’t verified that the results are 100% correct, but they seem fine.) How is this possible? Yet if I run the sample that comes with the NVIDIA toolkit:

manuel@Manuel:/usr/local/cuda-8.0/samples/0_Simple/matrixMulCUBLAS$ ./matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting…
GPU Device 0: “TITAN X (Pascal)” with compute capability 6.1

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS…done.
CUDA error at matrixMulCUBLAS.cpp:303 code=77(cudaErrorIllegalAddress) “cudaEventSynchronize(stop)”

I’m baffled, but honestly I don’t have enough experience with GPUs to understand it either…

Your last post is very generic, so I can’t offer much advice…

“It seems that sometimes the gpu_burn test is OK for both GPUs, but if I run it again the test fails on the TITAN X”

  1. Does that mean that the gpu-burn program runs and encounters errors on the Titan X? Or does it fail like it did in post #9, where it never runs at all?

  2. What is the output of gpu burn when run ONLY on NVS310?

  3. What is the output of any other CUDA sample run ONLY on NVS310?
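For 2 and 3, the easiest way to pin a test to one card is CUDA_VISIBLE_DEVICES (note this uses the CUDA enumeration order, which can differ from nvidia-smi’s order). A sketch, assuming the NVS 310 is CUDA device 0 as in your gpu_burn listing:

```shell
# Expose only the NVS 310 (CUDA device 0 per the gpu_burn listing)
# to CUDA programs; the Titan X becomes invisible to them.
export CUDA_VISIBLE_DEVICES=0
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# Then re-run the tests on that card alone, e.g.:
# ./gpu_burn 60
# /usr/local/cuda-8.0/samples/0_Simple/vectorAdd/vectorAdd
```

If the NVS 310 passes everything and the Titan X alone fails, that points at the card rather than the driver stack.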

“Also, I have some software that uses the GPU, and after re-installing the driver it appears to run correctly and does not fail.”

Which software is this? Are you sure it’s actually running on the Titan X and not on the NVS310?