A Python process takes the GPU and allocates memory, but GPU utilization stays at 0% and the process hangs.

A Python process is scheduled on the GPU. The process appears to have allocated memory, yet GPU utilization remains at 0% and the process hangs.

I just tried running this on dtadipa@lpdospml50107 and it has not progressed in the past twenty minutes:

$ python gpu_test.py

Specifying GPU

2018-10-23 14:06:40.081780: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-10-23 14:06:40.082365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: Tesla P100-PCIE-16GB major: 6 minor: 0 memoryClockRate(GHz): 1.3285
pciBusID: 0000:00:06.0
totalMemory: 15.90GiB freeMemory: 15.61GiB
2018-10-23 14:06:40.082408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2018-10-23 14:06:40.777651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-10-23 14:06:40.777719: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2018-10-23 14:06:40.777742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2018-10-23 14:06:40.778181: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15135 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0
2018-10-23 14:06:40.790282: I tensorflow/core/common_runtime/direct_session.cc:284] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:06.0, compute capability: 6.0

b/RandomStandardNormal: (RandomStandardNormal): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.791956: I tensorflow/core/common_runtime/placer.cc:886] b/RandomStandardNormal: (RandomStandardNormal)/job:localhost/replica:0/task:0/device:GPU:0
b/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.791988: I tensorflow/core/common_runtime/placer.cc:886] b/mul: (Mul)/job:localhost/replica:0/task:0/device:GPU:0
b: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792008: I tensorflow/core/common_runtime/placer.cc:886] b: (Add)/job:localhost/replica:0/task:0/device:GPU:0
a/RandomStandardNormal: (RandomStandardNormal): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792028: I tensorflow/core/common_runtime/placer.cc:886] a/RandomStandardNormal: (RandomStandardNormal)/job:localhost/replica:0/task:0/device:GPU:0
a/mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792047: I tensorflow/core/common_runtime/placer.cc:886] a/mul: (Mul)/job:localhost/replica:0/task:0/device:GPU:0
a: (Add): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792065: I tensorflow/core/common_runtime/placer.cc:886] a: (Add)/job:localhost/replica:0/task:0/device:GPU:0
c: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792084: I tensorflow/core/common_runtime/placer.cc:886] c: (MatMul)/job:localhost/replica:0/task:0/device:GPU:0
b/stddev: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792104: I tensorflow/core/common_runtime/placer.cc:886] b/stddev: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b/mean: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792122: I tensorflow/core/common_runtime/placer.cc:886] b/mean: (Const)/job:localhost/replica:0/task:0/device:GPU:0
b/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792141: I tensorflow/core/common_runtime/placer.cc:886] b/shape: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a/stddev: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792160: I tensorflow/core/common_runtime/placer.cc:886] a/stddev: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a/mean: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792179: I tensorflow/core/common_runtime/placer.cc:886] a/mean: (Const)/job:localhost/replica:0/task:0/device:GPU:0
a/shape: (Const): /job:localhost/replica:0/task:0/device:GPU:0
2018-10-23 14:06:40.792197: I tensorflow/core/common_runtime/placer.cc:886] a/shape: (Const)/job:localhost/replica:0/task:0/device:GPU:0
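
Since gpu_test.py itself is not attached, here is a minimal sketch of what it presumably contains, inferred from the op names in the placement log above (a and b as random normal tensors, c as their matmul, with log_device_placement enabled). The matrix size and session options are assumptions on my part.

import tensorflow as tf

print("Specifying GPU")

# Pin a small graph to the first GPU; op names match the placement log above.
with tf.device('/device:GPU:0'):
    a = tf.random_normal([10000, 10000], name='a')  # shape is an assumption
    b = tf.random_normal([10000, 10000], name='b')
    c = tf.matmul(a, b, name='c')

# log_device_placement=True produces the per-op placement lines shown above.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run(c))

With a script like this, the hang would correspond to sess.run(c) never returning: the placement log is printed and memory is allocated, but utilization stays at 0%.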

$ nvidia-smi

Tue Oct 23 14:29:13 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:06.0 Off |                    0 |
| N/A   43C    P0    34W / 250W |    353MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11077      C   python                                       343MiB |
+-----------------------------------------------------------------------------+
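
For what it's worth, one way to see where the process is stuck would be to dump the stacks of the hung python process (PID 11077 in the output above) while it sits at 0% utilization; this is only a suggestion, not something I have run here:

$ gdb -p 11077 -batch -ex 'thread apply all bt'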

Hello,

This is the community feedback forum. Please provide more information about your system and issue so that I can place it into the proper forum.

Thanks,
Tom