lromor
July 14, 2022, 9:15am
Hi, I’m using TorchServe, which relies on NVML to monitor some GPU metrics. For the GPU we have, it looks like the driver is not exporting that information. Is this the expected behavior? Depending on your answer, we’ll either have to modify the TorchServe code or wait for a bugfix on your side.
The related GitHub issue (with more info) can be found here:
opened 01:16PM - 04 Jul 22 UTC
closed 10:13PM - 26 Aug 22 UTC
bug
p1
### 🐛 Describe the bug
Sometimes NVML does not support monitoring queries for specific devices. Currently this causes the startup phase to fail.
### Error logs
```
2022-07-04T12:33:15,023 [ERROR] Thread-20 org.pytorch.serve.metrics.MetricCollector - Traceback (most recent call last):
File "ts/metrics/metric_collector.py", line 27, in <module>
system_metrics.collect_all(sys.modules['ts.metrics.system_metrics'], arguments.gpu)
File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 91, in collect_all
value(num_of_gpu)
File "/usr/local/lib/python3.6/dist-packages/ts/metrics/system_metrics.py", line 72, in gpu_utilization
statuses = list_gpus.device_statuses()
File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in device_statuses
return [device_status(device_index) for device_index in range(device_count)]
File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 67, in <listcomp>
return [device_status(device_index) for device_index in range(device_count)]
File "/usr/local/lib/python3.6/dist-packages/nvgpu/list_gpus.py", line 26, in device_status
temperature = nv.nvmlDeviceGetTemperature(handle, nv.NVML_TEMPERATURE_GPU)
File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 1956, in nvmlDeviceGetTemperature
_nvmlCheckReturn(ret)
File "/usr/local/lib/python3.6/dist-packages/pynvml/nvml.py", line 765, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.nvml.NVMLError_NotSupported: Not Supported
```
### Installation instructions
pytorch/torchserve:latest-gpu
### Model Packaging
N/A
### config.properties
_No response_
### Versions
```
------------------------------------------------------------------------------------------
Environment headers
------------------------------------------------------------------------------------------
Torchserve branch:
torchserve==0.6.0
torch-model-archiver==0.6.0
Python version: 3.6 (64-bit runtime)
Python executable: /usr/bin/python3
Versions of relevant python libraries:
future==0.18.2
numpy==1.19.5
nvgpu==0.9.0
psutil==5.9.1
requests==2.27.1
torch-model-archiver==0.6.0
torch-workflow-archiver==0.2.4
torchserve==0.6.0
wheel==0.30.0
**Warning: torch not present ..
**Warning: torchtext not present ..
**Warning: torchvision not present ..
**Warning: torchaudio not present ..
Java Version:
OS: N/A
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: N/A
CMake version: N/A
```
### Repro instructions
run:
```
torchserve --start --foreground --model-store model-store/
```
### Possible Solution
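Catch these NVML exceptions (for example `NVMLError_NotSupported`) in the metrics collection code so that an unsupported query does not abort startup.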
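
A minimal sketch of what handling those exceptions could look like, assuming each GPU metric is read through its own callable. The helper names `query_or_none` and `collect_device_metrics` are hypothetical and are not part of TorchServe or nvgpu; the only real API relied on is `pynvml.NVMLError`, the base class of the `NVMLError_NotSupported` seen in the traceback.

```python
from typing import Any, Callable, Dict, Optional, Tuple, Type

try:
    # pynvml is only importable where the NVML Python bindings are installed.
    import pynvml
    NVML_ERRORS: Tuple[Type[BaseException], ...] = (pynvml.NVMLError,)
except ImportError:
    # Without the bindings there is nothing NVML-specific to catch.
    NVML_ERRORS = ()


def query_or_none(query: Callable[..., Any], *args: Any,
                  errors: Tuple[Type[BaseException], ...] = NVML_ERRORS) -> Optional[Any]:
    """Run a single metric query (hypothetical helper).

    Returns None when the query raises one of `errors`, e.g. because the
    driver reports the metric as not supported for this device.
    """
    try:
        return query(*args)
    except errors:
        return None


def collect_device_metrics(device_index: int,
                           queries: Dict[str, Callable[[int], Any]],
                           errors: Tuple[Type[BaseException], ...] = NVML_ERRORS
                           ) -> Dict[str, Optional[Any]]:
    """Collect every metric for one device (hypothetical helper).

    A failing query yields None for that metric instead of aborting
    the whole collection.
    """
    return {name: query_or_none(query, device_index, errors=errors)
            for name, query in queries.items()}
```

With this shape, a device that rejects the temperature query would still report its other metrics, and the caller can decide whether to skip or log the `None` entries.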