Interesting holiday mystery: we're helping an org prove out GPUs in their data center, and while nvidia-smi runs successfully, any basic attempt to create a CUDA context or call cudaMalloc fails. We suspect the vGPU setup, since our reference setup works on a Google Cloud RHEL 8.x node with a similar install flow.
We're unsure where to go from here, so any ideas are welcome! It's hard for us to get anything that isn't dockerized or properly packaged onto the host, so if you have diagnostic ideas, ideally ones we can containerize. (Ex: we got the CUDA sample tests running via nvidia-docker, but it's hard outside docker because RHEL makes it painful to install the old gcc 7 toolchain.)
Some lingering ideas:
– maybe 2Q is the wrong size? or we’re using the wrong license type?
– are there bios settings we need to check/tweak?
– maybe there is a way to test CUDA context creation at the hypervisor/RHEL level that isn't painful (e.g., no need to port the gcc toolchain)? rough sketch of one idea below the list
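A minimal sketch of that last idea (assuming python3 plus the driver's libcuda.so.1 are present; it pokes the CUDA driver API directly via ctypes, so no CUDA toolkit or gcc is needed, and it should run on the bare RHEL node or inside a container):

import ctypes

# Load the driver API that ships with the NVIDIA driver itself
# (no CUDA toolkit / gcc required).
cuda = ctypes.CDLL("libcuda.so.1")

print("cuInit:", cuda.cuInit(0))  # 0 == CUDA_SUCCESS

dev = ctypes.c_int(0)
ctx = ctypes.c_void_p()
dptr = ctypes.c_void_p()

print("cuDeviceGet:", cuda.cuDeviceGet(ctypes.byref(dev), 0))
# Context creation + a 1 MiB allocation: the driver-API analogue of the
# cudaEventCreate/cudaMalloc calls that are failing for us.
print("cuCtxCreate:", cuda.cuCtxCreate_v2(ctypes.byref(ctx), 0, dev))
print("cuMemAlloc:", cuda.cuMemAlloc_v2(ctypes.byref(dptr), ctypes.c_size_t(1 << 20)))

If cuCtxCreate already returns a non-zero code here, that would at least tell us the failure sits below the runtime/toolkit layer.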
=====
It’s a tricky yet standard enterprise env, so we’d like to get this figured out as a template for future apps:
– V100 GPU
– esxi 6.7
– rhel 8.3
– vGPU 10.4 driver bundle (=> 440.121 vGPU manager + 440.118 linux driver)
– testing vGPU partition of size 2Q for headless compute tasks (CUDA → nvidia rapids)
– the license manager is still being set up: we tried setting the license type to 0 (unlicensed), and also to 1 and 2
– docker with the nvidia runtime set as default (docker 19.04, same versions that work on another rhel 8.3 GPU node)
Some diagnostics so far:
- The license manager is currently disabled (type=0). We expected degraded-but-working performance for the later steps because of this, but not a complete failure (quick config sanity check sketched below the log lines):
Errors are generally:
nvidia-gridd... Acquiring license (Quadro Virtual Data Center Workstation)
nvidia-gridd... Failed to acquire/renew license from license server... Requested feature was not found
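For the licensing piece, here's the quick config sanity check we plan to run on the RHEL node (a sketch assuming the standard /etc/nvidia/gridd.conf that the nvidia-gridd service reads, and the usual key names):

# Print the licensing-related settings nvidia-gridd is actually picking up.
# (Assumes the stock /etc/nvidia/gridd.conf path; adjust if yours differs.)
keys = ("FeatureType", "ServerAddress", "ServerPort", "BackupServerAddress")
with open("/etc/nvidia/gridd.conf") as conf:
    for line in conf:
        line = line.strip()
        if line and not line.startswith("#") and line.split("=", 1)[0].strip() in keys:
            print(line)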
- Hypervisor:
nvidia-smi
shows no CUDA version, which seems odd:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.121 Driver Version: 440.121 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-PCIE... On | 00000000:AF:00.0 Off | 0 |
| N/A 35C P0 26W / 250W | 39MiB / 16383MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
- RHEL VM (the vGPU guest):
nvidia-smi
does report a CUDA version, along with the host/guest driver-version mismatch expected for the vGPU 10.4 release, but oddly no temperature/wattage:
NVIDIA-SMI 440.118.02, Driver Version 440.118.02, CUDA Version 10.2
0 GRID V100-2Q
P0, 2GB partition w/ 160MB allocated, but missing temp/wattage we see in the hypervisor’s nvidia-smi
More nvidia-smi details (NVML sketch after this list):
Display Mode: Enabled
Display Active: Disabled
Persistence Mode: Enabled
Accounting Mode: Disabled
Driver Model (current/pending): N/A / N/A
VBIOS Version: 00.00.00.00.00
MultiGPU Board: No (single-GPU test node)
GPU Part Number: N/A
GPU Virtualization Mode: VGPU
Host VGPU Mode: N/A
Product Name: Quadro Virtual Data Center Workstation
License Status: Unlicensed
PCI Bus: 0x02
GPU Link Info: N/A
ECC Mode: Enabled
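Since nvidia-smi (which talks to NVML) clearly sees the vGPU while CUDA calls do not, we may also script the NVML side so we can log it alongside the CUDA probes; a sketch assuming the pynvml package is available (e.g. pip install nvidia-ml-py3 in our image):

import pynvml  # assumption: installed via pip install nvidia-ml-py3

# NVML is the library behind nvidia-smi; if these calls succeed while
# cuInit/cudaMalloc fail, the breakage is isolated to the CUDA path.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
print("name:", pynvml.nvmlDeviceGetName(handle))
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
print("fb total/used (MiB):", mem.total >> 20, mem.used >> 20)
pynvml.nvmlShutdown()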
- Docker tests: nvidia-smi still works:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:10.2 nvidia-smi
- CUDA tests: a) CUDA samples (within nvidia-docker):
./bandwidthTest
Starting…
Running on…
Device 0:…
Quick Mode
CUDA error at bandwidthTest.cu:686 code=46 (cudaErrorDevicesUnavailable) “cudaEventCreate(&start)”
./reduction
./reduction Starting…
GPU Device 0: “Volta” with compute capability 7.0
Using Device 0: GRID V100-2Q
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks
CUDA error at reduction.cpp:492 code=46 (cudaErrorDevicesUnavailable) “cudaMalloc((void**)&d_idata, bytes)”
b) numba/cupy/cudf/etc. fail on context/memory creation (fuller sketch below):
from numba import cuda
cuda.current_context()  # fails on context creation
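A slightly fuller version of that check that we can drop into any of our containers (just a sketch; the try/except only surfaces the raw driver error text instead of a long traceback):

from numba import cuda

try:
    cuda.detect()                      # enumerates devices through the driver API
    ctx = cuda.current_context()       # the call that actually fails for us
    print("context OK on:", ctx.device)
except Exception as exc:
    print("context/driver call failed:", repr(exc))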