Holiday mystery - V100 vGPU GRID prove-out fails: cannot create CUDA contexts / allocate memory in most CUDA sample kit entries (VMware/RHEL/CUDA 10.2)

Interesting holiday mystery: we are helping an org prove out GPUs in their data center, and while nvidia-smi runs successfully, any basic creation of a CUDA context / cudaMalloc fails. We suspect it’s something in the vGPU setup, as our reference setup works on a Google Cloud RHEL 8.x node w/ a similar install flow.

We’re unsure where to go from here, so any ideas are welcome! It’s hard for us to get non-Docker / packaged items onto the host, so if you have diagnostic ideas, ideally they’re ones we can containerize. (Ex: we got the CUDA sample tests running via nvidia-docker, but they’re hard to build outside Docker because RHEL makes it difficult to install the old gcc 7 toolchain.)

Some lingering ideas:
– maybe 2Q is the wrong size, or we’re using the wrong license type?
– are there BIOS settings we need to check/tweak?
– maybe there is a way to test CUDA context creation at the hypervisor/RHEL level that isn’t hard (e.g., no need to port the gcc toolchain)? (see the sketch right after this list)
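
One compiler-free idea we’ve been considering (a minimal sketch, assuming libcudart.so.10.2 is loadable, e.g. inside the nvidia/cuda 10.2 container or wherever the CUDA 10.2 runtime lives; the library name/path may need adjusting): drive the CUDA runtime API directly through Python’s ctypes, so context creation and a small cudaMalloc can be exercised without gcc or the sample kit.

# cuda_probe.py -- hedged sketch: exercise CUDA context creation and a small
# allocation via the CUDA runtime API through ctypes (no compiler needed).
# Assumes libcudart.so.10.2 is resolvable; adjust the library name/path for your install.
import ctypes

rt = ctypes.CDLL("libcudart.so.10.2")
rt.cudaGetErrorString.restype = ctypes.c_char_p

def check(label, code):
    # cudaGetErrorString maps the runtime error code to its name/description
    print(f"{label}: code={code} ({rt.cudaGetErrorString(code).decode()})")
    return code == 0

# cudaFree(0) is the usual trick to force lazy context creation on device 0
check("context creation via cudaFree(0)", rt.cudaFree(ctypes.c_void_p(0)))

# a small device allocation, mirroring what the samples fail on
ptr = ctypes.c_void_p()
check("cudaMalloc(1 MiB)", rt.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(1 << 20)))

If this reports the same code 46 (cudaErrorDevicesUnavailable) as the samples, the problem sits below the application/toolchain layer.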

=====

It’s a tricky yet standard enterprise env, so we’d like to get this figured out as a template for future apps:
– V100 GPU
– ESXi 6.7
– RHEL 8.3
– vGPU 10.4 driver bundle (=> 440.121 vGPU manager + 440.118 Linux guest driver)
– testing a vGPU partition of size 2Q for headless compute tasks (CUDA → NVIDIA RAPIDS)
– license manager is still being set up: we tried setting the license type to 0 (unlicensed), then 1 and 2
– Docker w/ the nvidia runtime set as default (docker 19.04, same versions that work on another RHEL 8.3 GPU node)

Some diagnostics so far:

  1. License manager is currently disabled (type=0). We expected degraded-but-working performance for our later steps because of this, but not outright failure:

Errors are generally:

nvidia-gridd... Acquiring license  (Quadro Virtual Data Center Workstation)
nvidia-gridd... Failed to acquire/renew license from license server... Requested feature was not found
  2. Hypervisor: nvidia-smi shows no CUDA version, which seems odd:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.121      Driver Version: 440.121      CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |     39MiB / 16383MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
  3. RHEL VM (the Docker host): nvidia-smi does report CUDA, plus the host/guest driver-version mismatch expected for the vGPU 10.4 release, and, oddly, no temperature/wattage:

NVIDIA-SMI 440.118.02 / Driver Version 440.118.02 / CUDA Version 10.2
GPU 0: GRID V100-2Q

P0, 2GB partition w/ 160MB allocated, but missing the temp/wattage we see in the hypervisor’s nvidia-smi

More nvidia-smi details:

Display Mode: Enabled
Display Active: Disabled
Persistence Mode: Enabled
Accounting Mode: Disabled
Driver Model (current / pending): N/A / N/A
VBIOS Version: 00.00.00.00.00
MultiGPU Board: No (single-GPU test node)
GPU Part Number: N/A
GPU Virtualization Mode: vGPU
Host vGPU Mode: N/A
Product Name: Quadro Virtual Data Center Workstation
License Status: Unlicensed (the helper sketch below polls this)
PCI Bus: 0x02
GPU Link Info: N/A
ECC Mode: Enabled
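
A small helper we can containerize to poll the fields above while licensing gets sorted (a sketch; it just greps the `nvidia-smi -q` text for the field names visible in the dump above, so the exact strings may need tweaking on other driver branches):

# license_check.py -- sketch: pull the licensing-related lines out of `nvidia-smi -q`
# so they can be watched from inside a container while the license server comes up.
import subprocess

def licensing_fields():
    out = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True, check=True).stdout
    wanted = ("Product Name", "License Status", "Virtualization Mode")
    return [line.strip() for line in out.splitlines()
            if line.strip().startswith(wanted)]

if __name__ == "__main__":
    for field in licensing_fields():
        print(field)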

  4. Docker tests: nvidia-smi still works:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:10.2 nvidia-smi

  5. CUDA tests: a) CUDA samples (within nvidia-docker):

./bandwidthTest
Starting…
Running on…
Device 0:…
Quick Mode

CUDA error at bandwidthTest.cu:686 code=46 (cudaErrorDevicesUnavailable) “cudaEventCreate(&start)”

./reduction
./reduction Starting…
GPU Device 0: “Volta” with compute capability 7.0
Using Device 0: GRID V100-2Q
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks

CUDA error at reduction.cpp:492 code=46 (cudaErrorDevicesUnavailable) “cudaMalloc((void**)&d_idata, bytes)”

b) numba/cupy/cudf/etc fail on context / memory creation:

from numba import cuda
cuda.current_context() #fails
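
For anyone reproducing this, a slightly fuller version of that probe (a sketch; numba is the only hard dependency, and cupy is only exercised if it imports):

# context_probe.py -- sketch: surface the context-creation / allocation errors
# from Python so the failure mode is visible without building the CUDA sample kit.
from numba import cuda

try:
    ctx = cuda.current_context()       # context creation -- this is the call that fails for us
    print("numba context OK on:", ctx.device)
    d_arr = cuda.device_array(1024)    # small device allocation (the cudaMalloc analogue)
    print("numba allocation OK:", d_arr.shape)
except Exception as exc:
    print("numba CUDA probe failed:", type(exc).__name__, exc)

try:
    import cupy as cp                  # optional second opinion via cupy, if installed
    x = cp.zeros(1024)
    print("cupy allocation OK:", x.nbytes, "bytes")
except Exception as exc:
    print("cupy probe failed:", type(exc).__name__, exc)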

Hi

The “Q” Profiles (QvDWS, Quadro Virtual Datacenter Workstation) are the highest license tier and give the maximum functionality available. However, you may also want to look at the “C” Profiles (vCS, Virtual Compute Server), as these are specifically for compute-focused workloads. They’re considerably cheaper, and they’re licensed per GPU, not per user like all the other licenses. You can run up to 8x VMs on the same GPU with a single vCS license, so they may be more appropriate for your workloads. For the time being, though, “Q” will be fine.

I’m not sure why you’re running vGPU 10.4? The 10.x branch is only supported until December 2020, meaning you have 3 days (at time of writing) of potential support remaining. If you use the current branch of 11.x (11.2 is most recent), this is an LTSB that runs to 2023 (not that you’d want to stay on the same driver for that long mind). 11.x also supports CUDA 11.0.

What you’re experiencing is correct: the CUDA version is not listed in the Hypervisor, as CUDA workloads aren’t run from there. What you install in the Hypervisor is a GPU manager rather than an actual driver (of sorts), which is why you can see the CUDA version within the VM; the VM’s guest driver is what contains CUDA.

The 2Q (2GB) Profile … You can of course use a 2GB Profile (if a 2GB framebuffer is sufficient for your workload); however, depending on how many VMs you plan to run, you may benefit from changing the Scheduler mode to one that allows more consistent resource scheduling. Compute workloads are typically quite intensive. The default Scheduler (Best Effort) will try to service all processing requests as they come in, and for 3D / graphical workloads it does a pretty good job, but it can easily become overloaded by multiple consistently high processing requests, which leads to inconsistent processing times. So perhaps switch to “Fixed” or “Equal” share mode to give more predictable performance. If you were running an A100, then this would be a different conversation due to MIG and SR-IOV.

Regarding CUDA performance, again, what you’re experiencing is correct: you need to get it licensed, and then it will start to work. There is a severe drop-off in functionality when the vGPUs are not licensed. Licensing should be the first element completed when deploying a vGPU environment, as obtaining production licenses can sometimes take a while and cause delays. Evaluation licenses, on the other hand, are pretty quick to get hold of, and if you are facing delays you can get those yourself by signing up for a 90-day evaluation here: NVIDIA Enterprise Account Registration.
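
For reference, guest-side licensing in the 10.x era is configured in /etc/nvidia/gridd.conf and picked up when the nvidia-gridd service restarts. Something along these lines (placeholders only; check the gridd.conf.template that ships with the guest driver for the exact keys in your release):

# /etc/nvidia/gridd.conf (guest VM) -- illustrative sketch, placeholders only
# ServerAddress points at your license server; 7070 was the legacy license server's default port
ServerAddress=<license-server-hostname-or-IP>
ServerPort=7070
# FeatureType is the value you were already toggling (0 = unlicensed; 1/2 per the licensed product)
FeatureType=1
EnableUI=FALSE

Then restart nvidia-gridd and watch the same log that currently shows “Failed to acquire/renew license” until it reports a successful acquisition.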

Hope that helps

Regards

MG


Wanted to report we’re good now; this helped. The base issue was our misunderstanding of what “degraded performance” means for unlicensed mode during setup. The app uses CUDA, and unlicensed mode fully disables CUDA rather than just degrading it, hence the failed context creation. Everything worked once we got licensing up!