Holiday mystery - V100 vGPU GRID prove-out fails: cannot create CUDA contexts / allocate memory in most CUDA sample kit entries (VMware/RHEL/CUDA 10.2)

Interesting holiday mystery: we are helping an org prove out GPUs in their data center, and while nvidia-smi runs successfully, any basic creation of a CUDA context / cudaMalloc fails. We suspect it’s something in the vGPU setup, as our reference setup works on a Google Cloud RHEL 8.x node w/ a similar install flow.

We’re unsure where to go from here, so any ideas are welcome! It’s hard for us to get non-Docker / packaged items onto the host, so if you have diagnostic ideas, ideally they’re ones we can containerize. (Ex: we got the CUDA sample tests running via nvidia-docker, but they’re hard to build outside Docker because RHEL makes it difficult to install the old gcc 7 toolchain.)

Some lingering ideas:
– maybe 2Q is the wrong size, or we’re using the wrong license type?
– are there BIOS settings we need to check/tweak?
– maybe there is a way to test CUDA context creation at the hypervisor/RHEL level that isn’t hard (e.g., no need to port the gcc toolchain)? (see the sketch right after this list)
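
One compiler-free idea we’ve been considering (a minimal sketch, assuming libcudart.so.10.2 is loadable, e.g. inside the nvidia/cuda 10.2 container or wherever the CUDA 10.2 runtime lives; the library name/path may need adjusting): drive the CUDA runtime API directly through Python’s ctypes, so context creation and a small cudaMalloc can be exercised without gcc or the sample kit.

# cuda_probe.py -- hedged sketch: exercise CUDA context creation and a small
# allocation via the CUDA runtime API through ctypes (no compiler needed).
# Assumes libcudart.so.10.2 is resolvable; adjust the library name/path for your install.
import ctypes

rt = ctypes.CDLL("libcudart.so.10.2")
rt.cudaGetErrorString.restype = ctypes.c_char_p

def check(label, code):
    # cudaGetErrorString maps the runtime error code to its name/description
    print(f"{label}: code={code} ({rt.cudaGetErrorString(code).decode()})")
    return code == 0

# cudaFree(0) is the usual trick to force lazy context creation on device 0
check("context creation via cudaFree(0)", rt.cudaFree(ctypes.c_void_p(0)))

# a small device allocation, mirroring what the samples fail on
ptr = ctypes.c_void_p()
check("cudaMalloc(1 MiB)", rt.cudaMalloc(ctypes.byref(ptr), ctypes.c_size_t(1 << 20)))

If this reports the same code 46 (cudaErrorDevicesUnavailable) as the samples, the problem sits below the application/toolchain layer.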

=====

It’s a tricky yet standard enterprise env, so we’d like to get this figured out as a template for future apps:
– V100 GPU
– ESXi 6.7
– RHEL 8.3
– vGPU 10.4 driver bundle (=> 440.121 vGPU manager + 440.118 Linux guest driver)
– testing a vGPU partition of size 2Q for headless compute tasks (CUDA → NVIDIA RAPIDS)
– license manager is still being set up: we tried setting the license type to 0 (unlicensed), then 1 and 2
– Docker w/ the nvidia runtime set as default (docker 19.04, same versions that work on another RHEL 8.3 GPU node)

Some diagnostics so far:

  1. License manager is currently disabled (type=0). We expected degraded-but-working performance for our later steps because of this, but not outright failure:

Errors are generally:

nvidia-gridd... Acquiring license  (Quadro Virtual Data Center Workstation)
nvidia-gridd... Failed to acquire/renew license from license server... Requested feature was not found
  2. Hypervisor: nvidia-smi shows no CUDA version, which seems odd:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.121      Driver Version: 440.121      CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:AF:00.0 Off |                    0 |
| N/A   35C    P0    26W / 250W |     39MiB / 16383MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
  3. RHEL VM (the Docker host): nvidia-smi does report CUDA, plus the host/guest driver-version mismatch expected for the vGPU 10.4 release, and, oddly, no temperature/wattage:

NVIDIA-SMI 440.118.02 / Driver Version 440.118.02 / CUDA Version 10.2
GPU 0: GRID V100-2Q

P0, 2GB partition w/ 160MB allocated, but missing the temp/wattage we see in the hypervisor’s nvidia-smi

More nvidia-smi details:

Display Mode: Enabled
Display Active: Disabled
Persistence Mode: Enabled
Accounting Mode: Disabled
Driver Model (current / pending): N/A / N/A
VBIOS Version: 00.00.00.00.00
MultiGPU Board: No (single-GPU test node)
GPU Part Number: N/A
GPU Virtualization Mode: vGPU
Host vGPU Mode: N/A
Product Name: Quadro Virtual Data Center Workstation
License Status: Unlicensed (the helper sketch below polls this)
PCI Bus: 0x02
GPU Link Info: N/A
ECC Mode: Enabled
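
A small helper we can containerize to poll the fields above while licensing gets sorted (a sketch; it just greps the `nvidia-smi -q` text for the field names visible in the dump above, so the exact strings may need tweaking on other driver branches):

# license_check.py -- sketch: pull the licensing-related lines out of `nvidia-smi -q`
# so they can be watched from inside a container while the license server comes up.
import subprocess

def licensing_fields():
    out = subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True, check=True).stdout
    wanted = ("Product Name", "License Status", "Virtualization Mode")
    return [line.strip() for line in out.splitlines()
            if line.strip().startswith(wanted)]

if __name__ == "__main__":
    for field in licensing_fields():
        print(field)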

  4. Docker tests: nvidia-smi still works:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:10.2 nvidia-smi

  5. CUDA tests: a) CUDA samples (within nvidia-docker):

./bandwidthTest
Starting…
Running on…
Device 0:…
Quick Mode

CUDA error at bandwidthTest.cu:686 code=46 (cudaErrorDevicesUnavailable) “cudaEventCreate(&start)”

./reduction
./reduction Starting…
GPU Device 0: “Volta” with compute capability 7.0
Using Device 0: GRID V100-2Q
Reducing array of type int
16777216 elements
256 threads (max)
64 blocks

CUDA error at reduction.cpp:492 code=46 (cudaErrorDevicesUnavailable) “cudaMalloc((void**)&d_idata, bytes)”

b) numba/cupy/cudf/etc fail on context / memory creation:

from numba import cuda
cuda.current_context() #fails
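
For anyone reproducing this, a slightly fuller version of that probe (a sketch; numba is the only hard dependency, and cupy is only exercised if it imports):

# context_probe.py -- sketch: surface the context-creation / allocation errors
# from Python so the failure mode is visible without building the CUDA sample kit.
from numba import cuda

try:
    ctx = cuda.current_context()       # context creation -- this is the call that fails for us
    print("numba context OK on:", ctx.device)
    d_arr = cuda.device_array(1024)    # small device allocation (the cudaMalloc analogue)
    print("numba allocation OK:", d_arr.shape)
except Exception as exc:
    print("numba CUDA probe failed:", type(exc).__name__, exc)

try:
    import cupy as cp                  # optional second opinion via cupy, if installed
    x = cp.zeros(1024)
    print("cupy allocation OK:", x.nbytes, "bytes")
except Exception as exc:
    print("cupy probe failed:", type(exc).__name__, exc)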

Hi

The “Q” Profiles (QvDWS, Quadro Virtual Datacenter Workstation) are the highest license tier and give the maximum functionality available. However, you may also want to look at the “C” Profiles (vCS, Virtual Compute Server), as these are specifically for compute-focused workloads. They’re considerably cheaper, and they’re licensed per GPU, not per user like all the other licenses. You can run up to 8x VMs on the same GPU with a single vCS license, so they may be more appropriate for your workloads. For the time being, though, “Q” will be fine.

I’m not sure why you’re running vGPU 10.4? The 10.x branch is only supported until December 2020, meaning you have 3 days (at time of writing) of potential support remaining. If you use the current branch of 11.x (11.2 is most recent), this is an LTSB that runs to 2023 (not that you’d want to stay on the same driver for that long mind). 11.x also supports CUDA 11.0.

What you’re experiencing is correct: the CUDA version is not listed in the Hypervisor, as CUDA workloads aren’t run from there. What you install in the Hypervisor is a GPU manager rather than an actual driver (of sorts), which is why you can see the CUDA version within the VM; the VM’s guest driver is what contains CUDA.

The 2Q (2GB) Profile … You can of course use a 2GB Profile (if a 2GB framebuffer is sufficient for your workload); however, depending on how many VMs you plan to run, you may benefit from changing the Scheduler mode to one that allows more consistent resource scheduling. Compute workloads are typically quite intensive. The default Scheduler (Best Effort) will try to service all processing requests as they come in, and for 3D / graphical workloads it does a pretty good job, but it can easily become overloaded by multiple consistently high processing requests, which leads to inconsistent processing times. So perhaps switch to “Fixed” or “Equal” share mode to give more predictable performance. If you were running an A100, then this would be a different conversation due to MIG and SR-IOV.

Regarding CUDA performance, again, what you’re experiencing is correct: you need to get it licensed, and then it will start to work. There is a severe drop-off in functionality when the vGPUs are not licensed. Licensing should be the first element completed when deploying a vGPU environment, as obtaining production licenses can sometimes take a while and cause delays. Evaluation licenses, on the other hand, are pretty quick to get hold of, and if you are facing delays you can get those yourself by signing up for a 90-day evaluation here: NVIDIA Enterprise Account Registration.
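
For reference, guest-side licensing in the 10.x era is configured in /etc/nvidia/gridd.conf and picked up when the nvidia-gridd service restarts. Something along these lines (placeholders only; check the gridd.conf.template that ships with the guest driver for the exact keys in your release):

# /etc/nvidia/gridd.conf (guest VM) -- illustrative sketch, placeholders only
# ServerAddress points at your license server; 7070 was the legacy license server's default port
ServerAddress=<license-server-hostname-or-IP>
ServerPort=7070
# FeatureType is the value you were already toggling (0 = unlicensed; 1/2 per the licensed product)
FeatureType=1
EnableUI=FALSE

Then restart nvidia-gridd and watch the same log that currently shows “Failed to acquire/renew license” until it reports a successful acquisition.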

Hope that helps

Regards

MG


Wanted to report we’re good now; this helped. The base issue was our misunderstanding of what “degraded performance” means for unlicensed mode during setup. The app uses CUDA, and unlicensed mode fully disables CUDA rather than just degrading it, hence the failed context creation. Everything worked once we got licensing up!