CUDA initialization failed for PyTorch using NVIDIA Tesla M6 on ESXi 6.7

We are trying to run a Tesla M6 GPU on VMware vSphere Hypervisor (ESXi) 6.7.
The operating system on the VM is RHEL 8.6.

We have installed the following NVIDIA GPU driver:

Version: 470.129.06
Release Date: 2022.5.16
Operating System: Linux 64-bit RHEL 8
CUDA Toolkit: 11.4

The driver installed successfully and gives the following output when we run:
$ nvidia-smi

[screenshot of nvidia-smi output]

We also installed the compatible CUDA version, i.e. 11.4, from the link below:

Both the GPU driver and the CUDA toolkit were installed successfully; we verified CUDA by running:
$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Wed_Jun__2_19:15:15_PDT_2021
Cuda compilation tools, release 11.4, V11.4.48
Build cuda_11.4.r11.4/compiler.30033411_0
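
Since nvcc only reports the toolkit's compiler version, a small version check like the following (just a sketch using ctypes; the library name and LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64 path are assumptions about a default toolkit install) would also confirm that the installed driver supports the 11.4 runtime:

import ctypes

# Sketch: query the CUDA versions visible to the runtime library.
# Assumes the 11.4 toolkit's libcudart.so is resolvable, e.g. via
# LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64 (path is an assumption).
cudart = ctypes.CDLL("libcudart.so")

driver_ver = ctypes.c_int()
runtime_ver = ctypes.c_int()
cudart.cudaDriverGetVersion(ctypes.byref(driver_ver))    # highest CUDA version the driver supports
cudart.cudaRuntimeGetVersion(ctypes.byref(runtime_ver))  # version of the runtime library itself

print("driver supports CUDA:", driver_ver.value)   # e.g. 11040 means 11.4
print("runtime version:     ", runtime_ver.value)

For driver 470.129.06 the driver value should be at least 11040 (i.e. CUDA 11.4), matching the installed toolkit.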

But when we verify whether CUDA is working by running the CUDA Samples 11.4 deviceQuery test, it fails:
$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
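
Error code 3 from cudaGetDeviceCount corresponds to cudaErrorInitializationError. To see whether the failure already happens at the driver API level (below the toolkit and PyTorch), one can call cuInit directly; this is just a sketch using ctypes and assumes libcuda.so.1 installed by the 470 driver is on the loader path:

import ctypes

# Sketch: call the CUDA driver API directly, bypassing the runtime and PyTorch.
# Assumes libcuda.so.1 (installed by the 470.129.06 driver) is on the loader path.
cuda = ctypes.CDLL("libcuda.so.1")

rc = cuda.cuInit(0)
print("cuInit returned", rc)        # 0 = CUDA_SUCCESS

if rc == 0:
    count = ctypes.c_int()
    cuda.cuDeviceGetCount(ctypes.byref(count))
    print("device count:", count.value)

If cuInit itself returns a non-zero status here, the problem lies in the driver or in how the GPU is presented to the VM, not in the CUDA toolkit or the PyTorch build.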

We checked for errors using dmesg:
$ dmesg | grep -E "NVRM|nvidia"
[ 2.827680] nvidia: loading out-of-tree module taints kernel.
[ 2.827693] nvidia: module license 'NVIDIA' taints kernel.
[ 2.840425] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[ 2.850492] nvidia-nvlink: Nvlink Core is being initialized, major device number 242
[ 2.851684] nvidia 0000:02:01.0: enabling device (0300 -> 0302)
[ 2.853120] nvidia 0000:02:01.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 2.853420] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 470.129.06 Thu May 12 22:52:02 UTC 2022
[ 2.900185] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 470.129.06 Thu May 12 22:42:45 UTC 2022
[ 2.904634] [drm] [nvidia-drm] [GPU ID 0x00000201] Loading driver
[ 2.904637] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:02:01.0 on minor 1
[ 57.834904] nvidia-uvm: Loaded the UVM driver, major device number 240.

Another way we verified whether CUDA was working was by checking with PyTorch:
$ python3.8

>>> import torch
>>> torch.__version__
'1.11.0+cu113'
>>> torch.version.cuda
'11.3'
>>> torch.cuda.is_available()
/opt/platformx/sentiment_analysis/gpu_env/lib64/python3.8/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:112.)
  return torch._C._cuda_getDeviceCount() > 0
False
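
The UserWarning only says that driver initialization failed. One possible cause (an assumption on our part, not something the warning confirms) is that the /dev/nvidia* device nodes are missing or not accessible to the user running PyTorch. A quick check:

import glob, os, stat

# Sketch: list the NVIDIA device nodes the driver API needs
# (/dev/nvidiactl, /dev/nvidia0, /dev/nvidia-uvm, ...) and check access
# for the user that runs PyTorch.
for node in sorted(glob.glob("/dev/nvidia*")):
    st = os.stat(node)
    kind = "char-device" if stat.S_ISCHR(st.st_mode) else "other"
    access = "rw OK" if os.access(node, os.R_OK | os.W_OK) else "NOT accessible"
    print(node, oct(st.st_mode & 0o777), kind, access)

If /dev/nvidia0, /dev/nvidiactl or /dev/nvidia-uvm are missing or not readable/writable for that user, that alone could explain the initialization error.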

We also tried to change the compute mode and the virtualization mode using nvidia-smi, but neither operation is supported:

#nvidia-smi -i 0 -c 0
Setting compute mode to DEFAULT is not supported.
Unable to set the compute mode for GPU 00000000:02:01.0: Not Supported
Treating as warning and moving on.
All done.

#nvidia-smi -i 0 -vm 3
Setting virtualization mode is not supported for GPU 00000000:02:01.0.
Treating as warning and moving on.
All done.

#nvidia-smi -i 0 --virt-mode=3
Setting virtualization mode is not supported for GPU 00000000:02:01.0.
Treating as warning and moving on.
All done.

We have tried various GPU driver versions as well as CUDA toolkit versions (11.2 to 11.4), but the issue persists.
The main question is whether the Tesla M6 can run in a virtual machine, or whether it requires a physical machine.
Also, it's not clear from the documentation at:

In section 2.1, Supported NVIDIA GPUs and Validated Server Platforms, it is not mentioned whether the Tesla M6 supports running in a virtual machine.
It mentions:

	• GPUs based on the NVIDIA Maxwell™ graphic architecture:
	    ◦ Tesla M6 (NVIDIA Virtual Compute Server (vCS) is not supported.)

What is the difference between a virtual machine and NVIDIA Virtual Compute Server (vCS)? Will the Tesla M6 work on ESXi 6.7 for computation?

Thanks

Hi, have you gotten it working? I am having the same issue but am unsure what to do about it.

Thanks!