Ubuntu 16.04, CUDA 10.1, GV100GL (Tesla V100): all CUDA-capable devices are busy or unavailable

root@mky-KVM:~# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.7 LTS"

root@mky-KVM:~# nvidia-smi
Mon Dec 28 09:52:00 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   39C    P0    36W / 250W |      0MiB / 32480MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

root@mky-KVM:~# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Fri_Feb__8_19:08:17_PST_2019
Cuda compilation tools, release 10.1, V10.1.105

root@mky-KVM:~# lspci

00:0c.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 32GB] (rev a1)

deviceQuery runs normally:

root@mky-KVM:~/NVIDIA_CUDA-10.1_Samples# /root/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release/deviceQuery
/root/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla V100-PCIE-32GB"
CUDA Driver Version / Runtime Version 10.1 / 10.1
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 32480 MBytes (34058272768 bytes)
(80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 7 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 12
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 1
Result = PASS

bandwidthTest fails with cudaErrorDevicesUnavailable:

root@mky-KVM:~/NVIDIA_CUDA-10.1_Samples# /root/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: Tesla V100-PCIE-32GB
Quick Mode

CUDA error at bandwidthTest.cu:730 code=46(cudaErrorDevicesUnavailable) "cudaEventCreate(&start)"
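
To check whether the failure is specific to bandwidthTest or happens on any call that needs a working CUDA context, a small probe outside the samples may help. The sketch below is only an assumption-laden illustration: it loads the CUDA runtime through Python's ctypes and calls cudaFree(0), which is commonly used to force runtime initialization, then prints the returned error code with its name and message. The library name "libcudart.so" and the helper names probe_context and main are my own assumptions, not something from the output above; the library must be findable by the loader (for example via LD_LIBRARY_PATH).

```python
import ctypes
from typing import Tuple


def probe_context(cudart) -> Tuple[int, str, str]:
    """Call cudaFree(0) and return (code, error name, error message).

    `cudart` is a loaded CUDA runtime library handle (hypothetical helper,
    not part of the CUDA samples).
    """
    # Declare the C signatures so ctypes converts arguments and results.
    cudart.cudaFree.argtypes = [ctypes.c_void_p]
    cudart.cudaFree.restype = ctypes.c_int
    cudart.cudaGetErrorName.argtypes = [ctypes.c_int]
    cudart.cudaGetErrorName.restype = ctypes.c_char_p
    cudart.cudaGetErrorString.argtypes = [ctypes.c_int]
    cudart.cudaGetErrorString.restype = ctypes.c_char_p

    # Freeing a NULL pointer releases nothing; the call is used here only
    # to make the runtime initialize and report any error it hits.
    code = cudart.cudaFree(None)
    name = cudart.cudaGetErrorName(code).decode()
    text = cudart.cudaGetErrorString(code).decode()
    return code, name, text


def main(lib_name: str = "libcudart.so") -> None:
    # Assumption: the unversioned runtime library is on the loader path.
    try:
        cudart = ctypes.CDLL(lib_name)
    except OSError as exc:
        print(f"could not load {lib_name}: {exc}")
        return
    code, name, text = probe_context(cudart)
    print(f"cudaFree(0) -> code={code} ({name}): {text}")


if __name__ == "__main__":
    main()
```

If this probe also reports code 46, the problem is likely in context creation on this VM rather than in bandwidthTest or in the application that runs afterwards; if it returns 0, the failure is probably tied to something those programs do later.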

The inference framework (darknet) always reports:
all CUDA-capable devices are busy or unavailable

root@mky-KVM:~/darknet# ./darknet detect cfg/yolov4.cfg yolov4.weights data/dog.jpg -gpus 0
CUDA-version: 10010 (10010), cuDNN: 7.6.5, GPU count: 1
OpenCV version: 3.4.4
0 : compute_capability = 700, cudnn_half = 0, GPU: Tesla V100-PCIE-32GB
net.optimized_memory = 0
mini_batch = 1, batch = 8, time_steps = 1, train = 0
layer filters size/strd(dil) input output
0 Try to set subdivisions=64 in your cfg-file.
CUDA status Error: file: ./src/dark_cuda.c : () : line: 373 : build time: Dec 26 2020 - 15:01:10

CUDA Error: all CUDA-capable devices are busy or unavailable
CUDA Error: all CUDA-capable devices are busy or unavailable: Success
darknet: ./src/utils.c:331: error: Assertion `0' failed.
Aborted (core dumped)