A30 unable to launch any kernel

demo.masouros · July 9, 2022, 11:05am

Hello everyone,

We are having issues with an A30 GPU card installed on our server. The specs of our system are the following:

Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz with 126GB of RAM
Linux kernel: 5.4.0-121-generic
1x NVIDIA A30
1x Tesla V100
CUDA version: 11.7 / Driver Version: 515.48.07
NVCC version: 11.7

We have blacklisted the nouveau driver, both within grub and in /etc/modprobe.d/blacklist-nvidia-nouveau.conf.
The system was working properly until last week; since then, the A30 has been unable to run any kernel. We re-installed all the drivers, but that did not fix the issue. nvidia-smi still recognizes the card, as shown below:

However, the A30 is not able to launch any kernel. Here is what we have tested:

We ran a simple kernel (matrix-mul from the GitHub repo kberkay/Cuda-Matrix-Multiplication, "Matrix Multiplication on GPU using Shared Memory considering Coalescing and Bank Conflicts") on both the A30 and the V100 (code also attached):
mm.cu (8.2 KB)

On the V100 the application runs successfully, as you can see below:

After the completion of the kernel we get a success message:

GPU time= 5.651456 ms
CPU time= 76829.059000 ms
Results are equal!

On the A30 the same kernel does not run on the GPU:

Moreover, we get the following output:

GPU time= -0.000000 ms
CPU time= 75701.587000 ms
NOT EQUAL
Results are NOT equal!

which shows that the GPU did not execute anything.

dmesg messages regarding nvidia and NVRM are the following:
nvidia:

pl4tinum@davinci:~$ dmesg | grep nvidia
[    3.207155] nvidia: loading out-of-tree module taints kernel.
[    3.207169] nvidia: module license 'NVIDIA' taints kernel.
[    3.244573] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    3.252388] nvidia-nvlink: Nvlink Core is being initialized, major device number 239
[    3.379058] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.48.07  Fri May 27 03:18:00 UTC 2022
[    3.714251] [drm] [nvidia-drm] [GPU ID 0x00003b00] Loading driver
[    5.999727] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:3b:00.0 on minor 1
[    5.999802] [drm] [nvidia-drm] [GPU ID 0x0000af00] Loading driver
[    6.942172] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:af:00.0 on minor 2
[    8.597268] nvidia-uvm: Loaded the UVM driver, major device number 236.
[   29.563178] audit: type=1400 audit(1657361126.061:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1711 comm="apparmor_parser"
[   29.563181] audit: type=1400 audit(1657361126.061:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1711 comm="apparmor_parser"

NVRM:

[    3.346856] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  515.48.07  Fri May 27 03:26:43 UTC 2022

We have also tested running GPU-accelerated NN layers from PyTorch, and we get the following warning:

/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:83: UserWarning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (Triggered internally at  ../c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
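
To check whether the failure comes from the driver itself rather than from PyTorch, one option is to call the driver API directly. The following is only a minimal sketch, assuming the driver library is installed as libcuda.so.1 (the usual name on Linux); the helper name cuda_driver_init_status is made up for this example. It calls cuInit(0) and, on failure, asks cuGetErrorString for the message.

```python
import ctypes


def cuda_driver_init_status(lib_name="libcuda.so.1"):
    """Call cuInit(0) from the CUDA driver library and return (code, message).

    code is None when the driver library itself cannot be loaded.
    """
    try:
        libcuda = ctypes.CDLL(lib_name)
    except OSError as exc:
        return None, f"could not load {lib_name}: {exc}"

    # CUresult cuInit(unsigned int Flags); 0 (CUDA_SUCCESS) means success.
    libcuda.cuInit.argtypes = [ctypes.c_uint]
    libcuda.cuInit.restype = ctypes.c_int
    code = libcuda.cuInit(0)
    if code == 0:
        return 0, "CUDA_SUCCESS"

    # CUresult cuGetErrorString(CUresult error, const char **pStr)
    libcuda.cuGetErrorString.argtypes = [ctypes.c_int, ctypes.POINTER(ctypes.c_char_p)]
    libcuda.cuGetErrorString.restype = ctypes.c_int
    text = ctypes.c_char_p()
    if libcuda.cuGetErrorString(code, ctypes.byref(text)) == 0 and text.value:
        return code, text.value.decode()
    return code, "unknown error"


code, message = cuda_driver_init_status()
print(f"cuInit returned {code}: {message}")
```

A non-zero code here would mean the problem sits below PyTorch, in the driver or the GPU configuration.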

I am also attaching the file produced by the nvidia-bug-report.sh script.
nvidia-bug-report.log.gz (15.1 MB)

Could this be a HW malfunction or something else?

Thanks in advance for your help!

generix · July 9, 2022, 11:53am

The A30 has MIG-mode enabled so it can’t be used. Please use nvidia-smi to turn it off.
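
For reference, a rough sketch of how the check could be scripted. It assumes the nvidia-smi query field mig.mode.current and the -mig 0 option to disable MIG; the function names and the sample text are made up for illustration and are not output from this system. The script only prints the command it would run, so nothing is changed until you run it yourself.

```python
import csv
import io
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,name,mig.mode.current", "--format=csv,noheader"]


def query_mig_modes():
    """Run the nvidia-smi query and return its raw CSV text (needs a GPU host)."""
    return subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout


def parse_mig_modes(csv_text):
    """Parse the query output into (index, name, mig_mode) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) == 3:
            index, name, mode = (field.strip() for field in row)
            rows.append((int(index), name, mode))
    return rows


def disable_mig_commands(csv_text):
    """Build one `nvidia-smi -i <index> -mig 0` command per MIG-enabled GPU."""
    return [
        ["sudo", "nvidia-smi", "-i", str(index), "-mig", "0"]
        for index, _name, mode in parse_mig_modes(csv_text)
        if mode == "Enabled"
    ]


# Illustrative sample only; on a real system use query_mig_modes() instead.
sample = "0, NVIDIA A30, Enabled\n1, Tesla V100-PCIE-32GB, [N/A]\n"
for command in disable_mig_commands(sample):
    print(" ".join(command))  # prints: sudo nvidia-smi -i 0 -mig 0
```

The mode change may stay pending until the GPU is reset or the machine is rebooted.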

generix · July 9, 2022, 11:55am

In general, you should also enable nvidia-persistenced to start on boot.

demo.masouros · July 9, 2022, 2:29pm

After disabling MIG and rebooting the server, we can now run kernels.

Thanks a lot for your fast reply!

