Hello,
I have a Power machine with 4 GPU cards, which were all working fine until recently.
Now one of the cards (0004:05:00.0) is no longer detected by nvidia-smi, although lspci still lists all four.
# lspci | grep Tesla
0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0004:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
# nvidia-smi --query-gpu=index,gpu_bus_id,power.draw --format=csv
index, pci.bus_id, power.draw [W]
0, 00000004:04:00.0, 71.88 W
1, 00000035:03:00.0, 50.92 W
2, 00000035:04:00.0, 53.40 W
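To make it explicit which card is missing, the two listings above can be compared by PCI bus ID. This is only a minimal sketch in Python: it works on the output pasted as strings rather than running the commands, and the helper names `bus_ids` and `missing_gpus` are hypothetical, made up for this illustration. nvidia-smi prints an 8-digit PCI domain while lspci prints 4 digits, so the domain is normalized before comparing.

```python
import re

# Output pasted from the two commands above (not collected live).
LSPCI_OUTPUT = """\
0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0004:05:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:03:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
0035:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
"""

NVIDIA_SMI_OUTPUT = """\
index, pci.bus_id, power.draw [W]
0, 00000004:04:00.0, 71.88 W
1, 00000035:03:00.0, 50.92 W
2, 00000035:04:00.0, 53.40 W
"""

# Matches domain:bus:device.function with a 4- to 8-digit hex domain.
BUS_ID = re.compile(
    r"\b([0-9a-fA-F]{4,8}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])\b"
)


def bus_ids(text):
    """Return the set of PCI bus IDs in text, normalized to a 4-digit domain."""
    ids = set()
    for domain, bus, dev, fn in BUS_ID.findall(text):
        ids.add(f"{int(domain, 16):04x}:{bus.lower()}:{dev.lower()}.{fn}")
    return ids


def missing_gpus(lspci_text, smi_text):
    """Bus IDs that lspci reports but nvidia-smi does not."""
    return sorted(bus_ids(lspci_text) - bus_ids(smi_text))


print(missing_gpus(LSPCI_OUTPUT, NVIDIA_SMI_OUTPUT))
```

The only bus ID left over is 0004:05:00.0, which is the device whose information file is shown further down.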
# dmesg | grep NVRM
[7251924.555504] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[7251924.555559] NVRM: rm_init_adapter failed for device bearing minor number 1
[7251934.627039] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[7251934.627096] NVRM: rm_init_adapter failed for device bearing minor number 1
[7251944.412880] NVRM: RmInitAdapter failed! (0x26:0xffff:1114)
[7251944.412922] NVRM: rm_init_adapter failed for device bearing minor number 1
# cat /proc/driver/nvidia/gpus/0004\:05\:00.0/information
Model: Tesla V100-SXM2-16GB
IRQ: 149
GPU UUID: GPU-????????-????-????-????-????????????
Video BIOS: ??.??.??.??.??
Bus Type: PCIe
DMA Size: 64 bits
DMA Mask: 0xffffffffffffffff
Bus Location: 0004:05:00.0
Device Minor: 1
Can someone help me fix this issue?