A6000 is not recognized by nvidia-smi

The A6000 was working fine, but it stopped being recognized after I turned the server off.
When I replaced the A6000 with a GTX 1060, the GTX 1060 was recognized by nvidia-smi.

I have no idea what to do anymore.

~$ hostnamectl
Operating System: Ubuntu 20.04.5 LTS
Kernel: Linux 5.4.160-0504160-generic
Architecture: x86-64

~$ lspci -vv | grep -i nvidia
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1) (prog-if 00 [VGA controller])
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)

~$ dkms status
nvidia, 525.53, 5.4.160-0504160-generic, x86_64: installed
nvidia, 525.78.01, 5.15.0-60-generic, x86_64: installed
nvidia, 525.78.01, 5.4.160-0504160-generic, x86_64: built
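
Note that for the running kernel (5.4.160-0504160-generic) the dkms output lists 525.53 as installed and 525.78.01 as only built. A minimal sketch for pulling out just the entries for one kernel, assuming the comma-separated `dkms status` format shown above (other dkms versions print a different layout); `dkms_for_kernel` is a hypothetical helper name, not part of dkms:

```shell
# dkms_for_kernel KERNEL
# Reads `dkms status` output on stdin and prints "<driver version> <state>"
# for every nvidia entry that belongs to KERNEL.
# Assumes lines shaped like: "nvidia, 525.53, 5.4.160-0504160-generic, x86_64: installed"
dkms_for_kernel() {
    awk -F', *' -v k="$1" '
        $1 == "nvidia" && $3 == k {
            split($4, a, ": *")   # "x86_64: installed" -> a[1]="x86_64", a[2]="installed"
            print $2, a[2]
        }'
}
```

On the affected machine this could be run as `dkms status | dkms_for_kernel "$(uname -r)"`. While the module is loaded, `cat /proc/driver/nvidia/version` shows which version is actually in use.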

~$ sudo lshw -C display
  *-display UNCLAIMED
       description: VGA compatible controller
       product: ASPEED Graphics Family
       vendor: ASPEED Technology, Inc.
       physical id: 0
       bus info: pci@0000:2a:00.0
       version: 41
       width: 32 bits
       clock: 33MHz
       capabilities: pm msi vga_controller cap_list
       configuration: latency=0
       resources: memory:c0000000-c3ffffff memory:c4000000-c401ffff ioport:1000(size=128) memory:c0000-dffff
  *-display
       description: VGA compatible controller
       product: GA102GL [RTX A6000]
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:41:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm vga_controller bus_master cap_list rom
       configuration: driver=nvidia latency=0
       resources: iomemory:1040-103f iomemory:1040-103f irq:196 memory:f0000000-f0ffffff memory:10450000000-1045fffffff memory:104de000000-104dfffffff ioport:4000(size=128) memory:f1000000-f107ffff memory:f1080000-f203ffff memory:10060000000-1044fffffff memory:10460000000-104ddffffff

nvidia-bug-report.log (2.7 MB)

[  362.723310] NVRM: GPU 0000:41:00.0: Failed to copy vbios to system memory.
[  362.723384] NVRM: GPU 0000:41:00.0: RmInitAdapter failed! (0x30:0xffff:978)
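
The two lines above are the relevant NVRM messages from the kernel log. A small sketch for extracting such lines on your own system, assuming the bracketed-timestamp format that `dmesg` prints by default; `nvrm_errors` is a hypothetical helper name:

```shell
# nvrm_errors
# Reads kernel log text on stdin and prints only the NVRM messages,
# with the leading "[ seconds.micros]" timestamp removed.
# Lines without a bracketed timestamp are skipped (assumption about the log format).
nvrm_errors() {
    sed -n 's/^\[[^]]*\] *\(NVRM: .*\)$/\1/p'
}
```

Typical use would be `sudo dmesg | nvrm_errors`, which makes it easier to compare the messages before and after a driver change.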

Please check whether this is a driver bug by downgrading to the v470 driver. If that doesn’t help, the card itself is probably defective.

Thank you, I’ll try that and see if it works.