Computer requirements for A100

Hi,

We obtained an A100 and spent the whole day yesterday trying to install it in our servers, without success.

  • Server 1: When the GPU is plugged in, the server boots but does not respond to ping or ssh, and shows no graphics output. We have tried different PCIe slots, both with and without a small GPU plugged in for graphics. The server only works when the A100 is not plugged in or has no power connection. This server has PCIe 4 x16 and runs the latest Linux Mint. It previously worked without problems with an RTX3080, which was removed some time ago for use in another server. These are the specifications:
    BOX RACK 19" 4U 406N-USB3 S/F TOOQ RACK-406N-USB3
    GIGABYTE MODULAR
    POWER SOURCE P1000GM 80 PLUS GP-AP1000GM-EU
    AMD RYZEN 7 5800X AM4 100-100000063WOF
    MOTHERBOARD X570 AORUS ULTRA GIGABYTE GA9AX57AUTR-00-10
    2 x DDR4 32 GB 3600 MHz HyperX HX436C18FB3A/32
    1 TB SSD SERIES 860 PRO SAMSUNG MZ-76P1T0B/EU 1
    SCYTHE FUN UNIVERSAL FUMA 2 SCFM-2000 1

  • Server 2: It already has a Titan Xp, has PCIe gen 3, and runs Ubuntu 16.04. When the A100 is plugged in, the server boots but the GPU does not appear in nvidia-smi. We updated the driver to the latest available through apt-get (CUDA 11.3, nvidia465), but the A100 is still not found. At first deviceQuery shows only the Titan Xp; after a while it returns error code 101 → invalid device ordinal. Running dmesg | grep NVRM shows:

[    1.395577] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.19.01  Fri Mar 19 07:44:41 UTC 2021
[    3.645960] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
... (more lines like these)
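
In case it helps anyone reproduce this, here is a minimal Python sketch for pulling the failing PCI addresses out of the dmesg output. It only assumes that the failure lines have the same shape as the one pasted above; the function name find_init_failures and the regular expression are my own and not part of any NVIDIA tool.

```python
import re

# Matches kernel log lines shaped like the one quoted in this post, e.g.
# [    3.645960] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
# Assumption: other driver versions may word this message differently.
FAILURE_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+NVRM: GPU "
    r"(?P<pci>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]):"
    r"\s+RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_init_failures(dmesg_text):
    """Return a list of (timestamp, pci_address, code) tuples,
    one per RmInitAdapter failure line found in the text."""
    failures = []
    for line in dmesg_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            failures.append(
                (float(match["time"]), match["pci"], match["code"])
            )
    return failures


# Demo on the two lines quoted above (not a full log).
sample_log = (
    "[    1.395577] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  "
    "465.19.01  Fri Mar 19 07:44:41 UTC 2021\n"
    "[    3.645960] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! "
    "(0x25:0xffff:1214)\n"
)

for timestamp, pci_address, code in find_init_failures(sample_log):
    print(f"{timestamp}s  GPU {pci_address}  RmInitAdapter failed ({code})")
```

The PCI address it reports can then be compared against the lspci listing (for example lspci -s 01:00.0) to check which physical card the failing adapter is.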

My questions are:

  1. Do we need PCIe gen 4 x16 to support the A100, or can a server with PCIe gen 3 work?
  2. Do we need to connect two PCIe 8-pin connectors from our power supply to the GPU's adapter? We tried with only one and it didn’t work, so I assume both connectors are needed to provide enough power. Our power supplies were sufficient for an RTX3080 on server 1, and for a Titan Xp and a GTX1070 on server 2.
  3. Do we need a data center driver to support the A100, or is the latest driver from CUDA 11 enough?

Thank you very much,

Best regards