We obtained an A100 and spent the whole day yesterday trying to install it in our servers, without success.

  • Server 1: when the GPU is plugged in, the server boots but does not respond to ping or ssh, and shows no graphics output. We have tried different PCIe slots, with and without a small GPU plugged in for graphics. The server only works when the A100 is not plugged in or has no power connection. This server has PCIe 4.0 x16, runs the latest Linux Mint, and worked without problems with an RTX 3080 (which was removed some time ago for another server). These are the specifications:
    BOX RACK 19″ 4U 406N-USB3 S/F TOOQ RACK-406N-USB3
    AMD RYZEN 7 5800X AM4 100-100000063WOF
    2 x DDR4 32 GB 3600 MHz HyperX HX436C18FB3A/32

  • Server 2: it already has a Titan Xp and PCIe gen 3, and runs Ubuntu 16.04. When the A100 is plugged in, the server boots but the GPU does not appear in nvidia-smi. We updated the driver to the latest available through apt-get (CUDA 11.3, nvidia465), but the A100 is still not found. deviceQuery shows only the Titan Xp and, after a while, returns error code 101 (invalid device ordinal). Running dmesg | grep NVRM shows:

[    1.395577] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  465.19.01  Fri Mar 19 07:44:41 UTC 2021
[    3.645960] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
... (more lines like these)
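
For anyone comparing against similar logs, here is a minimal sketch of how the failing adapters could be pulled out of the kernel log. The function names find_failed_adapters and read_dmesg are made up for illustration, and the regular expression only assumes the line format shown above; it does not interpret the error code.

```python
import re
import subprocess

# Matches kernel log lines of the form shown above, e.g.
#   NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0xffff:1214)
# Assumption: the PCI address is always domain:bus:device.function in hex.
_FAILED_INIT = re.compile(
    r"NVRM: GPU "
    r"(?P<addr>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F]): "
    r"RmInitAdapter failed! \((?P<code>[^)]*)\)"
)


def find_failed_adapters(dmesg_text):
    """Return unique (pci_address, error_code) pairs in first-seen order."""
    seen = []
    for match in _FAILED_INIT.finditer(dmesg_text):
        pair = (match.group("addr"), match.group("code"))
        if pair not in seen:
            seen.append(pair)
    return seen


def read_dmesg():
    """Return the kernel log as text (hypothetical helper; may need root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling find_failed_adapters(read_dmesg()) gives the PCI addresses that failed to initialise, which can then be compared with the slot the A100 is plugged into (for example in the output of lspci).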

My questions are:

  1. Do we need PCIe gen 4 x16 to support the A100, or can a server with PCIe gen 3 work?
  2. Do we need to connect two PCIe 8-pin connectors from our power supply to the GPU's adapter? We tried only one and it didn't work, so I assume both connectors are needed to provide enough power. Our power supplies were sufficient for an RTX 3080 on server 1, and for a Titan Xp and a GTX 1070 on server 2.
  3. Do we need a data center driver to support the A100, or is the latest driver from CUDA 11 enough?

I expect PCIe gen 3 should work. This card is rated at 250 W, similar to the Titan Xp, so I expect both power connectors are a must.
As for the driver, you need to look for one that explicitly calls out A100 support, which is likely to be only the data center drivers.

Hi. Thank you very much for your response, no worries about the delay.

We found the problem: the A100 has passive cooling and requires specific hardware in the server. We plugged the card into other servers, and the newest one (with PCIe 4) was able to detect it. However, after a while, the card disappeared. When monitoring the temperature, we could see that it kept rising even while the card was idle, until it reached 100 °C and the driver shut the card down.
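
The temperature monitoring described above can be sketched roughly as follows, assuming nvidia-smi is on the PATH and the card is still visible to the driver. The function names and the 85 °C warning limit are illustrative choices, not values taken from NVIDIA documentation.

```python
import subprocess
import time

# nvidia-smi query that prints one CSV line per GPU: index, name, temperature
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,temperature.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_temps(csv_text):
    """Parse nvidia-smi CSV output into (index, name, temp_c) tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        index, rest = line.split(",", 1)
        name, temp = rest.rsplit(",", 1)
        temp = temp.strip()
        # Skip GPUs that report no numeric temperature (e.g. "[N/A]").
        if temp.isdigit():
            readings.append((int(index), name.strip(), int(temp)))
    return readings


def hot_gpus(readings, limit_c=85):
    """Return the readings at or above limit_c (assumed warning limit)."""
    return [reading for reading in readings if reading[2] >= limit_c]


def watch_temperatures(limit_c=85, interval_s=5.0):
    """Poll nvidia-smi forever and print a warning for each hot GPU."""
    while True:
        result = subprocess.run(
            QUERY, capture_output=True, text=True, check=True
        )
        for index, name, temp in hot_gpus(parse_gpu_temps(result.stdout), limit_c):
            print(f"GPU {index} ({name}) is at {temp} C (limit {limit_c} C)")
        time.sleep(interval_s)
```

Running watch_temperatures() in a terminal while the card sits idle should show whether the temperature keeps climbing, as it did in our case, before the card drops off the bus.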

After some investigation, we figured out that it requires a qualified server, e.g. one listed in the NVIDIA Qualified System Catalog.

We were able to purchase a new qualified server with some effort, and we are still waiting for it to arrive (in the academic world it is not easy to purchase new hardware smoothly). It was a long path because we didn't know anything about these specific requirements.

I hope my comment helps others who are acquiring an A100 or a similar card: please check the cooling system, and if it is passive, bear in mind that you will need a qualified server.

Best regards

