A100 PCIe isn't recognized by BIOS

Here is the EKWB water block for A100 40GB:

Is there any update on this discussion?
I’m also trying to run an A100 on a Z690 motherboard, but I don’t want to give up PCIe lanes to an additional GPU.
Any other options would be welcome, for example BIOS settings, an external GPU, or anything else.
Please help me!

Good news. I have a working cooling solution from Bykski - it has some issues but it works.

So 4 months later I have a working and well cooled A100 PCIe in my desktop - Yay!

I made a post documenting all the details here:

I really appreciate your support, @user152593 and @ScottEllis!

Hi @stas3, I’ve made a build mostly following your build.

We’ve got it to POST, using a 1050 Ti as a second GPU, and managed to install Ubuntu 20.04, but it doesn’t detect the A100. Can you share how you installed the drivers etc.? Was everything just detected out of the box?

I only see a couple of obvious differences so far.

BIOS

IGPU Multi-Monitor: Disabled

I can’t find this option anywhere in the BIOS. I’m on BIOS version 1402.

Would you mind sharing your full mobo settings? With a USB drive formatted as FAT, you can save a txt file with your settings from the BIOS, under

Tool > Profile > Load/Save profile from/to USB

I’d very much appreciate it.

Cooling

At the moment we’ve just slapped a 3000 rpm Noctua fan onto a duct (see Linus Tech Tips), and we can definitely feel the heat venting out of it.
According to this post, the A100 can be quite sensitive to overheating and will shut itself off if it gets too hot… did you have any problems before getting your water-cooling system set up?
The CPU/mobo temperatures all seem quite reasonable, so I’d be surprised if this were the issue, but…


My BIOS settings, in case you’re curious (screenshots + txt file)

Hi @sheim

Here is a working saved profile - for some reason mine isn’t text but in a binary format:
2022-03-30-working.CMO (29.5 KB)

I think some of the BIOS options appear/disappear when you turn other options on/off. I spent so much time trying many different combinations that now I don’t remember when this option appeared.

Perhaps changing: Primary Display [Auto] to something else might reveal new options?

The MOBO firmware version is 1402

The NVIDIA software is: Driver Version: 510.47.03 CUDA Version: 11.6

I initially made it work on Ubuntu 20.04, but later I had various issues and switched to 21.10 (22.04 is probably a better option now), and I also moved the kernel to 5.15 (mainline).

CUDA version shouldn’t matter as long as it’s 11.x - I started with an earlier version and then recently updated to 11.6.
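
For anyone who wants to confirm that the driver actually sees the card, here is a minimal sketch (not taken from either build above) that asks `nvidia-smi` for each GPU's name and driver version. It assumes `nvidia-smi` is on the PATH; the helper names `parse_gpu_rows` and `list_gpus` are made up for illustration.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse `nvidia-smi --format=csv,noheader` output into (name, driver) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if "," not in line:
            continue  # skip anything that is not a "name, driver" row
        name, driver = [field.strip() for field in line.split(",", 1)]
        rows.append((name, driver))
    return rows


def list_gpus():
    """Return the GPUs the driver can see, or [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_gpu_rows(out)


if __name__ == "__main__":
    gpus = list_gpus()
    if not gpus:
        print("No GPUs reported (driver missing, or card not detected).")
    for name, driver in gpus:
        print(f"{name}: driver {driver}")
```

If the A100 is missing from this list but the second GPU shows up, the problem is more likely on the BIOS/PCIe side than in the driver install.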


Cooling - I haven’t used the A100 with its original passive radiator, other than to check that it was detected and to run a very basic test. It was getting hot really fast, so I didn’t use it until I got water cooling figured out.

As you say, passive cooling should be enough for the card to be detected.


Is it possible that your PSU doesn’t have enough power to drive the A100? I’m using a 1200W PSU with a 1070Ti and the A100.

Perhaps the order of the cards in the PCIe slots matters? Maybe try switching them around.

Please let me know if I missed anything and you need some additional info.

Thanks @stas3, tremendously helpful. I just took your CMO and loaded it, and it worked like a charm.
I did turn the wi-fi card back on, adjust the fan speeds, and enable RAM overclocking (just to use the full RAM speed).

In case this is helpful to others, here’s our experience with it. We’re planning to use this primarily for deep RL using isaacGym. Currently we’re air-cooling the GPU using this vent from Linus Tech Tips - it’s not a great vent, but it gets the job done.

Full build: (stas3 means same as @stas3)

  • Asus ROG Maximus XIII Hero Z590, LGA1200 stas3
  • Corsair 7000D ATX PC case stas3
  • Corsair HX 1200 Watt PSU stas3 → this was difficult to get into the (above) case without removing the 3.5" HDD bays. Luckily we didn’t need them, and then there is plenty of space.
  • Corsair Vengeance LPX 64GB (4x 32GB) DDR4 → we noticed isaacGym uses a surprising amount of RAM
  • Samsung 970 EVO Plus SSD 2TB - M.2 NVMe
  • Intel Core i9-11900K CPU. NOTE: must be an 11th-gen CPU to support PCIe 4.0
  • Noctua NF-F12 IPPC 3000 PWM, 120mm fan for GPU vent
  • lots more Noctua fans (4x 140mm, 3x 120mm). This is probably overkill.
  • Corsair iCUE H150i Liquid CPU Cooler
  • GTX 1050ti GPU that we had lying around
  • Bonus: get a wheel stand for the case. It’s a big, heavy build, and moving it around is a pain.

The oversized case is nice: it makes the build a bit easier, lets us get enough airflow through, and above all leaves enough space for the 3D-printed vent, since we’re currently air-cooling the GPU. We also didn’t expect to have much load on the CPU, but decided to water-cool it mainly to avoid extra heat in the case.

We used PCIEX16_1 for the A100 and PCIEX16_3 for the 1050 Ti, and put the SSD in M.2_3.
This is because PCIEX16_1, PCIEX16_2, M.2_1, and M.2_2 share bandwidth (see manual). Might be negligible, but ¯_(ツ)_/¯.
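
One way to see whether slot sharing actually costs anything is to read the negotiated PCIe link from `nvidia-smi`. The sketch below is only illustrative: the function names are hypothetical, and the per-lane figures are the usual approximate per-direction rates for PCIe 1.0 to 4.0, not measurements from this build.

```python
import shutil
import subprocess

# Approximate usable bandwidth per lane, per direction, in GB/s.
GBS_PER_LANE = {1: 0.25, 2: 0.5, 3: 0.985, 4: 1.969}


def approx_bandwidth_gbs(gen, width):
    """Rough per-direction bandwidth for a link of the given generation and lane count."""
    return GBS_PER_LANE[gen] * width


def parse_link_rows(csv_text):
    """Parse 'name, gen, width' rows into (name, gen, width) tuples, skipping bad rows."""
    rows = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


def query_links():
    """Ask nvidia-smi for the current link generation and width of every GPU."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,pcie.link.gen.current,pcie.link.width.current",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    ).stdout
    return parse_link_rows(out)


if __name__ == "__main__":
    for name, gen, width in query_links():
        if gen in GBS_PER_LANE:
            print(f"{name}: Gen{gen} x{width}, ~{approx_bandwidth_gbs(gen, width):.1f} GB/s")
        else:
            print(f"{name}: Gen{gen} x{width}")
```

Some GPUs drop the link generation at idle to save power, so it is worth checking while a job is running.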

We tested this running 9 jobs of isaacGym, with a total of roughly 200k environments, to fill up the GPU memory. FWIW, it’s not really faster than an RTX 3090 (or 3080 Ti), but the oodles of extra memory just allow you to run many more jobs at a time. At full load, with the fans going at ~80% (not even full blast, which was super noisy), we got a pretty stable 52C on the GPU. We might look into liquid-cooling eventually, but it doesn’t seem necessary.
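
For keeping an eye on temperatures during a stress test like this, a small polling loop around `nvidia-smi` is usually enough. This is a rough sketch rather than the script used for the numbers above; `parse_sample`, `read_samples`, and the 5 x 2 s polling schedule are arbitrary choices for illustration.

```python
import shutil
import subprocess
import time

QUERY = "index,temperature.gpu,utilization.gpu,memory.used"


def parse_sample(line):
    """Turn one 'index, temp, util, mem' row (csv,noheader,nounits) into a dict of ints."""
    index, temp_c, util_pct, mem_mib = (int(f.strip()) for f in line.split(","))
    return {"index": index, "temp_c": temp_c, "util_pct": util_pct, "mem_mib": mem_mib}


def read_samples():
    """Read one sample per GPU; rows with non-numeric fields (e.g. [N/A]) are skipped."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    ).stdout
    samples = []
    for line in out.strip().splitlines():
        try:
            samples.append(parse_sample(line))
        except ValueError:
            continue
    return samples


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is not None:
        for _ in range(5):  # five samples, two seconds apart
            for s in read_samples():
                print(f"GPU{s['index']}: {s['temp_c']} C, "
                      f"{s['util_pct']}% util, {s['mem_mib']} MiB used")
            time.sleep(2)
```

Logging this alongside the fan speed makes it easier to tell whether a given duct and fan setting really holds a stable temperature under sustained load.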

Curiously, on this build, each job took up one CPU thread to 100%, whereas on a 3090 box, the CPU load is well distributed across threads. This might be an isaacGym installation issue, where it isn’t only using CUDA, but I haven’t debugged it yet (so far just stress-testing to see that temps are okay).
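
To put numbers on the "one thread pinned at 100%" observation, per-core utilization can be computed from two snapshots of `/proc/stat` on Linux. The following is a minimal sketch under that assumption; the function names are invented, and it treats `idle + iowait` as idle time.

```python
import os
import time


def parse_proc_stat(text):
    """Map 'cpuN' -> (idle_ticks, total_ticks) from the contents of /proc/stat."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only per-core lines (cpu0, cpu1, ...), not the aggregate 'cpu' line.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        ticks = [int(x) for x in parts[1:9]]  # user nice system idle iowait irq softirq steal
        idle = ticks[3] + ticks[4]
        cores[parts[0]] = (idle, sum(ticks))
    return cores


def core_utilization(before, after):
    """Fraction of time each core was busy between two parse_proc_stat() snapshots."""
    util = {}
    for name, (idle1, total1) in after.items():
        idle0, total0 = before.get(name, (0, 0))
        d_total = total1 - total0
        if d_total > 0:
            util[name] = 1.0 - (idle1 - idle0) / d_total
    return util


if __name__ == "__main__":
    if os.path.exists("/proc/stat"):
        with open("/proc/stat") as f:
            first = parse_proc_stat(f.read())
        time.sleep(1)
        with open("/proc/stat") as f:
            second = parse_proc_stat(f.read())
        for name, u in sorted(core_utilization(first, second).items()):
            print(f"{name}: {u * 100:.0f}%")
```

Running this while the jobs are active on both machines would show whether the load really sits on a handful of cores on one box and is spread out on the other.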

Super! Great to hear you got it working, @sheim

Thank you for sharing the details of your setup. Mine is quite similar. And yes, Corsair HX 1200 Watt is super long. Thermaltake 1200W is shorter if someone doesn’t have the space.

And yes, the RTX 3090 is faster than the A100 - it has a faster clock.
I was able to reproduce the difference with several benchmarks: [Benchmark] HF Trainer on A100 · Issue #15026 · huggingface/transformers · GitHub

Hi ScottEllis, I have an HP Z8 G4 with an A100 GPU installed in it, and I can’t find any “Above 4G Decoding” option. Kindly help.