A100 PCIe isn't recognized by BIOS

Hmm, @user152593 , I think unfortunately you’re hitting the pitfalls of running datacenter cards with a large BAR1 space in consumer hosts that aren’t really set up for that. MMIO settings generally are not exposed, so no shock you can’t find them in the Z690-G BIOS. The behavior of “see the card, but then complain about it when you try to use it” is symptomatic of the BIOS giving up on allocating resources, booting anyway, and then the driver reporting the problem to the OS when you try to use it. I’d imagine if this were Linux you’d see no address range assigned to BAR1 on the card…so you’re out of luck. You would probably have better odds with a “workstation” motherboard (like https://www.asus.com/us/Motherboards-Components/Motherboards/Workstation/X99-E-10G-WS/ ) which has more comprehensive PCIe capabilities, including a PCIe switch, etc. Look for more PCIe lanes that hang off the CPU, or an explicit mention of a PCIe switch, as a clue that the system might be more suitable.
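On Linux, the missing BAR1 assignment described here would show up in `lspci -v` output as unassigned regions. A minimal sketch of checking for that - the sample listing below is fabricated for illustration, not captured from real hardware:

```python
# Hypothetical check: a BAR the firmware failed to map shows up in
# `lspci -v` as "<unassigned>" (or "<ignored>"). The sample text is made up.
SAMPLE_LSPCI = """\
01:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB]
    Memory at a0000000 (32-bit, non-prefetchable) [size=16M]
    Memory at <unassigned> (64-bit, prefetchable) [size=128G]
    Memory at <unassigned> (64-bit, prefetchable) [size=32M]
"""

def unmapped_bars(lspci_text: str) -> list[str]:
    """Return the lines describing BARs the firmware failed to assign."""
    return [line.strip() for line in lspci_text.splitlines()
            if "<unassigned>" in line or "<ignored>" in line]

for bar in unmapped_bars(SAMPLE_LSPCI):
    print("unmapped:", bar)  # each BAR the BIOS gave up on
```

On a real system you would feed in the output of `lspci -v -s <slot>` for the A100, and also check `dmesg` for BAR allocation failures.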

Practically, even if you do get it working, you’re likely to run into thermal issues anyway - the A100 PCIe is a passive card that’s really meant to be cooled by chassis fans in a server that can duct air through it. Cards like the RTX A6000 include cooling that’s more appropriate for workstation/desktop environments.

I realize it’s possibly overkill, but we built the DGX Station A100 to help provide a solution to this issue. Figuring out the PCIe and cooling bits, and still having enough useful “stuff” to be a workstation is pretty complex.

Sorry, no easy answer for you there.

ScottE


Hi Scott,

Thanks for the quick reply. For cooling we are doing this: Nvidia A100 GPU | Quick water block installation process. - YouTube

What NVIDIA is missing is an eGPU enclosure for a single card that is portable and not designed for “office” use. We have DGX PODs and the workstations as well, but we need a powerful mobile field solution and wanted the best performance rather than resorting to gaming GPUs.

I will try another MB and see what happens! Thanks for your help.

EJK

I’m sad to hear the one you tried didn’t work; I was hoping that my current MB is just too old and that more modern MBs would just work. If you find a working MB, please share which.

re: fast mobile solution - FWIW, in my benchmarking of machine learning training (pytorch / HF Transformers) with mixed precision, the RTX 3090 was actually faster if you limited the A100 to the same memory size to compare the same setup (due to the faster clock). So for example a 2x RTX 3090 setup is likely to provide faster overall performance than 1x A100 40GB at a much lower cost and with no hardware issues - with an additional electricity cost to feed 2 cards. Mind you, I only benchmarked some of the functionality (full + half precision computations), so I don’t know if my findings apply to other features provided by the cards.

Of course, to compete with the A100 80GB it’d be much more complicated, as you’d need 4x RTX 3090 cards. But it’s still probably a cheaper solution unless you need the 80GB in one chunk and can’t parallelize your compute needs over 4x 24GB. And there will be no MIG and other goodies the A100 provides, so this is probably not a good substitute.
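The memory arithmetic above is just ceiling division; a tiny helper to make the assumption explicit (24 GB per RTX 3090 is the only input - cost and power are ignored, and it assumes the workload can be sharded across cards):

```python
import math

def cards_needed(a100_mem_gb: int, consumer_mem_gb: int = 24) -> int:
    """Number of consumer cards whose combined VRAM covers one A100,
    assuming the workload parallelizes cleanly across them."""
    return math.ceil(a100_mem_gb / consumer_mem_gb)

print(cards_needed(40))  # 2 -> 2x RTX 3090 to match an A100 40GB
print(cards_needed(80))  # 4 -> 4x RTX 3090 to match an A100 80GB
```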

@user152593, are you referring to this youtube video? Insane benchmarks🔥 Intel Core i9-12900 | DDR5 | A100 | ASUS ProArt Z690 - YouTube

there it says ASUS ProArt Z690-CREATOR WIFI

If you’re referring to a different one could you please share which?

Hi stas3,

No, it’s a different video, but thanks for sharing that one, because they say they failed due to using the integrated video: the output signal was not recognized when the A100 was plugged in. So now I am going to try putting a cheap video card in there, because I suspect that ASUS thinks this is a video card when it’s not, and is perhaps halting because it thinks it can’t display anything. I thought I had changed a setting in the BIOS to prevent this, but now I have another avenue to pursue! Thanks for your comments!!!

Hello everyone,

I finally got my A100 80GB card working with an ASUS ROG STRIX Z690-G GAMING WIFI. Just a quick recap: the motherboard would not POST and kept halting with the white VGA Q-LED, meaning something is wrong with your graphics card or that none is present. I have an Intel Core i9-12900HK, which has integrated graphics, and the back of the motherboard has an HDMI connector “connected” directly to the integrated graphics. Even though in the BIOS I explicitly stated to use the integrated graphics, whenever the A100 was present the system would not boot. If I disconnected the A100 power, no issues at all.

So a friend of mine lent me an NVIDIA 1080 card and I put that in just to see what would happen. Same problem. However, when I moved the HDMI cable from the motherboard HDMI connector to the 1080 connector with the monitor on, then voila! I was able to boot without issue and everything works fine. Once booted I can move the cable back over. I will log a bug with ASUS and see what they say, because right now I am losing a slot which I need (this board only has 3 PCIe slots).

If anyone has any idea given this new information as to what else I can enable or disable in the BIOS to make this work without another graphics card, I would truly appreciate it.

The other thing to note is that you cannot really run this card as is unless you install crazy fans. You need to remove the stock heatsink and put a water cooling block on it (which I am doing next week). For some strange reason, these cards run very hot even when idling.

EJK



@user152593, Thank you very much for posting your success story

I got a new MB: ROG Maximus XIII Hero (z590), but still struggling to get the card recognized.

I tried to follow your recipe and was able to POST, but the card is still not recognized.

As you said, trying to use the iGPU (CPU Graphics) and the A100 leads to the system not POSTing (d4 - PCI resource allocation error. Out of Resources).

So I did as you said - inserted another old nvidia card and connected it to the monitor. Now it POSTs and boots just fine, but still not seeing A100.

I also tried changing the order of the cards (A100 2nd) - but no change in the outcome. Is your A100 first or 2nd?

This new MB’s BIOS has too many options so it’s a bit hard to know if I have missed some option.

Would it be possible for you to share which BIOS configs you have enabled/disabled?

Also are you connecting 2 independent rails to A100 via the y-connector that came with it? This is what I did.

I also read that one can go into the UEFI Shell and try to figure out what’s wrong there. At least for the D4 situation.

Thank you!

I played some more with BIOS settings and I succeeded using your recipe, @user152593 ! nvidia-smi reports the A100 80GB as detected. I still need to test that it actually works, but this is a huge step, so I’m very grateful for your sharing.

And I need to go over BIOS and make sure to log all the settings that lead to it working - as I have tried so many different combinations before it worked.

So now need to quickly test that it works and then wait for the cooling block as it heats up really fast.

I will share all the details once I have them written down.

Of course, it’d be great to have iGPU work with this, but perhaps some future BIOS update will fix it.
If you’re posting a bug report somewhere in ASUS support, please share the link here so that I can second your report. Thank you!

I am still going back and forth with ASUS to see how we can get the A100 working without the need for a PCIe graphics card. I told them that most likely it’s a bug in the BIOS: when it finds no HDMI connector or connection on “a” PCIe graphics card (which the A100 isn’t), it tries to look for another PCIe graphics card, and the logic for some reason skips the PEG/onboard graphics settings. The if-then-else logic should be:

1. Do I have integrated graphics? Yes.
2. Is the BIOS graphics set to PEG? Yes.
3. Is the HDMI cable or display port plugged in to a powered monitor for PEG? Yes.
4. Then it does not matter what GPUs I have with respect to VGA POSTing. Just keep going.
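The steps above, restated as a toy function - all names here are hypothetical, and real firmware is obviously not Python; the point is only that a compute-only card shouldn’t fail the VGA check when a chosen display path already works:

```python
def vga_post_should_pass(has_integrated_graphics: bool,
                         primary_display_selected: bool,
                         monitor_connected: bool) -> bool:
    """Steps 1-3 of the list above; when all three hold, step 4 says
    keep booting regardless of any compute-only PCIe cards (like an A100)."""
    return (has_integrated_graphics
            and primary_display_selected
            and monitor_connected)

print(vga_post_should_pass(True, True, True))   # True  -> keep going
print(vga_post_should_pass(True, True, False))  # False -> halt, VGA Q-LED
```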

For the eGPU, what I have discovered is that the reason it works on my Dell (briefly) is that the Thunderbolt controller there allows for monitors over Thunderbolt, whereas the native Thunderbolt on my ASUS does not: it expects you to use the HDMI connector and does not support displays over Thunderbolt. So I have purchased and am waiting on a Thunderbolt PCIe card from ASUS that does support the display protocol. Which again tells me that somehow hardware vendors are not talking to each other. The A100 is NOT a GPU that supports displays; it should be treated as a PCIe compute card or something else. The BIOS should create a new category, or simply ignore cards that only provide compute. Perhaps that’s an oversimplification, or it’s a slightly different type of problem, but the bottom line is the BIOS should ignore all compute cards with respect to VGA, graphics or display tests. We might be paying the price for compute cards having emerged first and foremost as graphics cards for video games.

On another note, I haven’t watched the whole thing, but someone said to have a look at this guy here as well for BIOS settings. It seems the crypto-miners have spent a lot of time debugging this stuff, because they don’t want video but rather as much compute as possible.


thanks a lot for the details and the video, @user152593 - how could we join efforts in communicating with ASUS support? Surely more users asking should carry more weight, no?

and your logic makes total sense, especially wrt A100 not really being a video card anymore.

OK, so for posterity, here is my recipe for getting the NVIDIA A100 PCIe to work on a “ROG Maximus XIII Hero” without being able to use the iGPU, i.e. needing to insert another PCIe video card to be used with the monitor.

BIOS setup:

Advanced:

  Advanced System Agent (SA) configuration

    Graphics Configuration:
      Primary Display: Auto (probably could be set to PEG)
      IGPU Multi-Monitor: Disabled

    Memory Configuration:
      Memory Remap: Enabled (above 4GB)

    PCI Subsystem Settings
      Above 4G Decoding: Enabled
      Resize Bar: Enabled
      SR-IOV Support: Enabled

And the reason it wasn’t working originally is that by default it had SR-IOV Support: Disabled.

Another side effect of the extra card is that A100 appears as a 2nd gpu in nvidia-smi.
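Since the A100 enumerates behind the display card, jobs may land on the wrong GPU by default. A hedged sketch for picking the A100’s index from `nvidia-smi -L`-style output and pinning it via `CUDA_VISIBLE_DEVICES` - the sample listing is illustrative, not from a real box:

```python
import re

# Illustrative `nvidia-smi -L` output: display card first, A100 second.
SAMPLE_LIST = """\
GPU 0: NVIDIA GeForce GTX 1070 Ti (UUID: GPU-aaaa)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-bbbb)
"""

def gpu_index(listing: str, name_fragment: str) -> int:
    """Return the index of the first GPU whose name contains the fragment."""
    for line in listing.splitlines():
        m = re.match(r"GPU (\d+): (.+?) \(UUID", line)
        if m and name_fragment in m.group(2):
            return int(m.group(1))
    raise ValueError(f"no GPU matching {name_fragment!r}")

idx = gpu_index(SAMPLE_LIST, "A100")
print(f"export CUDA_VISIBLE_DEVICES={idx}")  # pin jobs to the A100
```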

So now waiting for the EKWB water cooling block to arrive.

BTW, I found at least 3 vendors selling the blocks at the moment and here is some other research I have compiled on cooling A100s:

  1. Air cooling
  2. Water cooling

There are at least 3 manufacturers of water blocks for A100:


FYI, the A100 water block sold by EKWB is not compatible with A100 80GB - it can only work with the 40GB version.

As you can see the 80GB model has a metal frame added around the gpu chip which wasn’t there in the 40GB version:

If one has CNC machinery, one could probably mill a square groove in the 40GB water block to make it work. I don’t know if there are any other incompatibilities besides this one.


Wow, so the smooth surface of the water block extends past the chip area? Or is it some other part of the block that interferes?

Yes, for the 40GB version it’s flat, you can see that the 40GB one is very different around the main chip and requires no groove:

The image is from Bykski GPU Block , For NVIDIA TESLA A100 40GB , Full Cover Liquid Cool – FormulaMod

Here is the EKWB water block for A100 40GB:

Is there any update on this discussion?
I’m also trying to run an A100 on a Z690 motherboard, but I don’t want to waste a PCIe slot on an additional GPU.
Any options can be considered - for example, BIOS settings, an external GPU, or anything else.
Please help me!

Good news. I have a working cooling solution from Bykski - it has some issues but it works.

So 4 months later I have a working and well cooled A100 PCIe in my desktop - Yay!

I made a post documenting all the details here:

I super appreciate your support, @user152593 and @ScottEllis!


Hi @stas3, I’ve made a build mostly following your build.

We’ve got it to POST, using a 1050ti as a 2nd GPU, and managed to install Ubuntu 20.04, but it doesn’t detect the A100. Can you share how you installed the drivers etc.? Did it just detect everything out of the box?

I only see a couple of obvious differences so far.

BIOS

IGPU Multi-Monitor: Disabled

I don’t find this option anywhere in the BIOS. I’m on BIOS version 1402.

Would you mind sharing your full mobo settings? With a usb formatted to FAT, you can save a txt file with your settings in the BIOS, under

Tool > Profile > Load/Save profile from/to USB

I’d very much appreciate it.

Cooling

At the moment we just slapped on a 3000 rpm noctua fan to a duct (see Linus Tech Tips), and we can definitely feel the heat venting out of it.
According to this post, the A100 can be quite sensitive to overheating and auto-shutoff if it gets too hot… did you have any problems before getting your water-cooling system set up?
The CPU/mobo temperatures all seem quite reasonable, so I’d be surprised if this were the issue, but…
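On the auto-shutoff worry, a simple watchdog sketch: poll `nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader` periodically and warn above a threshold. The 85C limit here is my assumption, not an NVIDIA-documented shutoff point for this card:

```python
# One line per GPU, degrees C - this is what the csv,noheader query emits.
SAMPLE_QUERY = "52\n"

def too_hot(query_output: str, limit_c: int = 85) -> list[int]:
    """Return the reported temperatures that exceed the limit.
    The default limit is an assumption, not an NVIDIA spec value."""
    temps = [int(t) for t in query_output.split() if t.strip()]
    return [t for t in temps if t > limit_c]

print(too_hot(SAMPLE_QUERY))  # [] -> fine at 52C
print(too_hot("91\n"))        # [91] -> time to back off the load
```

In practice you would run the query in a loop via `subprocess` and throttle or stop jobs when anything comes back over the limit.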


My BIOS settings, in case you’re curious (screenshots + txt file)

Hi @sheim

Here is a working saved profile - for some reason mine isn’t text but in a binary format:
2022-03-30-working.CMO (29.5 KB)

I think some of the BIOS options appear/disappear when you turn other options on/off. I spent so much time trying many different combinations that now I don’t remember when this option appeared.

Perhaps changing: Primary Display [Auto] to something else might reveal new options?

The MOBO firmware version is 1402

The NVIDIA software is: Driver Version: 510.47.03 CUDA Version: 11.6

I initially made it work on Ubuntu 20.04, but later I had various issues and switched to 21.10 (probably 22.04 should be a better option now) and I also pushed the kernel to 5.15 (mainline).

CUDA version shouldn’t matter as long as it’s 11x - I started with an earlier version and then recently updated to 11.6.


Cooling - I haven’t tried using A100 with its original passive radiator other than to see that it was detected and run a very basic test. It was getting hot really fast, so I didn’t use it until I got water cooling figured out.

So, as you’re saying, the passive cooling should be enough for the card to be detected.


Is it possible that you don’t have enough PSU power to drive A100? I’m using 1200W PSU with 1070Ti and A100.
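A rough power-budget sanity check for that question - the load figures are nominal board-power assumptions (A100 PCIe ~300 W, 1070 Ti ~180 W, i9-class CPU ~125 W) plus a flat allowance for the rest of the system; real transients spike higher, so healthy headroom matters:

```python
def psu_headroom(psu_w: int, loads_w: list[int], other_w: int = 150) -> int:
    """Watts left after summing component loads and a flat misc. allowance
    (drives, fans, mobo). All inputs are nominal assumptions, not measurements."""
    return psu_w - sum(loads_w) - other_w

# A100 (~300 W) + 1070 Ti (~180 W) + CPU (~125 W) on a 1200 W PSU
print(psu_headroom(1200, [300, 180, 125]))  # 445 W of nominal headroom
```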

Perhaps the PCIe insertion order of cards matters? Switching them around perhaps?

Please let me know if I missed anything and you need some additional info.


Thanks @stas3 , tremendously helpful - I just took your CMO and put that on, and it worked like a charm.
I did turn the wi-fi card back on, adjust the fan speeds, and enable RAM overclocking (just to use the full RAM speed).

In case this is helpful to others, here’s our experience with it. We’re planning to use this primarily for deep RL using isaacGym. Currently we’re air-cooling the GPU using this vent from Linus Tech Tips - it’s not a great vent, but it gets the job done.

Full build: (stas3 means same as @stas3)

  • Asus ROG Maximus XIII Hero Z590, LGA1200 stas3
  • Corsair 7000D ATX PC case stas3
  • Corsair HX 1200 Watt PSU stas3 → this was difficult to get into the (above) case without removing the 3.5" HDD bays. Luckily we didn’t need them, and then there is plenty of space.
  • Corsair Vengeance LPX 64GB (4x 32GB) DDR4 → we noticed isaacGym uses a surprisingly large amount of RAM
  • Samsung 970 EVO Plus SSD 2Tb - M.2 NVMe
  • Intel Core i9-11900K CPU NOTE: must be an 11th gen CPU to support PCIe 4.0
  • Noctua NF-F12 IPPC 3000 PWM, 120mm fan for GPU vent
  • lots more Noctua fans (4x 140mm, 3x 120mm). This is probably overkill.
  • Corsair iCUE H150i Liquid CPU Cooler
  • GTX 1050ti GPU that we had lying around
  • Bonus get a wheel stand for the case: it’s a big, heavy build, and moving it around is a pain.

The oversized case is nice, both to make the build a bit easier, but also to make sure we could get enough airflow through, and mainly to make sure there was enough space for the 3d-printed vent, since we’re currently air-cooling the GPU. We also didn’t expect to have much load on the CPU, but decided to water-cool it mainly to avoid extra heat in the case.

We used PCIEX16_1 for the A100, and PCIEX16_3 for the 1050ti, and put the harddisk in M.2_3.
This is because PCIEX16_1, PCIEX16_2, M2_1, and M2_2 share bandwidth (see manual). Might be negligible, but ¯_(ツ)_/¯.

We tested this by running 9 jobs of isaacGym, with a total of roughly 200k environments, to fill up the GPU memory. FWIW, it’s not really faster than an RTX 3090 (or 3080ti), but the oodles of extra memory just allow you to run many more jobs at a time. At full load, and with fans going at ~80% (not even full blast, which was super noisy), we got a pretty stable 52C on the GPU. We might look into liquid cooling eventually, but it doesn’t seem necessary.

Curiously, on this build, each job took one CPU thread up to 100%, whereas on a 3090 box the CPU load is well distributed across threads. This might be an isaacGym installation issue, where it isn’t only using cuda, but I haven’t debugged it yet (so far just stress-testing to see that temps are okay).
