A100 PCIe isn't recognized by BIOS

I’m trying to get A100 80GB PCIe running:

Hardware setup:

  • MB: Z390 Aorus Pro
  • 1200W PSU, with two independent 12V PCIe 8-pin rails connected via the provided joiner connector.
  • Using x16 slot

The problem: The BIOS doesn’t recognize the card.

An RTX 3090 works just fine in the same system. I also tried removing it so that the A100 is the only card, but there was no change.

There is plenty of power, and the card gets hot, so it is clearly receiving power.

The MB is PCIe-3, but I verified that the PCIe-4 device should work just fine, just slower.

As this is very new hardware I can’t find any info about it, other than a brief doc https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/PB-10577-001_v02.pdf

I made sure to enable “Above 4G Decoding” in the BIOS, which I found to be important for some older generations.

Any suggestions on what else I could try?

Thank you.

Hi @stas3 ,

In general the passive cards like A100 PCIe are meant for servers not desktop systems, and we work with our OEMs and integration partners to qualify specific combinations of cards and servers (checking for airflow, BIOS compatibility, performance, etc.) - in hopes of avoiding the issues you’re seeing here. :-) For workstations, the RTX series of cards is usually a more compatible and better choice. Saying all that just as a preface so you know that we’re in uncharted and unsupported territory.

Assuming power and cooling aren’t a problem in your situation, one of the bigger things that the NVIDIA A100 and similar datacenter cards have is a large BAR1 space, which is why “Above 4G Decoding” is required. It also means that you may need to adjust MMIO Size and MMIO Base (if those are exposed in the Aorus Pro BIOS; they often aren’t), so that the BIOS can get some help mapping the card resources into the physical address (PA) space. Do you have those knobs in the BIOS? Can you adjust the base lower and the size bigger?
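
To make the “Above 4G Decoding” point concrete, here is a minimal Python sketch of the arithmetic. It assumes only that a PCI BAR is a power-of-two sized region aligned to its own size, and that without Above 4G Decoding every BAR has to be placed in the 32-bit window below 4 GiB. The name `fits_below_4g`, the `reserved_below_4g` parameter, and the sizes in the usage lines are hypothetical illustrations, not values read from an A100 or from any BIOS.

```python
FOUR_GIB = 1 << 32  # top of the 32-bit physical address space


def fits_below_4g(bar_size: int, reserved_below_4g: int = 0) -> bool:
    """Return True if a BAR of bar_size bytes could be placed below 4 GiB.

    bar_size must be a power of two. reserved_below_4g is a hypothetical
    amount of address space, counted from address 0 upward, that is already
    taken by RAM and other devices. Real firmware layouts are more complex.
    """
    if bar_size <= 0 or bar_size & (bar_size - 1):
        raise ValueError("BAR size must be a positive power of two")
    # A BAR starts at a multiple of its own size, so count the aligned
    # slots below 4 GiB that lie entirely above the reserved region.
    first_free_slot = -(-reserved_below_4g // bar_size)  # ceiling division
    total_slots = FOUR_GIB // bar_size
    return first_free_slot < total_slots


if __name__ == "__main__":
    MIB, GIB = 1 << 20, 1 << 30
    # Hypothetical sizes, for illustration only.
    print(fits_below_4g(256 * MIB, reserved_below_4g=2 * GIB))
    print(fits_below_4g(64 * GIB))
```

Any BAR larger than 4 GiB has zero aligned slots in the low window, which is why the firmware must be able to map it above 4 GiB.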

ScottE


Thank you very much for this detailed follow up, Scott

I totally understand that this is uncharted territory.

This BIOS has no MMIO* knobs. I tried to Google for any discussion in the context of this MB, but couldn’t find any.

Following your questions I installed the last month’s BIOS firmware which once “Above 4G Decoding” is enabled gives a new option:

Resizable Bar which has two options: Auto/Disabled and I set it to Auto. The brief one line description of that option said that the device needing this feature will do the right thing if set to Auto.

You can see the change here:

It says:

Enable Resizable Base-Address Register (Resizable-BAR) option to enhance GPU performance.

But the card is still not recognized in BIOS.

Here is the snapshot of the relevant settings (sorry about fuzzy outcome)

Additionally, some people suggested disabling CSM Support. When I do that, the machine spends about 2 minutes trying to reach the BIOS screen, never gets there, and I end up with a frozen screen.


So what alternative solutions would you recommend, Scott?

Is there a desktop MB that you know that works with this card?

Thank you.

Yeah, re-sizable BAR is a slightly different functionality. With that, the card, BIOS, and other elements can indicate preferences for space, etc. Related to what we want, but not exactly. I’m not aware of any desktop MBs that include all the tuning knobs you’d need (realistically those are often not even exposed in server BIOS’ - the OEM will tune and adjust the BIOS to work as part of their GPU qualification process, not necessarily expose them to end users).

Regarding CSM, you almost certainly don’t need CSM (CSM lets peripherals and OS’ that aren’t modern enough for UEFI to be used - that’s not the case in your system!), but not sure that’s really related to this particular problem.

I’m not sure where else to go with this.

What is the reason you’re building your own system rather than going for something like a DGX Station A100 (where we’ve integrated 4x NVIDIA A100 GPUs already)? Or why are you using an NVIDIA A100 GPU instead of a more workstation-appropriate card like the RTX 8000?

(Edit, since I forgot to give a useful suggestion…)

Unlikely to work, but something to try on the Z390 MB is to disable as many other peripherals as possible: disable any unused NIC, SATA controllers, audio, etc. You’re basically looking to free up any address space that could be grabbed by another device in the system (and hence prevent the BIOS from being able to “fit” the A100). I don’t have high hopes, but that’s another thing to try since you’re experimenting!


Thank you for the information of what is not relevant and a practical suggestion, Scott.

Is there a way to check if the system is low on address space from the booted OS, rather than turning peripherals off in BIOS? I’m on Ubuntu. I’m actually tapping into most of the MB features so I’m not sure if I have much that can be turned off. If you tell me what specifically you’re looking for I can figure out how to get that information if you don’t already have the instructions.
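
One place to look from a booted Linux system is `/proc/iomem`, which lists the physical address map (read it as root; unprivileged reads typically show zeroed addresses). The Python sketch below is only an illustration of how one might total the top-level `PCI Bus` windows below and above 4 GiB. The function name `pci_windows` and the sample text are made up for this example; the sample is not output from the machine discussed in this thread.

```python
import re

FOUR_GIB = 1 << 32
# Matches lines such as "  a0000000-afffffff : 0000:01:00.0"
IOMEM_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def pci_windows(iomem_text: str) -> dict:
    """Sum the sizes of top-level "PCI Bus" windows below and above 4 GiB."""
    totals = {"below_4g": 0, "above_4g": 0}
    for line in iomem_text.splitlines():
        match = IOMEM_LINE.match(line)
        if not match:
            continue
        indent, start_hex, end_hex, name = match.groups()
        # Only unindented entries are top-level; nested ones are carved
        # out of their parent window and would be double counted.
        if indent or not name.startswith("PCI Bus"):
            continue
        start, end = int(start_hex, 16), int(end_hex, 16)
        key = "below_4g" if end < FOUR_GIB else "above_4g"
        totals[key] += end - start + 1
    return totals


# Made-up sample in /proc/iomem format, for illustration only.
SAMPLE = """\
00001000-0009efff : System RAM
80000000-dfffffff : PCI Bus 0000:00
  a0000000-b1ffffff : PCI Bus 0000:01
    a0000000-afffffff : 0000:01:00.0
4000000000-7fffffffff : PCI Bus 0000:00
"""

if __name__ == "__main__":
    for region, size in pci_windows(SAMPLE).items():
        print(f"{region}: {size / (1 << 30):.1f} GiB")
```

On a real system the input would be `open("/proc/iomem").read()`. If no window shows up above 4 GiB, that would suggest the firmware is not offering 64-bit MMIO space, though this is a rough indicator rather than a diagnosis.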

The card was gifted to me hence the odd circumstance of unusual/unintended setup.

Is there even anecdotal info about someone trying to run an A100 PCIe on a desktop MB and succeeding at it? I totally understand that you probably won’t want to stand behind any such recommendations because the intended use is a server. And anecdotal info is just that – anecdotal; should I take the risk of buying the hardware mentioned in the anecdote, the responsibility is totally mine to bear. It’s just that this hardware is so new, I wasn’t able to find any other information on the Internet.

Thanks again!

Hi Scott,

Having the same issue here with an ROG STRIX Z690-G and 80GB A100. We have DGX Pods on site but need an A100 card in a non-server config using a small form factor machine for non-office use. We cannot use the “traditional” enclosures you can get for data centers that use these cards. Any more ideas as to what else I can investigate? I enabled Above 4G Decoding but don’t see any options for adjusting MMIO Size and Base. There is a YouTube video of a guy installing one of these in a desktop MB, so I assumed this was possible somehow. I also tried disabling a few things but no luck.

I did have some luck using the Razer Core eGPU chassis. The card shows up and is recognized by the system BUT refuses to work. Windows finds the card and has the right driver, but a few seconds later Windows complains about resources not being available for that device.

EJK


Hmm, @user152593 , I think unfortunately you’re hitting the pitfalls of running datacenter cards that have a large BAR1 space in consumer hosts that aren’t really set up for that. MMIO settings generally are not exposed, so it’s no shock you can’t find them in the Z690-G BIOS. The behavior of “see the card, but then complain about it when you try to use it” is symptomatic of the BIOS giving up on allocating resources, booting anyway, and then the driver reporting the problem to the OS when you try to use it. I’d imagine if this were Linux you’d see no address range assigned to the BAR1 on the card…so you’re out of luck. You would probably have better odds with a “Workstation” motherboard (like the ASUS X99-E-10G WS), which has more comprehensive PCIe capabilities, includes a PCIe switch, etc. Look for more PCIe lanes that hang off the CPU or explicit mention of a PCIe switch as a clue that the system might be more suitable.
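
For anyone who wants to check the “no address range assigned” case on Linux, each PCI device has a `resource` file under `/sys/bus/pci/devices/<address>/` with one line per region: start, end, and flags in hex. The Python sketch below parses that text. Treating a region with a zero start but a non-zero end or flags as “unassigned” is a heuristic assumed for this example, the function name is hypothetical, and the sample text (including its flag values) is made up rather than read from an A100.

```python
def parse_pci_resources(resource_text: str) -> list:
    """Parse the text of a sysfs PCI `resource` file into region records."""
    regions = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0 and flags == 0:
            status = "unused"
        elif start == 0:
            # Heuristic assumed by this sketch: the region has a size or
            # flags but no base address, so it was probably never placed.
            status = "unassigned"
        else:
            status = "assigned"
        size = 0 if status == "unused" else end - start + 1
        regions.append(
            {"index": index, "start": start, "size": size, "status": status}
        )
    return regions


# Made-up sample in sysfs `resource` format, for illustration only.
SAMPLE_RESOURCE = """\
0x00000000f6000000 0x00000000f6ffffff 0x0000000000040200
0x0000000000000000 0x0000000fffffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

if __name__ == "__main__":
    for region in parse_pci_resources(SAMPLE_RESOURCE):
        print(region["index"], region["status"], hex(region["size"]))
```

On a real system the input would be the contents of the card’s own `resource` file; the kernel log is still the more authoritative place to look for allocation failures.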

Practically, even if you do get it working, you’re going to likely run into thermal issues anyway - the A100 PCIe is a passive card that’s really meant to be cooled via chassis fans from a server that can duct air through them. Cards like the RTX A6000 include cooling that’s more appropriate for workstation/desktop environments.

I realize it’s possibly overkill, but we built the DGX Station A100 to help provide a solution to this issue. Figuring out the PCIe and cooling bits, and still having enough useful “stuff” to be a workstation is pretty complex.

Sorry, no easy answer for you there.

ScottE


Hi Scott,

Thanks for the quick reply. For cooling we are doing this: Nvidia A100 GPU | Quick water block installation process. - YouTube

What NVIDIA is missing is an eGPU enclosure for single card that is portable and not designed for “office” use. We have DGX PODs and the workstations as well. But we need a powerful mobile field solution and wanted best performance rather than resorting to gaming GPUs.

I will try another MB and see what happens! Thanks for your help.

EJK

I’m sad to hear the one you tried didn’t work; I was hoping that my current MB is just too old and that more modern MBs would just work. If you find a working MB, please share which one.

re: fast mobile solution - FWIW my benchmarking of machine learning training (pytorch / HF Transformers) with mixed precision was actually faster with RTX-3090 if you limited A100 to the same memory size to compare the same setup (due to the faster clock). So for example 2x RTX-3090 setup is likely to provide a faster overall performance than 1x A100 40GB at a much lower cost and no hardware issues - with an additional electricity cost to feed 2x cards. Mind you, I only benchmarked some of the functionality (full + half precision computations), so I don’t know if my findings apply to other features provided by the cards.

Of course to compete with A100 80GB it’d be much more complicated, as you’d need 4x RTX-3090 cards. But it’s still probably a cheaper solution unless you need the 80GB in one chunk and can’t parallelize your compute needs over 4 x 24GBs. And there will be no MIG and other goodies A100 provides. This is probably not a good substitute.
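
The card counts above follow from a simple capacity calculation, sketched here in Python. It compares total memory only and says nothing about speed, interconnect, or whether a given workload can actually be split across cards. The function name is hypothetical.

```python
def cards_needed(target_gb: int, per_card_gb: int) -> int:
    """Smallest number of cards whose combined memory reaches target_gb."""
    if target_gb <= 0 or per_card_gb <= 0:
        raise ValueError("sizes must be positive")
    return -(-target_gb // per_card_gb)  # integer ceiling division


if __name__ == "__main__":
    # 24 GB is the RTX 3090 memory size used in the comparison above.
    print(cards_needed(40, 24))
    print(cards_needed(80, 24))
```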

@user152593, are you referring to this youtube video? Insane benchmarks🔥 Intel Core i9-12900 | DDR5 | A100 | ASUS ProArt Z690 - YouTube

There it says ASUS ProArt Z690-CREATOR WIFI.

If you’re referring to a different one could you please share which?

Hi stas3,

No, it’s a different video, but thanks for sharing that one. They say they failed because they were using the integrated video, and the output signal was not recognized with the A100 plugged in. So now I am going to try putting a cheap video card in there, because I suspect that ASUS thinks the A100 is a video card when it’s not and is perhaps halting because it thinks it can’t display anything. I thought I had changed a setting in the BIOS to prevent this, but now I have another avenue to pursue! Thanks for your comments!!!

Hello everyone,

I finally got my A100 80GB card working with an ASUS ROG STRIX Z690-G GAMING WIFI. Just a quick recap. The motherboard would not POST and kept halting with the white VGA QLED meaning something is wrong with your graphics card or that none is present. I have an Intel Core i9-12900HK which has integrated graphics and the back of the motherboard has an HDMI connector “connected” directly to the integrated graphics. Even though in the BIOS I explicitly stated to use the integrated graphics, whenever the A100 was present, the system would not boot. If I disconnected the A100 power, no issues at all.

So a friend of mine lent me an NVIDIA 1080 card and I put that in just to see what would happen. Same problem. However, when I moved the HDMI cable from the motherboard HDMI connector to the 1080 connector with the monitor on, then voila! I was able to boot without issue and everything works fine. Once booted I can move the cable over. I will log a bug with ASUS and see what they say, because right now I am losing a slot which I need (this board only has 3 PCI slots).

If anyone has any idea given this new information as to what else I can enable or disable in the BIOS to make this work without another graphics card, I would truly appreciate it.

The other thing to note is that you cannot really run this card as is unless you install crazy fans. You need to remove the block and put a water cooling adapter on it (which I am doing next week). For some strange reason, these cards run very hot when idling.

EJK



@user152593, thank you very much for posting your success story.

I got a new MB: ROG Maximus XIII Hero (z590), but still struggling to get the card recognized.

I tried to follow your recipe and was able to POST, but the card is still not recognized.

As you said, trying to use the iGPU (CPU Graphics) and the A100 leads to the system not POSTing (d4 - PCI resource allocation error. Out of Resources).

So I did as you said - inserted another old nvidia card and connected it to the monitor. Now it POSTs and boots just fine, but still not seeing A100.

I also tried changing the order of the cards (A100 2nd) - but no change in the outcome. Is your A100 first or 2nd?

This new MB’s BIOS has too many options so it’s a bit hard to know if I have missed some option.

Would it be possible for you to share which BIOS configs you have enabled/disabled?

Also are you connecting 2 independent rails to A100 via the y-connector that came with it? This is what I did.

I also read that one can go into the UEFI Shell and try to figure out what’s wrong there. At least for the D4 situation.

Thank you!

I played some more with BIOS settings and I succeeded using your recipe, @user152593 ! nvidia-smi reports A100 / 80GB detected. I still need to test that it actually works, but this is a huge step, so I’m very grateful for your sharing.

And I need to go over BIOS and make sure to log all the settings that lead to it working - as I have tried so many different combinations before it worked.

So now need to quickly test that it works and then wait for the cooling block as it heats up really fast.

I will share all the details once I have them written down.

Of course, it’d be great to have iGPU work with this, but perhaps some future BIOS update will fix it.
If you’re posting a bug report somewhere in Asus support please share the link here so that I could 2nd your report. Thank you!

I am still going back and forth with ASUS to see how we can get the A100 working without the need for a PCIe graphics card. I told them that it’s most likely a bug in the BIOS: when it finds no HDMI connector or connection on “a” PCIe graphics card (which the A100 isn’t), it tries to look for another PCIe graphics card, and the logic for some reason skips the PEG on onboard graphics settings. The if-then-else logic should be:

Do I have integrated graphics? Yes.
Is the BIOS graphics set to PEG? Yes.
Is the HDMI cable or DisplayPort plugged in to a powered monitor for PEG? Yes.
Then it does not matter what GPUs I have with respect to VGA POSTing. Just keep going.
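
Written out as a small Python function, the proposed check looks like this. It is only a sketch of the logic described in this post, not how any ASUS BIOS is actually implemented, and the function and parameter names are hypothetical.

```python
def should_continue_post(
    has_integrated_graphics: bool,
    primary_display_selected: bool,
    monitor_connected_and_powered: bool,
) -> bool:
    """Return True if POST should proceed regardless of the add-in cards.

    primary_display_selected stands for the BIOS graphics setting the post
    refers to as PEG; monitor_connected_and_powered stands for an HDMI or
    DisplayPort cable plugged in to a powered monitor.
    """
    return (
        has_integrated_graphics
        and primary_display_selected
        and monitor_connected_and_powered
    )


if __name__ == "__main__":
    # With all three conditions met, a compute-only card should not matter.
    print(should_continue_post(True, True, True))
```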

For the eGPU, what I have discovered is that the reason it works on my Dell (briefly) is because the Thunderbolt controller there allows for monitors over Thunderbolt whereas the native Thunderbolt on my ASUS does not. It expects you to use the HDMI connector and does not support displays over Thunderbolt. So I have purchased and am waiting on a Thunderbolt PCI card from ASUS that does support the display protocol. Which again tells me somehow hardware vendors are not talking to each other. The A100 is NOT a GPU that supports displays. It should be treated as a PCI compute card or something else. The BIOS should create a new category or simply ignore cards that only provide compute. Perhaps that’s an over simplification or it’s a slightly different type of problem but bottom line is the BIOS should ignore all compute cards with respect to VGA, graphics or display tests. We might be paying the price for compute cards having emerged first and foremost as graphics cards for video games.

On another note, I haven’t watched the whole thing, but someone said to have a look at this guy here as well for BIOS settings. It seems the crypto-miners have spent a lot of time debugging this stuff because they don’t want video but rather as much compute as possible.


thanks a lot for the details and the video, @user152593 - how could we join efforts at communicating with ASUS support? Surely more users asking should have more weight, no?

and your logic makes total sense, especially wrt A100 not really being a video card anymore.

OK, so for posterity here is my recipe for getting NVIDIA A100 PCIe to work on “ROG Maximus XIII Hero” w/o being able to use iGPU and needing to insert another PCI video card to be used with the monitor.

BIOS setup:

Advanced:

  Advanced System Agent (SA) configuration

    Graphics Configuration:
      Primary Display: Auto (probably could be set to PEG)
      IGPU Multi-Monitor: Disabled

    Memory Configuration:
      Memory Remap: Enabled (above 4GB)

    PCI Subsystem Settings
      Above 4G Decoding: Enabled
      Resize Bar: Enabled
      SR-IOV Support: Enabled

and the reason it wasn’t working originally is because by default it had SR-IOV Support: Disabled

Another side effect of the extra card is that A100 appears as a 2nd gpu in nvidia-smi.
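
To confirm which index the A100 received, one option is to parse the output of `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader`. The Python sketch below does that. The helper names are hypothetical, and the sample string is made up to illustrate the CSV format; it is not output captured from this system.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader",
]


def query_gpus() -> str:
    """Run nvidia-smi and return its CSV output (needs the NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return result.stdout


def parse_gpu_list(csv_text: str) -> list:
    """Turn 'index, name, memory' CSV lines into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) != 3:
            continue  # skip anything that is not a three-field row
        index, name, memory = parts
        gpus.append({"index": int(index), "name": name, "memory": memory})
    return gpus


def find_gpu(gpus: list, needle: str) -> list:
    """Return the GPUs whose name contains needle (case-insensitive)."""
    return [gpu for gpu in gpus if needle.lower() in gpu["name"].lower()]


# Made-up sample in the CSV format, for illustration only.
SAMPLE = "0, NVIDIA GeForce GTX 1080, 8192 MiB\n1, NVIDIA A100 80GB PCIe, 81920 MiB\n"

if __name__ == "__main__":
    # On a machine with the driver installed, use query_gpus() instead.
    for gpu in find_gpu(parse_gpu_list(SAMPLE), "A100"):
        print(gpu["index"], gpu["name"], gpu["memory"])
```

The index found this way can then be used to pin work to the A100, for example via `CUDA_VISIBLE_DEVICES`.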

So now waiting for the EKWB water cooling block to arrive.

BTW, I found at least 3 vendors selling the blocks at the moment and here is some other research I have compiled on cooling A100s:

  1. Air cooling:
  2. Water cooling.

There are at least 3 manufacturers of water blocks for A100:


FYI, the A100 water block sold by EKWB is not compatible with A100 80GB - it can only work with the 40GB version.

As you can see the 80GB model has a metal frame added around the gpu chip which wasn’t there in the 40GB version:

If one has access to a CNC machine, one could probably cut a square groove in the 40GB water block to make it work. I don’t know if there are any incompatibilities other than this one.


Wow, so the smooth surface of the water block extends past the chip area? Or is it some other part of the block that interferes?

Yes, for the 40GB version it’s flat, you can see that the 40GB one is very different around the main chip and requires no groove:

The image is from Bykski GPU Block, For NVIDIA TESLA A100 40GB, Full Cover Liquid Cool – FormulaMod