P100 not showing up in nvidia-smi

Hi,

I was using an NVIDIA Titan X card on a computer and it was working fine, but when I swapped it for an NVIDIA Tesla P100, the new card does not show up in nvidia-smi. I updated the drivers to 375.51.

Output of lspci | grep NVIDIA

0f:00.0 VGA compatible controller: NVIDIA Corporation GF106GL [Quadro 2000] (rev a1)
0f:00.1 Audio device: NVIDIA Corporation GF106 High Definition Audio Controller (rev a1)
42:00.0 3D controller: NVIDIA Corporation Device 15f8 (rev a1)

Output of dmesg |grep NVRM

[   22.298530] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[   22.298530] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:42:00.0)
[   22.298532] NVRM: The system BIOS may have misconfigured your GPU.
[   22.298565] NVRM: The NVIDIA probe routine failed for 1 device(s).
[   22.298567] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  375.51  Wed Mar 22 10:26:12 PDT 2017 (using threaded interrupts)
[   27.468189] NVRM: Your system is not currently configured to drive a VGA console
[   27.468192] NVRM: on the primary VGA device. The NVIDIA Linux graphics driver
[   27.468193] NVRM: requires the use of a text-mode VGA console. Use of other console
[   27.468194] NVRM: drivers including, but not limited to, vesafb, may result in
[   27.468195] NVRM: corruption and stability problems, and is not supported.

There is your problem. Check your BIOS setup to see whether you can dial in a BAR1 aperture of the required size. You may need to install the latest system BIOS for your platform for this to work, or your system BIOS may not support this at all.
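
For anyone checking several machines, the log lines quoted above can be scanned automatically. This is only a minimal sketch: the regular expression is modelled on the single "BAR1 is 0M @ 0x0" line in this thread, and the unit letters K/M/G and the function name find_unassigned_bars are my own assumptions, not anything documented by NVIDIA.

```python
import re

# Pattern modelled on the one line seen in this thread:
#   NVRM: BAR1 is 0M @ 0x0 (PCI:0000:42:00.0)
# Assumption: other sizes would use the same layout with a K/M/G suffix.
BAR_LINE = re.compile(
    r"NVRM: BAR(?P<bar>\d+) is (?P<size>\d+)(?P<unit>[KMG]) "
    r"@ (?P<addr>0x[0-9a-fA-F]+) \(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bars(dmesg_text):
    """Return (pci_address, bar_index) pairs whose size or base address is zero."""
    problems = []
    for line in dmesg_text.splitlines():
        match = BAR_LINE.search(line)
        if match is None:
            continue
        size = int(match.group("size"))
        addr = int(match.group("addr"), 16)
        if size == 0 or addr == 0:
            problems.append((match.group("pci"), int(match.group("bar"))))
    return problems


# Lines copied from the dmesg output earlier in this thread.
sample = """\
[   22.298530] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[   22.298530] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:42:00.0)
[   22.298532] NVRM: The system BIOS may have misconfigured your GPU.
"""

for pci, bar in find_unassigned_bars(sample):
    print(f"{pci}: BAR{bar} was not assigned an address range")
```

To use it on a live system, feed it the text of dmesg | grep NVRM instead of the sample string.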

What is your host system (maker, model)? Is this a server enclosure that can provide sufficient forced air flow to cool the passively cooled Tesla P100? If not, the GPU will overheat quickly and shut itself down to prevent permanent damage. Or is this an actively cooled Quadro GP100 by any chance?

Are there any specific or minimum hardware requirements to make it work?

With modern Tesla GPUs, the “safe” and (as far as I can perceive NVIDIA’s intentions) intended approach is that customers buy them already integrated into a system from an integrator that has partnered with NVIDIA. The integrators are aware of all the technical issues that use of Tesla GPUs entails. NVIDIA publishes a list of these integrators on its website.

If you build your own home-brew system with a Tesla GPU, you are pretty much on your own. Numerous posts in these forums demonstrate that people run into problems doing that. I would assume most of them do not have a background in building and configuring HPC systems.

While I am tangentially familiar with some of the more common issues that arise when adding a Tesla GPU to a system (such as the requirement for a large BAR1 aperture), I am neither familiar with the P100 nor do I know your motherboard or server system. You can poke around in your system BIOS setup to see what options it provides for setting up the aperture. And make sure you provide proper cooling for the P100.

Thanks for the help, I have systems with the following three types of motherboard-cpu combination:

  • Asus MAXIMUS VIII HERO with Intel(R) Core(TM) i7-6700K CPU
  • Asus X99-E WS with Intel(R) Xeon(R) CPU E5-2620 v3
  • HP 0AECh with Intel(R) Xeon(R) CPU X5690

Will it work with any of the above configurations?

I have encountered the same problem as you. How did you solve it?

Order the P100 in an OEM server, from an OEM that has designed the system to support the P100.

I have a P100 that nvidia-smi recognises in Win10 (see screenshot attached). I bought and returned three other P100s that nvidia-smi did not recognise. Why is this, and what correction is needed for new P100s? Thanks. (I am delighted with the performance of the P100 I have.)

It looks to me like NVIDIA has designed the P100 to support OEM server manufacturers’ and NVIDIA’s interests. A bit of a stitch-up really. To be honest, there’s no great task in blowing a lot of air through the P100. It’s been running on my Win10 workstation for many months, cool and “overclocked” at 1329. It’s not a dark art to run it well. It is a dark art trying to figure out how NVIDIA has locked ‘most’ P100s to expensive OEM servers.

Seriously? :-)
When the P100 was released, the launch price was US$5,699 (in 2016 dollars). This probably restricted it to data centre users doing serious work who would require ECC RAM: the sort of customer who wouldn’t want to be spending that sort of money with no guarantee all the parts would work together.

Table 3 in this document contains a range of server chassis that have been qualified for use with the P100. At least one poster on this forum has sourced something suitable from eBay, for not much money.

Regarding the P100s you returned: did you know for certain that they were fully functional units?

These are eBay P100s (~£300 each), each from a different vendor. I’m not really in the market to buy a server chassis (I may have to get an RTX A4000 16GB instead). One of the returned P100s was subsequently checked by its vendor and found to be OK in an HP ProLiant DL380 server; this vendor used Geekbench (screenshot attached, taken on my workstation with the incompatible P100 installed). Each incompatible unit had the following symptoms (screenshots attached):

  1. in device manager, display adapter, P100 is recognised, but : “This device cannot find enough free resources that it can use. (Code 12)”
  2. At the C: prompt, nvidia-smi reports: Failed to initialize NVML:: Not found
  3. GPU-Z: BIOS version Unknown
  4. nvflash64
    NVIDIA display adapters present in system:
    No NVIDIA display adapters found.



You’ve probably already covered the following points, but I offer them based on some limited experience and information I’ve come across.

The “This device cannot find enough free resources that it can use. (Code 12)” message is possibly indicating the same information as outlined under Linux in the first post of this thread - that BAR1 is not configured correctly.

Looking at Table 3, on page 3 here, it seems that BAR1 can be configured in either “Compute (default)” or “Graphics” mode.

Is it possible that the P100 you can successfully use is configured in “Graphics” mode and that the lower BAR1 size is able to be handled by your motherboard, where the “Compute” mode requirement cannot?

If so, perhaps you could get a vendor to reconfigure the card prior to shipping.

I had a brief look through your motherboard manual and could not find a BIOS setting that may allow BAR adjustment.

There’s a cautionary post along similar lines for the RTX A4000 here.

Thanks for your suggestion. I entered the following command at the C: prompt:
nvidia-smi --help-query-gpu

The resulting help text includes:

“Section about driver_model properties
On Windows, the TCC and WDDM driver models are supported. The driver model can be changed with the (-dm) or (-fdm) flags. The TCC driver model is optimized for compute applications. I.E. kernel launch times will be quicker with TCC. The WDDM driver model is designed for graphics applications and is not recommended for compute applications. Linux does not support multiple driver models, and will always have the value of “N/A”. Only for selected products. Please see feature matrix in NVML documentation.

“driver_model.current”
The driver model currently in use. Always “N/A” on Linux.”

Then I entered the following instructions for the working P100 (pls see screenshot):

nvidia-smi

and to assess/confirm further:

nvidia-smi --format=csv --query-gpu=driver_model.current

So it looks like the working P100 is in compute mode.
Thank you for the RTX A4000 reference.
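
For anyone who wants to script the driver-model check rather than read it off the screen, here is a rough sketch. It only uses the --query-gpu and --format=csv options shown above plus the name field; the helper names parse_query_csv and query_gpus are made up for illustration, and the sample text is an illustrative stand-in for the CSV layout (one header row, one row per GPU), not output captured from my machine.

```python
import csv
import io
import subprocess

# Fields to request; both are listed by "nvidia-smi --help-query-gpu".
QUERY_FIELDS = ["name", "driver_model.current"]


def parse_query_csv(text):
    """Turn nvidia-smi CSV text (header row plus one row per GPU) into dicts."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [[cell.strip() for cell in row] for row in reader if row]
    if not rows:
        return []
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def query_gpus():
    """Run nvidia-smi and parse its output; requires nvidia-smi on the PATH."""
    cmd = ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS), "--format=csv"]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_query_csv(result.stdout)


# Illustrative sample only, shaped like the CSV that the query prints.
sample = "name, driver_model.current\nTesla P100-PCIE-16GB, TCC\n"

for gpu in parse_query_csv(sample):
    print(gpu["name"], "->", gpu["driver_model.current"])
```

On a machine where the card is visible, calling query_gpus() instead of parsing the sample should return one dictionary per GPU. If the card fails to initialise, as with the incompatible units, nvidia-smi exits with an error and subprocess.run raises CalledProcessError.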

Sorry, I’m out of suggestions now. It would be interesting to know the VBIOS version on the non working cards - perhaps there have been changes there.

I’ve also just realised I pointed you to a thread about the RTX A6000, not A4000.

Now fixed, thanks to your suggestion. Details to follow. Thanks for your invaluable help.

It turns out that the TCC and WDDM readings were red herrings.

The challenge was getting access to the “incompatible” P100. Fortunately my son is a keen gamer and has a recent mobo with a BIOS with “resize BAR” option:

[screenshot: BIOS setup with the “resize BAR” option]

I installed the incompatible P100 in my son’s machine and I switched on “resize BAR” in my son’s BIOS. The P100 loaded up correctly. This gave me access to the incompatible P100. It had actually been loaded with an identical vBIOS. I then ran nvidia-smi -q on each of the working and incompatible P100s and did a file compare in Notepad++. The only differences were:

I then switched off ECC in the NVIDIA Control Panel for the incompatible P100. Then, using nvflash64, I changed the gpumode from “compute” to “graphics”, as follows:

nvflash64 --save backup.rom
nvflash64 --gpumode graphics

screenshot of the outcome:

What was an “incompatible” P100 now works with my much older machine :).
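
The file compare of the two nvidia-smi -q dumps that I did in Notepad++ could equally be done with a few lines of Python. This is just a sketch using the standard difflib module. The function name differing_lines is my own, and the two sample strings are deliberately made-up placeholders, not real nvidia-smi output.

```python
import difflib


def differing_lines(text_a, text_b):
    """Return only the lines that differ between two text dumps.

    Lines present only in text_a are prefixed with "- ",
    lines present only in text_b are prefixed with "+ ".
    """
    a = [line.rstrip() for line in text_a.splitlines()]
    b = [line.rstrip() for line in text_b.splitlines()]
    return [d for d in difflib.ndiff(a, b) if d.startswith(("- ", "+ "))]


# Placeholder text standing in for two saved "nvidia-smi -q" dumps.
working = "Setting A : on\nSetting B : 1\n"
incompatible = "Setting A : on\nSetting B : 2\n"

for line in differing_lines(working, incompatible):
    print(line)
```

With real dumps (for example saved with nvidia-smi -q > working.txt on each card), expect per-card fields such as the timestamp, serial number and UUID to show up as differences alongside anything that actually matters.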

Glad it’s worked out. For what it’s worth, disabling ECC is very likely optional - it will be the mode that’s critical.

If you’re doing work requiring accuracy, you should be able to enable ECC.

Yes, I can switch on ECC without trouble. For info: after the steps above, the reconfigured, now-working P100 is in TCC mode and resizable BAR is disabled: