Can the driver be made to load without Resizable BAR?

I’m running an HP ProLiant DL360p Gen8 with dual Xeon E5-2697 v2 CPUs and 768 GB of RAM. I updated the BIOS to the latest version (May 2024).
I do a lot of data processing, and so far I haven’t worried about processing time on a legacy system, only about having the system resources to handle the data.
I’ve been getting into AI, and while I can run it on the CPU, it’s dreadfully slow, so I want to upgrade the system with a compute card.
I’ve obtained an NVIDIA Tesla A2 card that the system can run, but I cannot get the Linux kernel driver to load.

nvidia: loading out-of-tree module taints kernel.
nvidia: module license 'NVIDIA' taints kernel.
Disabling lock debugging due to kernel taint
nvidia: module license taints kernel.
nvidia-nvlink: Nvlink Core is being initialized, major device number 243

nvidia 0000:07:00.0: enabling device (0040 -> 0042)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR1 is 0M @ 0x0 (PCI:0000:07:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR2 is 0M @ 0x0 (PCI:0000:07:00.0)
NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:\x0aNVRM: BAR5 is 0M @ 0x0 (PCI:0000:07:00.0)
NVRM: loading NVIDIA UNIX x86_64 Kernel Module 565.77 Wed Nov 27 23:33:08 UTC 2024
nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 565.77 Wed Nov 27 22:53:48 UTC 2024
[drm] [nvidia-drm] [GPU ID 0x00000700] Loading driver
[drm] Initialized nvidia-drm 0.0.0 for 0000:07:00.0 on minor 2

Every post I’ve come across says I need to enable the Resizable BAR (Above 4G / 64-bit decoding) option in the BIOS. However, if I enable this option, my machine fails to POST with an invalid opcode error, and I have to reset the NVRAM just to get back into the BIOS.

I’ve tried adding the kernel command line arguments pci=realloc and pci=realloc=off, but neither changes what the kernel log reports.
nvidia-bug-report.log.gz (144.3 KB)

My limited understanding is that I might not get the full speed of the card without Resizable BAR, but it should still be faster than the CPUs for TensorFlow or Stable Diffusion workloads.

You might need to disable CSM if you enable 4G decoding / ReBAR.

Correct me if I’m wrong, but CSM is part of a UEFI-based system. This is not UEFI-based hardware, but older BIOS hardware.

Additionally, I’ve searched for such an option, and cannot find any reference to something to disable.

True, CSM applies to EFI firmware.
I’m not aware that Above 4G decoding is required for a card to work… but the Tesla A2 might be different.

There’s a module option to turn off ReBAR in the driver; perhaps try that.

Do you know where the documentation on that option is? Or the option itself?

I’m not by the computer right now but I think typing ‘modinfo nvidia’ should tell you.

That gave me some information. An additional search pointed me to:
/proc/driver/nvidia/params

But if I’m reading that correctly, it’s already off?

ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 27
DeviceFileMode: 432
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
TCEBypassMode: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 1
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: "/var/tmp"
ExcludedGpus: ""
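
To read a single value from that dump instead of scanning the whole list, something like the following should work. It is only a sketch: nv_param is a made-up helper name, and the default path assumes the driver exposes its parameters at /proc/driver/nvidia/params.

```shell
# nv_param NAME [FILE]: print the value of one parameter from the driver's
# "Name: value" parameter dump. nv_param is a hypothetical helper name.
nv_param() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }' \
        "${2:-/proc/driver/nvidia/params}"
}
```

For example, `nv_param EnableResizableBar` prints the value the loaded module is currently using, which is a quick check after changing module options.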

Yeah, looks like it’s off. I wish I knew. Have you tried searching for HP ProLiant and NVIDIA?

I’ve tried several times, but had no luck finding information that matches my setup. The handful of results that come close concern newer generations of the platform. If my system were Gen9 rather than Gen8, it would be compatible with Resizable BAR, and the errors would be the kind you mentioned, where CSM also needs to be disabled.
But since I’m on Gen8, my understanding is that the CPUs don’t support what Resizable BAR needs (hence the invalid opcode).

It makes me wonder if the information in /proc/driver/nvidia/params is inaccurate because the driver throws its memory errors during load. I’ll dig into how to add the right config file to set the driver’s EnableResizableBar = 0 when it loads.

You need to set NVreg_EnableResizableBar=0 as a module option for the nvidia module.
How you do that depends on your distribution.
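
As a rough sketch of what that can look like on a distribution that reads /etc/modprobe.d, the helper below writes the option to a config file. The function name and the file name nvidia.conf are assumptions; the parameter name is the one given above. Module parameter names are case-sensitive, so the spelling has to match exactly.

```shell
# Write the module option to a modprobe config file (needs root for the
# default path). set_nvidia_rebar_off and nvidia.conf are assumed names.
set_nvidia_rebar_off() {
    conf="${1:-/etc/modprobe.d/nvidia.conf}"
    printf 'options nvidia NVreg_EnableResizableBar=0\n' > "$conf"
}

# Kernel command line form of the same option (e.g. in the GRUB config):
#   nvidia.NVreg_EnableResizableBar=0
```

After calling `set_nvidia_rebar_off` as root, the module has to be reloaded (or the machine rebooted) for the option to take effect. If the nvidia module is loaded from an initramfs, that image may need to be regenerated as well.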

Sorry, it’s been a while. I’ve tried adding the option to my GRUB kernel command line as:
nvidia.nvreg_enableresizablebar=0 and
nvidia.enableresizablebar=0.
I’ve also tried adding both spellings to my distribution’s (Gentoo) modules.conf files.
The driver still complains about the BAR regions having a 0M size.
In one combination, dmesg reported 5 BAR regions instead of 3, so the entries did something, but I still could not get the driver to load.

Does it need a combination of options to convince it?
Could there be something else in my kernel options that needs to be changed / recompiled to get the driver to load?

Thanks for all the help so far, and hopefully we can figure out the magic combination to get this to load.

I don’t know if the card requires Above 4G decoding or not… I assume your BIOS doesn’t have that setting?
From the snippet above I don’t actually see the driver failing to load; what does nvidia-smi say?
If someone from NVIDIA could give any hints regarding this issue, it would be nice.

As I’ve mentioned, my BIOS has a hidden option for Above 4G decoding, but enabling it causes the system to halt during POST with an invalid opcode, and I have to reset the CMOS to recover. In my searches I’ve found similar HP systems with the same CPU that mention a DIP switch setting. My manual does not list that setting, but I tried it anyway with no change in the result.

nvidia-smi only says no cards found. The module shows as loaded, but I think it can’t finish initializing because of the memory mapping errors.
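
One way to see what the kernel actually assigned is the device’s sysfs resource file, which lists one "start end flags" line per PCI region. The sketch below (bar_report is a made-up name) prints each region’s start and size and flags all-zero entries. An all-zero line is not always an error: it can also be an unused slot or the upper half of a 64-bit BAR, so it is worth comparing against `lspci -vv`.

```shell
# bar_report FILE: summarize a sysfs PCI "resource" file, which holds one
# "start end flags" line (hex) per region. bar_report is a hypothetical name.
bar_report() {
    i=0
    while read -r start end flags; do
        if [ $(($start)) -eq 0 ] && [ $(($end)) -eq 0 ]; then
            echo "region $i: unassigned"
        else
            # Size in KiB; both ends of the range are inclusive.
            echo "region $i: start=$start size=$(( ($end - $start + 1) / 1024 )) KiB"
        fi
        i=$((i + 1))
    done < "$1"
}
```

With the PCI address from the log above, the call would be `bar_report /sys/bus/pci/devices/0000:07:00.0/resource`.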

With older Teslas, switching the card to “Graphics” mode worked around the problem you’re seeing.

For example, looking at Table 3 in the Product Brief for the Tesla P100, you can see the BAR1 memory requirement is considerably lower in “Graphics” mode.

Looking at Table 3 in the A2 brief, there is no such option mentioned. That’s not to say it won’t work, so it’s worth trying.

The unfortunate thing is that Teslas ship in “Compute” mode by default. Switching modes is done via nvidia-smi, and as you can’t run that yet, you need to do the switch on a machine that does support the card.

See here for someone’s experience with the P100.

Thanks for the information. I’ve put the card in a spare computer so I can try to change those settings. I did see in lspci that the card should support different BAR1 sizes. With a temporary Windows 10 install on a machine that supports Above 4G decoding / Resizable BAR, I can get nvidia-smi to give info about the card. Following the instructions in the “experience with P100” link you provided, I tried using the nvflash64 utility to change the mode to graphics, but it reports that the mode is not supported on this card.

C:\Users\test\Downloads\nvflash_5.821>nvflash64.exe --gpumode graphics
NVIDIA Firmware Update Utility (Version 5.821.0)
Copyright (C) 1993-2023, NVIDIA Corporation. All rights reserved.


WARNING: This operation updates the firmware on the board and could make
         the device unusable if your host system lacks the necessary support.

Are you sure you want to continue?
Press 'y' to confirm (any other key to abort):
y
Specifed GPU Mode "physical_display_enabled_256MB_bar1"


Update GPU Mode of all adapters to "physical_display_enabled_256MB_bar1"?
Press 'y' to confirm or 'n' to choose adapters or any other key to abort:
y

Updating GPU Mode of all eligible adapters to "physical_display_enabled_256MB_bar1"

Graphics Device      (10DE,25B6,10DE,157E) S:00,B:01,D:00,F:00

Specified GPU mode not supported on this device 0x25B6.

I guess that explains why the Product Brief doesn’t mention it, and it means your only option is a PC with the necessary support. It was worth trying, though.

I agree. I’d found another post discussing an open-source driver and how it always selects the largest BAR mode in the list. I wonder if the official driver does the same and, if so, what option tells it to default to a different mode?
I’ll have to search for that post about the open-source driver’s behavior again (I’m not on my main computer currently).
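
On picking a smaller BAR size from the Linux side: recent kernels expose a per-BAR resourceN_resize file in sysfs for devices with the Resizable BAR capability. According to the kernel ABI documentation, reading it returns a hex bitmap in which bit n means a size of 2^n MB is supported, and writing a bit position requests that size while no driver is bound to the device. Whether that helps on a Gen8 board is untested here. The sketch below only decodes the bitmap, and rebar_sizes is a made-up name.

```shell
# rebar_sizes BITMAP: list the BAR sizes encoded in a resourceN_resize bitmap
# (hex, with or without a leading 0x). Bit n set means 2^n MB is supported.
# rebar_sizes is a hypothetical helper name.
rebar_sizes() {
    hex=${1#0x}
    mask=$((0x$hex))
    n=0
    while [ "$mask" -ne 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            echo "$((1 << n)) MB"
        fi
        mask=$((mask >> 1))
        n=$((n + 1))
    done
}
```

Assuming BAR1 maps to resource1 on this card, usage would be `rebar_sizes "$(cat /sys/bus/pci/devices/0000:07:00.0/resource1_resize)"`.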

Otherwise, I’ve found some suggestions that on a different HP model (same CPUs) there was a DIP switch to set along with Above 4G decoding to get it to work. However, although that DIP switch is listed as unused on my board, setting it did not stop the Above 4G option from throwing the invalid opcode. Perhaps a different combination is needed on my particular system, but it’s horribly undocumented. And since I’m a home user without an HP maintenance plan, asking them for support is likely to be difficult, but I might see if there is a forum (Reddit or otherwise) that can help.
