Problem getting NVIDIA K40 running under Ubuntu 21.10

Hi,

I’ve been struggling to get K40 cards detected under Ubuntu 21.10 on an HP DL580 G7.

I’ve tried reseating the cards (I have a total of 4 donated K40 cards), inserting one card at a time, and double-checking the power connections (tried 1x 8-pin, and 1x 8-pin plus 1x 6-pin at the same time).

Sharing troubleshooting information below; I currently have two K40 cards in the machine.

Would appreciate any pointers, thanks in advance.

The result of nvidia-smi is “No devices were found”. See below:

$ sudo nvidia-smi
No devices were found

I ran nvidia-bug-report.sh and the results are attached as nvidia-bug-report.log.gz.

I was poking around in the output of dmesg; some relevant sections are highlighted below:

[   44.495239] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  470.103.01  Thu Jan  6 12:10:04 UTC 2022
[   44.518197] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  470.103.01  Thu Jan  6 12:12:52 UTC 2022
[   44.762777] [drm] ib test succeeded in 0 usecs
[   44.763943] [drm] No TV DAC info found in BIOS
[   44.764026] [drm] Radeon Display Connectors
[   44.764034] [drm] Connector 0:
[   44.764039] [drm]   VGA-1
[   44.764045] [drm]   DDC: 0x60 0x60 0x60 0x60 0x60 0x60 0x60 0x60
[   44.764054] [drm]   Encoders:
[   44.764059] [drm]     CRT1: INTERNAL_DAC1
[   44.764066] [drm] Connector 1:
[   44.764071] [drm]   VGA-2
[   44.764075] [drm]   DDC: 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c 0x6c
[   44.764119] [drm]   Encoders:
[   44.764124] [drm]     CRT2: INTERNAL_DAC2
[   44.809273] [drm] fb mappable at 0xA8040000
[   44.809283] [drm] vram apper at 0xA8000000
[   44.809289] [drm] size 1572864
[   44.809294] [drm] fb depth is 16
[   44.809300] [drm]    pitch is 2048
[   44.994647] [drm] Initialized radeon 2.50.0 20080528 for 0000:01:03.0 on minor 0
[   44.995428] [drm] [nvidia-drm] [GPU ID 0x00001100] Loading driver
[   44.995625] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:11:00.0 on minor 1
[   44.996569] [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
[   44.996759] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 2
[ 1525.198747] resource sanity check: requesting [mem 0x91700000-0x926fffff], which spans more than PCI Bus 0000:0b [mem 0x91000000-0x91ffffff]
[ 1525.198762] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 1525.253780] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 1525.253893] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 1
[ 1526.050048] resource sanity check: requesting [mem 0x91700000-0x926fffff], which spans more than PCI Bus 0000:0b [mem 0x91000000-0x91ffffff]
[ 1526.050057] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 1526.104605] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 1526.104688] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 1
[ 1527.344396] resource sanity check: requesting [mem 0x90700000-0x916fffff], which spans more than PCI Bus 0000:11 [mem 0x90000000-0x90ffffff]
[ 1527.344406] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 1527.398940] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 1527.399023] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0
[ 1528.192348] resource sanity check: requesting [mem 0x90700000-0x916fffff], which spans more than PCI Bus 0000:11 [mem 0x90000000-0x90ffffff]
[ 1528.192358] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 1528.247464] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 1528.247551] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0
[ 2259.710444] resource sanity check: requesting [mem 0x91700000-0x926fffff], which spans more than PCI Bus 0000:0b [mem 0x91000000-0x91ffffff]
[ 2259.710459] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 2259.766210] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 2259.766323] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 1
[ 2260.570910] resource sanity check: requesting [mem 0x91700000-0x926fffff], which spans more than PCI Bus 0000:0b [mem 0x91000000-0x91ffffff]
[ 2260.570920] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 2260.626208] NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 2260.626292] NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 1
[ 2261.422806] resource sanity check: requesting [mem 0x90700000-0x916fffff], which spans more than PCI Bus 0000:11 [mem 0x90000000-0x90ffffff]
[ 2261.422815] caller os_map_kernel_space.part.0+0x97/0xa0 [nvidia] mapping multiple BARs
[ 2261.478261] NVRM: GPU 0000:11:00.0: RmInitAdapter failed! (0x24:0xffff:1211)
[ 2261.478343] NVRM: GPU 0000:11:00.0: rm_init_adapter failed, device minor number 0
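For what it’s worth, the “spans more than PCI Bus” messages suggest the driver is trying to map a region the firmware never properly assigned. One way to see what the BARs actually got is `lspci -v -s 0b:00.0`. The snippet below only sketches how to pull the size field out of such a line; it uses a hard-coded sample line because the real output depends on the machine (the address and the 16G size here are illustrative, not taken from this system):

```shell
#!/bin/sh
# Sample line in the shape of `lspci -v` BAR output; on the real box
# you would pipe lspci itself instead of this hard-coded string.
sample='Memory at 383fe0000000 (64-bit, prefetchable) [size=16G]'
# Extract the size field from the [size=...] suffix.
printf '%s\n' "$sample" | sed -n 's/.*\[size=\([0-9]*[KMGT]\)\].*/\1/p'
```

If the large prefetchable BAR shows up as unassigned (or with `[virtual]`/`[disabled]` markers) in the real `lspci -v` output, the firmware could not place it, which matches the mapping failures in the log above.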

nvidia-bug-report.log.gz (87.3 KB)

The output of lsb_release is below:

$ lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 21.10
Release:	21.10
Codename:	impish

The output of uname is below:

$ uname -a
Linux nlp002 5.13.0-39-generic #44-Ubuntu SMP Thu Mar 24 15:35:05 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Please enable Above 4G decoding / 64-bit BARs in the BIOS, disable CSM, and reinstall the OS in EFI mode.
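As a side note, you can confirm which mode the OS actually booted in by checking for the EFI sysfs directory (a minimal sketch; the directory argument is parameterized here only so the logic can be demonstrated — on a real system you’d call it with no argument):

```shell
#!/bin/sh
# Standard check: the kernel exposes /sys/firmware/efi only when it
# was booted via UEFI. The optional argument exists just so the logic
# can be exercised without the real sysfs path.
boot_mode() {
  efidir="${1:-/sys/firmware/efi}"
  if [ -d "$efidir" ]; then
    echo "UEFI"
  else
    echo "legacy BIOS"
  fi
}
boot_mode "$@"
```

If this still prints “legacy BIOS” after enabling EFI in the firmware setup, the OS was installed in legacy mode and needs reinstalling.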


Many thanks @generix

Will do.

You can forget about what I said. I checked the manual of that server, and it’s really old: no EFI, plain BIOS. Which wouldn’t hurt by itself, but the BIOS only supports up to 3GB of MMIO space, so it’s incapable of driving any Tesla. Teslas want to map their whole VRAM into the system address space, so each would need 12GB; for 4 cards you’d need 48GB of address space. Wrong server, no dice.
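The arithmetic behind that conclusion, as a quick sketch (the 12GB-per-card figure is the K40’s VRAM size mentioned above):

```shell
#!/bin/sh
# Each K40 wants its full 12 GB of VRAM mapped into the MMIO address
# space; this server's BIOS tops out at 3 GB of MMIO space.
vram_gb_per_card=12
cards=4
bios_mmio_limit_gb=3
needed_gb=$((vram_gb_per_card * cards))
echo "needed: ${needed_gb} GB (BIOS limit: ${bios_mmio_limit_gb} GB)"
```

Even a single card (12GB) already exceeds the 3GB limit, so reducing the card count wouldn’t help either.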


Hah, thanks :-)

To put this straight: it doesn’t depend on EFI; there are also servers with an old BIOS that support this. It’s just that the BIOS of this specific server doesn’t.


Many thanks @generix – what family of GPUs should I look at? I need something that will let me run TensorFlow for machine learning workloads. The lowest CUDA compute capability it needs is 3.5, I think, but I reckon I should look for at least 5.0 support.

Does something like a GeForce GTX 960 also require above-4G MMIO, or will it run on these servers?

Thanks.

Normal graphics cards like GeForce and Quadro products only need 256MB of address space, so this should work.


Once again many thanks!

@generix – could I ask where you found this information? I was looking at another generation of these servers, specifically the HP DL380 Gen8 – they don’t have EFI from what I can tell, but I couldn’t figure out whether they support 4G+ of MMIO space.

Thanks!

See:
https://forums.developer.nvidia.com/t/cuda-and-nvidia-driver-setup-failure-on-centos7/81024

Gen 8 servers should support this.


Awesome thanks!