Tesla P40 in Dell Precision 7910 rack

mrenzyme · February 2, 2024, 2:09am

I recently started exploring OpenAI Whisper. I quickly found that my simple PC did not have the resources to run medium/large ASR models. So I turned to eBay and picked up a Dell 7910 (rebadged R730) and a Tesla P40 24GB.

The issue: I cannot get the NVIDIA data center drivers to install properly on either Ubuntu or Windows.

I followed this video:

I am running Ubuntu 22.04 and have tried both nvidia-driver-535 and nvidia-driver-470, but entering “nvidia-smi” results in “No device found”. The server is detecting the GPU:

lspci | grep -i nvidia
83:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)

and

sudo lshw -c video
*-display                 
   description: VGA compatible controller
   product: G200eR2
   vendor: Matrox Electronics Systems Ltd.
   physical id: 0
   bus info: pci@0000:0b:00.0
   logical name: /dev/fb0
   version: 01
   width: 32 bits
   clock: 33MHz
   capabilities: pm vga_controller bus_master cap_list rom fb
   configuration: depth=32 driver=mgag200 latency=64 maxlatency=32 mingnt=16 resolution=1600,1200
   resources: irq:19 memory:90000000-90ffffff memory:91800000-91803fff memory:91000000-917fffff memory:c0000-dffff
*-display
   description: 3D controller
   product: GP102GL [Tesla P40]
   vendor: NVIDIA Corporation
   physical id: 0
   bus info: pci@0000:83:00.0
   version: a1
   width: 64 bits
   clock: 33MHz
   capabilities: pm msi pciexpress bus_master cap_list
   configuration: driver=nvidia latency=0
   resources: iomemory:3f00-3eff iomemory:3f80-3f7f irq:116 memory:c8000000-c8ffffff

I have also tried this under Windows 11. I installed the data center drivers 538.15 as well as 474.64. The server detects the GPU (GPU-Z shows all the correct information), but Device Manager shows that the driver failed to start: “Insufficient System Resources Exist to Complete the API”.

I searched the NVIDIA forums and there didn’t appear to be any resolution. I don’t intend to run any VMs or any virtualization on this GPU; I only want to perform CUDA calculations using PyTorch.
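
For anyone scripting the detection check above, here is a minimal Python sketch that picks NVIDIA entries out of `lspci` text. The helper name `find_nvidia_devices` is hypothetical, and the sample string is simply the line quoted in this post; on a real system the text would have to come from running `lspci` yourself. A match only means the card is visible on the PCI bus, not that the driver has initialized it.

```python
import re

# Matches lines such as:
# 83:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
LSPCI_LINE = re.compile(
    r"^(?P<slot>[0-9a-fA-F:.]+)\s+(?P<cls>[^:]+):\s+(?P<name>.*NVIDIA.*)$",
    re.IGNORECASE,
)

def find_nvidia_devices(lspci_text):
    """Return (slot, device class, description) per NVIDIA line (hypothetical helper)."""
    devices = []
    for line in lspci_text.splitlines():
        match = LSPCI_LINE.match(line.strip())
        if match:
            devices.append((match["slot"], match["cls"], match["name"]))
    return devices

# Sample taken from the lspci output quoted in this post.
SAMPLE = "83:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)"

for slot, cls, name in find_nvidia_devices(SAMPLE):
    print(f"{slot}: {cls} -> {name}")
```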

rs277 · February 2, 2024, 4:20am

If syslog contains the same BAR1 errors as the OP quoted here, you are probably suffering from a BIOS that either doesn’t support the BAR requirements (Table 3) or is incorrectly configured.

Robert_Crovella · February 2, 2024, 2:21pm

this may be of interest

mrenzyme · February 3, 2024, 1:45am

rs277,
I did check to see if my “Memory Mapped I/O above 4GB” was enabled:

(I did try to enable the “Lower Memory Mapped I/O Base to 512GB” as well)

Robert_Crovella,

sudo lspci -vvv
83:00.0 3D controller: NVIDIA Corporation GP102GL [Tesla P40] (rev a1)
	Subsystem: NVIDIA Corporation GP102GL [Tesla P40]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 116
	NUMA node: 1
	Region 0: Memory at c8000000 (32-bit, non-prefetchable) [size=16M]
	Region 1: Memory at f000000000 (64-bit, prefetchable) [size=32G]
	Region 3: Memory at f800000000 (64-bit, prefetchable) [size=32M]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

It looks like all the regions are assigned.

nvidia-smi still results in “No device found”
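
The BAR check above can also be done mechanically. This is a rough Python sketch, assuming the `Region N: Memory at ... [size=...]` layout shown in the `lspci -vvv` output in this post; `parse_regions` is a hypothetical helper, not part of any NVIDIA or pciutils tooling. It only lists regions that were given a hexadecimal address, so a region without one would simply be missing from the result.

```python
import re

REGION = re.compile(
    r"Region (?P<index>\d+): Memory at (?P<addr>[0-9a-f]+) "
    r"\((?P<width>\d+)-bit, (?P<kind>[a-z-]+)\) \[size=(?P<size>\d+)(?P<unit>[KMG])\]"
)
UNIT_BYTES = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_regions(lspci_verbose):
    """Return one dict per assigned memory region found in the text (hypothetical helper)."""
    regions = []
    for m in REGION.finditer(lspci_verbose):
        regions.append({
            "index": int(m["index"]),
            "address": int(m["addr"], 16),
            "bits": int(m["width"]),
            "prefetchable": m["kind"] == "prefetchable",
            "size_bytes": int(m["size"]) * UNIT_BYTES[m["unit"]],
        })
    return regions

# Region lines copied from the output quoted in this post.
SAMPLE = """\
    Region 0: Memory at c8000000 (32-bit, non-prefetchable) [size=16M]
    Region 1: Memory at f000000000 (64-bit, prefetchable) [size=32G]
    Region 3: Memory at f800000000 (64-bit, prefetchable) [size=32M]
"""

for region in parse_regions(SAMPLE):
    where = "above 4 GiB" if region["address"] >= 1 << 32 else "below 4 GiB"
    print(f"Region {region['index']}: {region['size_bytes'] // (1 << 20)} MiB, {where}")
```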

rs277 · February 3, 2024, 3:16am

The lspci seems to show BARs allocated OK. What about driver related errors in the syslog?

mrenzyme · February 3, 2024, 2:25pm

rs277,

Sorry, Linux noob here (I had to search how to view the syslog). I filtered all the relevant “nvidia” lines from the syslog; I can post more from the syslog if you think I missed something:

journalctl|grep "nvidia"
Feb 03 08:44:53 bze-server kernel: nvidia: loading out-of-tree module taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia: module license 'NVIDIA' taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Feb 03 08:44:53 bze-server kernel: nvidia: module license taints kernel.
Feb 03 08:44:53 bze-server kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Feb 03 08:44:53 bze-server kernel: audit: type=1400 audit(1706967893.231:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server kernel: audit: type=1400 audit(1706967893.231:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server audit[772]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server audit[772]: AVC apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=772 comm="apparmor_parser"
Feb 03 08:44:53 bze-server kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  535.154.05  Thu Dec 28 15:51:29 UTC 2023
Feb 03 08:44:53 bze-server kernel: [drm] [nvidia-drm] [GPU ID 0x00008300] Loading driver
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to allocate NvKmsKapiDevice
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to register device
Feb 03 08:44:53 bze-server kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Feb 03 08:44:53 bze-server kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Verbose syslog connection opened
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Now running with user ID 129 and group ID 137
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Started (1034)
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: device 0000:83:00.0 - registered
Feb 03 08:44:53 bze-server nvidia-persistenced[1034]: Local RPC services initialized
Feb 03 08:45:09 bze-server nvidia-settings-autostart.desktop[2766]: ERROR: A supplied argument is invalid
Feb 03 08:53:28 bze-server sudo[6671]:    brett : TTY=pts/0 ; PWD=/home/brett ; USER=root ; COMMAND=/usr/bin/nvidia-smi

I think the important ERROR lines are:

Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to allocate NvKmsKapiDevice
Feb 03 08:44:53 bze-server kernel: [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00008300] Failed to register device

and

Feb 03 08:45:09 bze-server nvidia-settings-autostart.desktop[2766]: ERROR: A supplied argument is invalid

Not sure how to fix the first two, but the desktop entry referenced in the third error is

cat /etc/xdg/autostart/nvidia-settings-autostart.desktop
[Desktop Entry]
Type=Application
Encoding=UTF-8
Name=NVIDIA X Server Settings
Comment=Configure NVIDIA X Server Settings
Exec=sh -c '/usr/bin/nvidia-settings --load-config-only'
Terminal=false
Icon=nvidia-settings
Categories=System;Settings;

Just running sh -c '/usr/bin/nvidia-settings' pops open a nice-looking NVIDIA GUI, but I get the following errors in the terminal:

sh -c '/usr/bin/nvidia-settings'

ERROR: A supplied argument is invalid


(nvidia-settings:11033): GLib-GObject-CRITICAL **: 09:18:04.716: g_object_unref: assertion 'G_IS_OBJECT (object)' failed

** (nvidia-settings:11033): CRITICAL **: 09:18:04.718: ctk_powermode_new: assertion '(ctrl_target != NULL) && (ctrl_target->h != NULL)' failed

ERROR: nvidia-settings could not find the registry key file or the X server is not
       accessible. This file should have been installed along with this driver at
       /usr/share/nvidia/nvidia-application-profiles-key-documentation. The
       application profiles will continue to work, but values cannot be
       prepopulated or validated, and will not be listed in the help text. Please
       see the README for possible values and descriptions.

** Message: 09:18:04.767: PRIME: No offloading required. Abort
** Message: 09:18:04.767: PRIME: is it supported? no

So I am stuck again. I really appreciate your help, as it seems this has been asked several times before. I assumed that since someone was able to get the P40 running in an R720, I would be able to get it working in an enterprise server one generation newer.

I bought both the server and the P40 used… separately. Since Tesla cards are harder to debug, as one cannot simply plug in a display to see if the card is working, is there any possibility that the card is being detected but the driver fails to load because of physical damage?

Robert_Crovella · February 3, 2024, 3:03pm

what is the result of dmesg |grep NVRM ?

I would start over with a fresh load of Ubuntu 22.04, then load the GPU driver using the CUDA runfile installer. Things get a little complicated after CUDA 12.2 so if it were me I would use the CUDA runfile installer for CUDA 12.2 for diagnostic purposes/initial test.

I wouldn’t install X support or anything like that as an initial diagnostic.

Also, just because the 7910 “sees” the GPU via lspci, does not mean it recognizes it or knows how to keep it cool from a SBIOS perspective. Before doing the fresh OS load, I would make sure the 7910 has the latest BIOS available from Dell installed. You may need to provide a cooling mechanism for the card depending on what shrouding is in that server. If you have proper shrouding it may be sufficient to force fans to maximum if the server has such a setting. Have you provided the necessary aux power to the GPU?

Do as you wish, of course.

mrenzyme · February 5, 2024, 1:32am

dmesg |grep NVRM

Wow, that is very informative!!!

[   63.223468] NVRM: GPU 0000:83:00.0: GPU does not have the necessary power cables connected.
[   63.223820] NVRM: GPU 0000:83:00.0: RmInitAdapter failed! (0x24:0x1c:1436)
[   63.223859] NVRM: GPU 0000:83:00.0: rm_init_adapter failed, device minor number 0

So either:

1. I have the wrong power cable. I am using a PCIe-to-12VEPS cable labeled K80 to R730 (PCIE GPU 8Pin to 8Pin Power Cable For DELL R730 to Nvidia K80/M40/M60/P40/P100 | eBay). This cable has the same connector on each end, but the two ends are wired differently. I took care to plug the connector with the four yellow wires in a row into the GPU and the other into the PCIe riser (see image).
2. The power cable is fine and the power input stage is damaged (assuming onboard fuses).

As for your other concerns: on Saturday morning I wiped the HDD and installed a fresh 22.04. The server still has all the OEM shrouds. Dell does this cool (lol) thing where it detects a GPU on the PCI bus and ramps the fans to 50%… it’s like a hurricane blowing out the back of the GPU!!!

I am having trouble installing the CUDA toolkit from the .run file. I was getting kernel errors. I will report back tomorrow, but the dmesg output seems more like the smoking gun!

Do as you wish, of course.

I really appreciate your help!!
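
For reference, the `NVRM` lines quoted in this post can be pulled apart with a few lines of Python. This is only a sketch: it assumes the `[timestamp] NVRM: GPU <pci address>: <message>` layout seen above, and `parse_nvrm` and `power_cable_complaints` are hypothetical helper names. The substring it looks for is taken from the message shown here, not from any documented list of driver messages.

```python
import re

NVRM_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+NVRM: GPU (?P<pci>[0-9a-fA-F:.]+): (?P<msg>.*)$"
)

def parse_nvrm(dmesg_text):
    """Return (seconds since boot, PCI address, message) for each NVRM GPU line."""
    events = []
    for line in dmesg_text.splitlines():
        m = NVRM_LINE.match(line.strip())
        if m:
            events.append((float(m["ts"]), m["pci"], m["msg"]))
    return events

def power_cable_complaints(events):
    """Keep only the events whose message mentions power cables."""
    return [event for event in events if "power cables" in event[2]]

# Lines copied from the dmesg output quoted in this post.
SAMPLE = """\
[   63.223468] NVRM: GPU 0000:83:00.0: GPU does not have the necessary power cables connected.
[   63.223820] NVRM: GPU 0000:83:00.0: RmInitAdapter failed! (0x24:0x1c:1436)
[   63.223859] NVRM: GPU 0000:83:00.0: rm_init_adapter failed, device minor number 0
"""

for ts, pci, msg in power_cable_complaints(parse_nvrm(SAMPLE)):
    print(f"{pci} at {ts:.3f}s: {msg}")
```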

njuffa · February 5, 2024, 2:13am

The 8-pin PCIe power connector is rated for 150W, while the 12VEPS connector is rated for 336W. According to the specs the P40 has a power draw of 250W. The PCIe slot (according to the spec, it can supply up to 75W, but most NVIDIA GPUs are designed to draw < 40W through the slot) plus the 8-pin PCIe cable (<= 150W) can supply at most 225W on paper, which already falls short of that. Power spikes of short duration are also common with modern GPUs, so in practice that is not something you would want to do.

I have not read up on the details of 12VEPS connectors. They may include keying (square vs rounded shrouding) and have voltages assigned to pins differently than the PCIe 8-pin connector. The combination of these features likely allows the GPU to sense whether there is a genuine 12VEPS power cable attached, which is not the case in your setup. The GPU probably also monitors its supply voltage, and a power draw of > 150W on a PSU output designed for 150W max likely leads to voltage drops (“brown-outs”) on the PCIe auxiliary power connection.

Generally speaking, for reliability the use of adapters, splitters, and daisy-chaining in PCIe auxiliary power for GPUs should be avoided. “Up-conversion”, e.g. 6-pin PCIe to 8-pin PCIe or 8-pin PCIe to 12VEPS, should most definitely be avoided as it violates the electrical specifications. Use a power supply that provides the correct 12VEPS cable.

Note: The use of PCIe riser boards is sub-optimal as passive intermediate connectors tend to degrade PCIe signal quality. If this is a Dell-provided riser card you are probably OK.

mrenzyme · February 16, 2024, 2:30am

Sorry for the delay.

The 8-pin PCIe power connector is rated for 150W, while the 12VEPS connector is rated for 336W
The use of PCIe riser boards is sub-optimal as passive intermediate connectors

Just to clear things up: the “risers” I am referring to are the OEM Dell PCIe risers that provide two x16 slots and a 250W power connector. The PCIe slot provides 75W, so the P40 has up to 325W available, which is maybe not the best under peak load.
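
The power arithmetic in this exchange is easy to tabulate. The sketch below uses only the nominal figures quoted in the thread (75W slot, 150W 8-pin PCIe, 250W riser connector, 336W 12VEPS, 250W board power); `headroom` is a hypothetical helper, and the numbers are paper limits that say nothing about short power spikes or how much a given card actually draws through the slot.

```python
P40_BOARD_POWER_W = 250  # board power quoted in the thread
SLOT_LIMIT_W = 75        # PCIe slot limit quoted in the thread

def headroom(aux_limit_w, board_power_w=P40_BOARD_POWER_W, slot_limit_w=SLOT_LIMIT_W):
    """Return (nominal watts available, margin over board power) for one aux connector."""
    available = slot_limit_w + aux_limit_w
    return available, available - board_power_w

# Auxiliary connector limits as stated by the posters; treat them as assumptions.
SCENARIOS = [
    ("8-pin PCIe", 150),
    ("Dell riser connector", 250),
    ("12VEPS", 336),
]

for label, aux_limit in SCENARIOS:
    available, margin = headroom(aux_limit)
    print(f"{label}: {available}W available, {margin:+d}W margin")
```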

Use a power supply that provides the correct 12VEPS cable

My two PSUs slide directly into the server mobo… no cables needed/available. The cable that I was using is an OEM Dell R730-to-K80 GPU power cable.

I bought a second P40 just to be sure that the eBay card was not faulty… surprise, surprise, the second card didn’t work either.

The solution
I bought the power adapter cables listed in the video and both P40s work. The drivers load and PyTorch recognizes them. I ran both simultaneously for about 3 hours today and they reached a peak temperature of ~50C.
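
One way to confirm that PyTorch sees the cards is sketched below. It is not necessarily the check used in this post; it assumes a PyTorch build with CUDA support and falls back to an empty list if PyTorch is missing or no device is visible. `list_cuda_devices` is a hypothetical helper name.

```python
try:
    import torch  # assumed to be a CUDA-enabled build
except ImportError:
    torch = None

def list_cuda_devices():
    """Return the names of the CUDA devices PyTorch can see (empty list if none)."""
    if torch is None or not torch.cuda.is_available():
        return []
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]

names = list_cuda_devices()
if names:
    for index, name in enumerate(names):
        print(f"cuda:{index} -> {name}")
else:
    print("PyTorch does not see any CUDA device")
```

With two working cards installed, the loop should print one line per GPU.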

