GPU in chassis not being seen by drivers

We’ve been running two A100 GPUs in our system with no problem. We are now trying to get one of the GPUs to run in an extension chassis. The cards show up in lspci but the one in the chassis has no driver associated with it:

user@syseng-2-dell-hpc:~$ lspci -v -s  25:00.0 
25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
	Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
	Physical Slot: 2-1
	Flags: bus master, fast devsel, latency 0, IRQ 774, NUMA node 0, IOMMU group 29
	Memory at 98000000 (32-bit, non-prefetchable) [size=16M]
	Memory at 1e000000000 (64-bit, prefetchable) [size=128G]
	Memory at 1d000000000 (64-bit, prefetchable) [size=32M]
	Capabilities: <access denied>
	Kernel driver in use: nvidia
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

user@syseng-2-dell-hpc:~$ lspci -v -s  55:00.0 
55:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
	Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
	Flags: bus master, fast devsel, latency 0, IRQ 51, NUMA node 0, IOMMU group 70
	Memory at <ignored> (32-bit, non-prefetchable)
	Memory at <ignored> (64-bit, prefetchable)
	Memory at <ignored> (64-bit, prefetchable)
	Capabilities: <access denied>
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
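
For anyone comparing several cards at once, the check done by eye above can be scripted. This is only a minimal sketch: it assumes you have saved `lspci -v` output as text, and the helper name `devices_without_driver` and the trimmed `SAMPLE` are made up for illustration. It flags any device block that has no "Kernel driver in use" line.

```python
import re

# Matches the first line of an lspci device block, e.g. "55:00.0 3D controller: ..."
# (an optional 4-digit PCI domain prefix such as "0000:" is also accepted).
DEVICE_LINE = re.compile(r"^((?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s")

def devices_without_driver(lspci_text):
    """Return PCI addresses whose `lspci -v` block lacks 'Kernel driver in use'."""
    missing = []
    current = None        # address of the block being scanned
    has_driver = False    # whether that block reported a bound driver
    for line in lspci_text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            if current is not None and not has_driver:
                missing.append(current)
            current = m.group(1)
            has_driver = False
        elif "Kernel driver in use:" in line:
            has_driver = True
    if current is not None and not has_driver:
        missing.append(current)
    return missing

# Trimmed copy of the two blocks above (hypothetical variable name).
SAMPLE = """25:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
\tKernel driver in use: nvidia
\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

55:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
\tMemory at <ignored> (32-bit, non-prefetchable)
\tKernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
"""

if __name__ == "__main__":
    print(devices_without_driver(SAMPLE))
```

On the output pasted above, this reports only the card in the chassis (55:00.0).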

The following is the nvidia-smi output, which lists only the GPU at 25:00.0:

user@syseng-2-dell-hpc:~$ nvidia-smi
Tue Jan 31 16:19:58 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.78.01    Driver Version: 525.78.01    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:25:00.0 Off |                    0 |
| N/A   36C    P0    46W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The following is the dmesg output:

user@syseng-2-dell-hpc:~$ sudo dmesg | grep nvidia
[   28.512680] nvidia: loading out-of-tree module taints kernel.
[   28.512696] nvidia: module license 'NVIDIA' taints kernel.
[   28.533721] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[   28.554023] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[   28.925163] nvidia: probe of 0000:55:00.0 failed with error -1
[   28.957357] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  525.78.01  Mon Dec 26 05:38:56 UTC 2022
[   28.964860] [drm] [nvidia-drm] [GPU ID 0x00002500] Loading driver
[   30.615877] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:25:00.0 on minor 1
[   33.110011] nvidia_uvm: module uses symbols from proprietary module nvidia, inheriting taint.
[   33.122440] nvidia-uvm: Loaded the UVM driver, major device number 507.
[   33.408423] audit: type=1400 audit(1675181119.445:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1898 comm="apparmor_parser"
[   33.408428] audit: type=1400 audit(1675181119.445:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1898 comm="apparmor_parser"
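
The one line that matters in that output is the failed probe. If the log is long, a small filter can pull it out. This is a rough sketch, assuming the dmesg text has been captured to a string; `failed_probes` is a hypothetical helper name, and the pattern only covers the "probe of ... failed with error" wording seen here.

```python
import re

# Matches kernel lines such as:
#   nvidia: probe of 0000:55:00.0 failed with error -1
PROBE_FAIL = re.compile(
    r"([\w-]+): probe of ([0-9a-f:.]+) failed with error (-?\d+)"
)

def failed_probes(dmesg_text):
    """Return (driver, pci_address, error_code) for every failed probe line."""
    return [
        (m.group(1), m.group(2), int(m.group(3)))
        for m in PROBE_FAIL.finditer(dmesg_text)
    ]

# Two lines copied from the dmesg output above (hypothetical variable name).
SAMPLE = (
    "[   28.554023] nvidia-nvlink: Nvlink Core is being initialized, major device number 510\n"
    "[   28.925163] nvidia: probe of 0000:55:00.0 failed with error -1\n"
)

if __name__ == "__main__":
    for driver, address, code in failed_probes(SAMPLE):
        print(driver, address, code)
```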

Any recommendations on what to look at would be greatly appreciated. Please let me know if any other information would be helpful.

Tony

user@syseng-2-dell-hpc:~$ sudo lspci -vvv -s 55:00.0 
55:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 80GB] (rev a1)
	Subsystem: NVIDIA Corporation GA100 [A100 PCIe 80GB]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 51
	NUMA node: 0
	IOMMU group: 70
	Region 0: Memory at <ignored> (32-bit, non-prefetchable)
	Region 1: Memory at <ignored> (64-bit, prefetchable)
	Region 3: Memory at <ignored> (64-bit, prefetchable)
	Capabilities: [60] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [68] Null
	Capabilities: [78] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 25.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x16, ASPM not supported
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 16GT/s (ok), Width x16 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range AB, TimeoutDis+ NROPrPrP- LTR+
			 10BitTagComp+ 10BitTagReq+ OBFF Via message, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-16GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			 EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [c8] MSI-X: Enable- Count=6 Masked-
		Vector table: BAR=0 offset=00b90000
		PBA: BAR=0 offset=00ba0000
	Capabilities: [100 v1] Virtual Channel
		Caps:	LPEVC=0 RefClk=100ns PATEntryBits=1
		Arb:	Fixed- WRR32- WRR64- WRR128-
		Ctrl:	ArbSelect=Fixed
		Status:	InProgress-
		VC0:	Caps:	PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
			Arb:	Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
			Ctrl:	Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
			Status:	NegoPending- InProgress-
	Capabilities: [250 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [258 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			  PortCommonModeRestoreTime=255us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Capabilities: [128 v1] Power Budgeting <?>
	Capabilities: [420 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [bb0 v1] Physical Resizable BAR
		BAR 0: current size: 16MB, supported: 16MB
		BAR 1: current size: 128GB, supported: 64MB 128MB 256MB 512MB 1GB 2GB 4GB 8GB 16GB 32GB 64GB 128GB
		BAR 3: current size: 32MB, supported: 32MB
	Capabilities: [bcc v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable- Migration- Interrupt- MSE- ARIHierarchy+
		IOVSta:	Migration-
		Initial VFs: 20, Total VFs: 20, Number of VFs: 0, Function Dependency Link: 00
		VF offset: 4, stride: 1, Device ID: 20b5
		Supported Page Size: 00000573, System Page Size: 00000001
		Region 1: Memory at 0000000000000000 (64-bit, prefetchable)
		Region 3: Memory at 0000000000000000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [c14 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [c1c v1] Physical Layer 16.0 GT/s <?>
	Capabilities: [d00 v1] Lane Margining at the Receiver <?>
	Capabilities: [e00 v1] Data Link Feature <?>
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

The following is a snippet from dmesg:

[    5.067037] pci 0000:55:00.0: BAR 1: no space for [mem size 0x2000000000 64bit pref]
[    5.067040] pci 0000:55:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]
[    5.067043] pci 0000:55:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[    5.067046] pci 0000:55:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
[    5.067049] pci 0000:55:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    5.067051] pci 0000:55:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    5.067054] pci 0000:55:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[    5.067057] pci 0000:55:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[    5.067060] pci 0000:55:00.0: BAR 0: no space for [mem size 0x01000000]
[    5.067062] pci 0000:55:00.0: BAR 0: failed to assign [mem size 0x01000000]
[    5.067064] pci 0000:55:00.0: BAR 7: no space for [mem size 0x00500000]
[    5.067066] pci 0000:55:00.0: BAR 7: failed to assign [mem size 0x00500000]
[    5.067069] pci 0000:55:00.0: BAR 1: no space for [mem size 0x2000000000 64bit pref]
[    5.067072] pci 0000:55:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]
[    5.067075] pci 0000:55:00.0: BAR 3: no space for [mem size 0x02000000 64bit pref]
[    5.067078] pci 0000:55:00.0: BAR 3: failed to assign [mem size 0x02000000 64bit pref]
[    5.067080] pci 0000:55:00.0: BAR 0: no space for [mem size 0x01000000]
[    5.067083] pci 0000:55:00.0: BAR 0: failed to assign [mem size 0x01000000]
[    5.067085] pci 0000:55:00.0: BAR 7: no space for [mem size 0x00500000]
[    5.067087] pci 0000:55:00.0: BAR 7: failed to assign [mem size 0x00500000]
[    5.067090] pci 0000:55:00.0: BAR 10: no space for [mem size 0x28000000 64bit pref]
[    5.067092] pci 0000:55:00.0: BAR 10: failed to assign [mem size 0x28000000 64bit pref]
[    5.067095] pci 0000:55:00.0: BAR 8: no space for [mem size 0x1400000000 64bit pref]
[    5.067098] pci 0000:55:00.0: BAR 8: failed to assign [mem size 0x1400000000 64bit pref]
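
To make the sizes in that snippet easier to read, here is a minimal sketch that collects the "failed to assign" lines and converts the hex sizes to binary units. It assumes the dmesg text is available as a string; `failed_bars`, `human`, and `SAMPLE` are hypothetical names. The kernel retries the assignment, so each BAR appears more than once and the sketch keeps one entry per BAR.

```python
import re

# Matches kernel lines such as:
#   pci 0000:55:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]
BAR_FAIL = re.compile(
    r"pci (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): "
    r"failed to assign \[mem size (?P<size>0x[0-9a-f]+)"
)

def human(n):
    """Format a byte count using binary units (KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:g} {unit}"
        n /= 1024

def failed_bars(dmesg_text):
    """Map (device, BAR index) -> size in bytes; repeated attempts collapse."""
    result = {}
    for m in BAR_FAIL.finditer(dmesg_text):
        result[(m["dev"], int(m["bar"]))] = int(m["size"], 16)
    return result

# Three lines copied from the snippet above (hypothetical variable name).
SAMPLE = (
    "[    5.067040] pci 0000:55:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]\n"
    "[    5.067062] pci 0000:55:00.0: BAR 0: failed to assign [mem size 0x01000000]\n"
    "[    5.067072] pci 0000:55:00.0: BAR 1: failed to assign [mem size 0x2000000000 64bit pref]\n"
)

if __name__ == "__main__":
    for (dev, bar), size in sorted(failed_bars(SAMPLE).items()):
        print(f"{dev} BAR {bar}: {human(size)}")
```

Read this way, BAR 1 is the 128 GiB prefetchable window that the Resizable BAR capability reports in the lspci -vvv output, and BAR 0 is the 16 MiB region. BARs 7 through 10 are likely the SR-IOV virtual-function regions, though that is an assumption based on how Linux numbers device resources.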

Okay, it looks like the following solved my problem:

You can swap the two GPUs. If nvidia-smi then shows the Tesla but not the GT710, GPU initialization is probably being limited by the BIOS configuration: there is not enough memory-mapped address space to support both cards.
