RTX 2070 as eGPU not recognized by nvidia-settings on ThinkPad P70

Hello,

I am trying to get an RTX 2070 running over Thunderbolt in a Sonnet eGFX Breakaway Box, connected to a ThinkPad P70. The card and the box work fine on a ThinkPad P52s under Ubuntu 18.04 (NVIDIA driver 418.x, CUDA 10.0), so the problem appears to be specific to the P70.

I do not need graphics on the RTX 2070, just CUDA.

The P70 is also running Ubuntu 18.04 and has an M3000M installed, which works fine with drivers 410.x and 418.x.

The kernel appears to recognize that the device is connected. Nouveau is blacklisted via /etc/modprobe.d/blacklist-nvidia-nouveau.conf, whose contents are:

blacklist nouveau
options nouveau modeset=0

Some output relevant to the RTX 2070 with lspci -v is:

0a:00.0 VGA compatible controller: NVIDIA Corporation Device 1f07 (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. Device 2172
	Flags: bus master, fast devsel, latency 0, IRQ 18
	Memory at b4000000 (32-bit, non-prefetchable) 
	Memory at <unassigned> (64-bit, prefetchable)
	Memory at a0000000 (64-bit, prefetchable) 
	I/O ports at 2000 
	[virtual] Expansion ROM at b5080000 [disabled] 
	Capabilities: [60] Power Management version 3
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Legacy Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [250] Latency Tolerance Reporting
	Capabilities: [258] L1 PM Substates
	Capabilities: [128] Power Budgeting <?>
	Capabilities: [420] Advanced Error Reporting
	Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
	Capabilities: [900] #19
	Capabilities: [bb0] #15
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Though the device is recognized, there is no kernel driver in use.
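
One way to confirm the missing binding without reading lspci output by eye is to look at sysfs, where a device directory contains a "driver" symlink only while a kernel driver is bound. The following is a minimal sketch, assuming the PCI address 0000:0a:00.0 from the lspci output above; the function name bound_driver and the sysfs_root parameter are hypothetical and exist only for illustration.

```python
import os


def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the kernel driver bound to a PCI device, or None.

    pci_address is the full domain:bus:device.function string,
    e.g. "0000:0a:00.0". sysfs_root is a parameter only so the
    function can be pointed at a test directory.
    """
    # sysfs exposes a 'driver' symlink only while a driver is bound.
    link = os.path.join(sysfs_root, pci_address, "driver")
    if not os.path.islink(link):
        return None
    # The symlink target ends in the driver's name, e.g. ".../drivers/nvidia".
    return os.path.basename(os.readlink(link))
```

On the P70 described here, bound_driver("0000:0a:00.0") would be expected to return None, matching the absence of a "Kernel driver in use" line in the lspci output.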

The output of lshw -c display is:

*-display UNCLAIMED
       description: VGA compatible controller
       product: NVIDIA Corporation
       vendor: NVIDIA Corporation
       physical id: 0
       bus info: pci@0000:0a:00.0
       version: a1
       width: 64 bits
       clock: 33MHz
       capabilities: pm msi pciexpress vga_controller bus_master cap_list
       configuration: latency=0
       resources: memory:b4000000-b4ffffff memory:a0000000-a1ffffff ioport:2000(size=128) memory:b5080000-b50fffff
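
The resources line lists only three memory ranges, while lspci reports one 64-bit prefetchable region as <unassigned>. A small sketch that converts the inclusive ranges into sizes makes it easier to see which regions were actually mapped. The helper name memory_range_sizes is hypothetical, and the parsing assumes the "memory:START-END" format shown in the lshw output above.

```python
import re


def memory_range_sizes(resources):
    """Return (start, size_in_bytes) for each memory:START-END entry."""
    sizes = []
    for start_hex, end_hex in re.findall(
        r"memory:([0-9a-f]+)-([0-9a-f]+)", resources
    ):
        start, end = int(start_hex, 16), int(end_hex, 16)
        # lshw prints inclusive ranges, so the size is end - start + 1.
        sizes.append((start, end - start + 1))
    return sizes


resources = (
    "memory:b4000000-b4ffffff memory:a0000000-a1ffffff "
    "ioport:2000(size=128) memory:b5080000-b50fffff"
)
for start, size in memory_range_sizes(resources):
    print(f"{start:#x}: {size // 1024} KiB")
```

For the line above this gives regions of 16 MiB, 32 MiB and 512 KiB; none of them corresponds to the region that lspci shows as unassigned.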

It appears that the card is being recognized, just not attached to a driver. Any help getting the driver to recognize the external RTX 2070 would be greatly appreciated.

nvidia-bug-report.log.gz (1.04 MB)

I tried changing the Xorg configuration file /usr/share/X11/xorg.conf.d/10-nvidia.conf as follows:

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowEmptyInitialConfiguration"
    Option "AllowExternalGpus" "true"
    ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg"
EndSection

but nothing was different.

Please run nvidia-bug-report.sh as root and attach the resulting .gz file to your post. Hovering the mouse over an existing post of yours will reveal a paperclip icon.
https://devtalk.nvidia.com/default/topic/1043347/announcements/attaching-files-to-forum-topics-posts/

The requested file has been attached, thank you.

You’re running into this:

[   18.507215] pci 0000:0a:00.0: BAR 1: no space for [mem size 0x10000000 64bit pref]
[   18.507217] pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]

The kernel fails to properly assign a memory region, so the driver doesn’t load:

Apr 15 12:40:38 thinkpad kernel: [   18.132460] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
Apr 15 12:40:38 thinkpad kernel: [   18.132460] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:0a:00.0)
Apr 15 12:40:38 thinkpad kernel: [   18.132460] NVRM: The system BIOS may have misconfigured your GPU.

This is a kernel bug that was introduced at some point and has been showing up more often lately. Unfortunately, I don’t know of a real fix; you can only test whether downgrading the kernel (e.g. to 4.14) or upgrading it helps. Checking whether a BIOS update helps doesn’t hurt either.
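
When testing different kernels, it can help to scan a saved kernel log for the same allocation failure instead of searching by hand. Below is a rough sketch that assumes the message format shown in the log excerpt above ("BAR N: failed to assign [mem size 0x...]"); the names BAR_FAILURE and find_bar_failures are hypothetical, and only memory BARs are matched.

```python
import re

# Matches lines such as:
# pci 0000:0a:00.0: BAR 1: failed to assign [mem size 0x10000000 64bit pref]
BAR_FAILURE = re.compile(
    r"pci (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"BAR (?P<bar>\d+): failed to assign "
    r"\[mem size (?P<size>0x[0-9a-f]+)"
)


def find_bar_failures(log_text):
    """Return (pci_address, bar_index, size_in_MiB) per failed assignment."""
    failures = []
    for match in BAR_FAILURE.finditer(log_text):
        size_mib = int(match.group("size"), 16) // (1024 * 1024)
        failures.append((match.group("addr"), int(match.group("bar")), size_mib))
    return failures
```

Applied to the two pci lines quoted above, this reports a single failure for 0000:0a:00.0, BAR 1, with a requested size of 256 MiB. An empty result after booting a different kernel would suggest that the region was assigned.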

Thanks for that.

I’ve updated the kernel from 4.15.something to 4.19.34 and made sure my BIOS is up to date, but the message persisted in both cases.

I understand that there might be a Thunderbolt firmware update that hasn’t been applied. Could that be the cause of these issues? As far as I know, updating the Thunderbolt firmware on my machine is a major hassle, as I believe I would need to acquire and install Windows to do so …

It’s unlikely to be Thunderbolt-related. The only thing left to try is an earlier kernel.