I have successfully set up PRIME render offload on my desktop system, despite using a distro I was not yet familiar with (Gentoo) and the fact that both cards are NVIDIA (so no modesetting driver is involved). A GT 1030 drives the monitors, and a GTX 1660 Ti is intended for Steam and other games.
However, the performance is really not good. Using glxgears makes this clear.
The following is for the GT 1030.
$ __GL_SYNC_TO_VBLANK=0 glxgears
98089 frames in 5.0 seconds = 19617.654 FPS
98946 frames in 5.0 seconds = 19789.189 FPS
99055 frames in 5.0 seconds = 19810.928 FPS
And the following is for the GTX 1660 Ti.
$ __NV_PRIME_RENDER_OFFLOAD=1 __GL_SYNC_TO_VBLANK=0 glxgears
13003 frames in 5.0 seconds = 2600.484 FPS
12234 frames in 5.0 seconds = 2446.688 FPS
12359 frames in 5.0 seconds = 2471.798 FPS
My expectation was that the GTX 1660 Ti would produce a higher FPS.
I am aware that glxgears is not an appropriate benchmark, so I will add that the built-in benchmark in Deus Ex: Mankind Divided averages 28 FPS at 1920x1200 with Medium settings. I’m told it should be around 90 FPS.
nvidia-bug-report.log.gz (86.6 KB)
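(For what it’s worth, I believe the usual way to confirm which GPU an offloaded application actually renders on is to check the renderer string, along these lines; the output should name the 1660 Ti:)
$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep "OpenGL renderer"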
Just attempted the environment variables you suggested. The results are below.
$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
7411 frames in 5.0 seconds = 1482.076 FPS
5240 frames in 5.0 seconds = 1047.898 FPS
5448 frames in 5.0 seconds = 1088.962 FPS
I also attempted it with sync-to-VBLANK disabled.
__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only __GL_SYNC_TO_VBLANK=0 glxgears
6713 frames in 5.0 seconds = 1342.317 FPS
5970 frames in 5.0 seconds = 1193.964 FPS
6467 frames in 5.0 seconds = 1292.364 FPS
As you can see, no improvement.
And with regard to the Deus Ex FPS, I was running it on the GTX 1660 Ti, so a noticeably higher framerate would be reasonable to expect.
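(To avoid typing these variables each time, a small wrapper along the lines of the prime-run script some distros ship can help; this is just a sketch using the same variables as above:)
#!/bin/sh
# Hypothetical offload wrapper: exports the PRIME offload variables used above,
# then runs whatever command is passed to it (e.g. ./prime-run glxgears).
export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia
export __VK_LAYER_NV_optimus=NVIDIA_only
exec "$@"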
First of all, you can’t use glxgears on a PRIME setup to show anything. The framerate will always look very low because the frame-copy overhead dominates at very high FPS, so the results are completely useless.
On PRIME, always use some full-screen game or a Unigine demo.
Second, I don’t know whether NVIDIA-to-NVIDIA offloading works at all, or whether additional settings have to be made to specify the offload target.
As a first measure, please install nvidia-prime, then use it to check whether glxgears/Deus Ex is running on the 1660 at all.
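Alternatively, a quick sanity check with plain nvidia-smi should work too; while the game is running, its process should show up under the 1660 Ti’s entry (exact output layout may differ by driver version):
$ watch -n1 nvidia-smi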
My DE is LXDE. I like having something lightweight.
I ran nvidia-smi -q during the Deus Ex MD benchmark, once before adding the “iommu=off” kernel parameter and once after. No significant difference.
Below are the TX and RX results from before adding the “iommu=off” parameter (slightly edited for readability).
sleep 60 && nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Mon Apr 12 10:27:58 2021
Driver Version : 460.56
CUDA Version : 11.2
Attached GPUs : 2
GPU 00000000:04:00.0
Product Name : GeForce GT 1030
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled
…
PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x1D0110DE
Bus Id : 00000000:04:00.0
Sub System Id : 0x8C981462
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 4x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 526000 KB/s
…
GPU 00000000:0C:00.0
Product Name : GeForce GTX 1660 Ti
Product Brand : GeForce RTX
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
…
PCI
Bus : 0x0C
Device : 0x00
Domain : 0x0000
Device Id : 0x218210DE
Bus Id : 00000000:0C:00.0
Sub System Id : 0x3FBE1458
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 511000 KB/s
Rx Throughput : 404000 KB/s
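(In case it’s useful, another way to watch the copy traffic continuously during a run would presumably be nvidia-smi’s device-monitoring mode; I’m noting it as an option rather than something I relied on here:)
$ nvidia-smi dmon -s t    # -s t selects the PCIe Rx/Tx throughput columns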
LXDE uses Openbox as its WM, so that shouldn’t have any effect on performance.
Just to rule out that this is Deus Ex-specific, please try an Unigine demo or another demanding game.
Installed the Unigine Heaven demo using Phoronix Test Suite and ran it. I also confirmed it was running on the 1660.
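(For anyone who wants to reproduce this, the PTS invocation should be roughly the following, assuming the test profile name hasn’t changed:)
$ phoronix-test-suite benchmark unigine-heaven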
Here are the results.
Unigine Heaven 4.0:
pts/unigine-heaven-1.6.5 [Resolution: 1920 x 1200 - Mode: Fullscreen - Renderer: OpenGL]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 17 Minutes [13:00 EDT]
Started Run 1 @ 12:44:06
Started Run 2 @ 12:48:43
Started Run 3 @ 12:53:18
EDIT (4/12/2021 13:35 EDT)
What. The. Frick.
Out of curiosity I started investigating the PCIe lanes of my motherboard. And look what I found.
dmesg | grep PCIe
[ 0.777846] acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME LTR]
[ 0.778003] acpi PNP0A08:00: _OSC: OS now controls [AER PCIeCapability]
[ 0.781639] pci 0000:02:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 126.024 Gb/s with 16 GT/s x8 link)
[ 0.786586] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x4 link at 0000:03:00.0 (capable of 31.504 Gb/s with 8 GT/s x4 link)
[ 0.788221] pci 0000:06:00.0: 31.506 Gb/s available PCIe bandwidth, limited by 16 GT/s x2 link at 0000:03:02.0 (capable of 63.012 Gb/s with 16 GT/s x4 link)
[ 0.790776] pci 0000:09:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.793545] pci 0000:0a:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.793983] pci 0000:0b:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.794367] pci 0000:0c:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8 GT/s x16 link)
[ 1.322708] igb 0000:07:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 9c:5c:8e:bc:22:ba
[ 3.841312] nvidia: unknown parameter 'Vreg_EnablePCIeGen3' ignored
In case you don’t see the issue: both of the slots hosting the GPUs are operating at PCIe 1.0 speeds, and this motherboard is capable of PCIe 4.0. GGAAAHHH!!!
And to rub salt in the wound, the nvidia driver did not recognize the PCIe 3.0 parameter.
This appears to be outside the context of NVIDIA and its products, so if I disappear for a while, it is because I am wrestling with either the motherboard vendor or digging up more dark secrets of my OS.
You’re misinterpreting the PCIe speed display. When the NVIDIA GPU throttles down its clocks, it also throttles down the PCIe link speed. That’s why nvidia-smi also reports it. While the GPU is in use:
PCIe Generation
Max : 3
Current : 3
and while idle:
Current : 1
So there’s nothing wrong with your board.
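If you want to double-check, you can watch the link renegotiate while a load is running, e.g. with something like this (field names assuming a reasonably recent nvidia-smi):
$ nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current --format=csv -l 1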
I didn’t recognize this parameter, and I don’t have it on my system either. Google finds zero entries about it; this module parameter does not seem to exist.
It is apparently supposed to be NVreg_EnablePCIeGen3 and included in the /etc/modprobe.d/nvidia.conf file.
It is listed on the Gentoo Wiki page.
I am aware the page is likely outdated.
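For reference, the entry the wiki suggests for /etc/modprobe.d/nvidia.conf would presumably be along these lines (I’m quoting the idea, not vouching for it):
options nvidia NVreg_EnablePCIeGen3=1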
The description of that parameter on the wiki page is wrong.
Some early PCIe Gen3 chipsets were not working properly, so the nvidia driver contains a blacklist for them; the parameter overrides that blacklist. So unless you’re running this on a 10-year-old mainboard with a broken chipset, the parameter does nothing.
I have a new lead.
After fiddling with BIOS settings, kernel compiles, and running ‘sleep 60 && nvidia-smi -q’ during benchmarks, I spotted this for both the 1030 and the 1660.
Performance State : P0
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
I checked prior tests and it was the same before.
Now, if I understand performance states correctly, P0 means the card is at full power, which makes sense because this is in the middle of a benchmark. And the same printout shows the 1660’s clocks at, or near, their maximum. But somehow the clock is still “throttled” because it is “Idle”? What determines whether an idle clock throttle is active?
(BTW, somehow all that prior fiddling got me from 28 FPS to 32 FPS in the Deus Ex MD benchmark, so some good came from it.)
“Idle” throttling is always active; it just means PowerMizer is enabled rather than fixed clocking.
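If you want to confirm the card is actually holding its boost clocks during the benchmark, polling the performance state and clocks works, e.g. (again assuming a reasonably recent nvidia-smi):
$ nvidia-smi --query-gpu=pstate,clocks.gr,clocks.mem,utilization.gpu --format=csv -l 1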
How many fps does glxgears report if you run it normally, i.e. locked to vsync?
__NV_PRIME_RENDER_OFFLOAD=1 glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
12856 frames in 5.0 seconds = 2571.128 FPS
14342 frames in 5.0 seconds = 2868.393 FPS
32422 frames in 5.0 seconds = 6484.275 FPS
The final measurement was after the terminal window covered the glxgears window.
Edit (4/20/21)
Poking around I noticed something that may be wrong.
xrandr --setprovideroffloadsink 1 0
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 139 (RANDR)
Minor opcode of failed request: 34 (RRSetProviderOffloadSink)
Value in failed request: 0x35f
Serial number of failed request: 16
Current serial number in output stream: 17
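(Before reading too much into the BadValue, I should probably list the providers first and check which indices actually exist, something like:)
$ xrandr --listproviders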
Ok.
I had hoped my next post would be to report success. I have spent a lot of time working on this on my own. There have been many dead ends.
I need a question answered.
According to ‘nvidia-smi -q’, the maximum memory clock on the 1660 Ti is 6001 MHz, but according to the advertised specs the GDDR6 memory should run at 12000 MHz.
This would fit with the GPU usage hovering in the 50% range during benchmarks.
Now I understand that the advertised specs are probably the “effective speed” because of GDDR quirks and marketing.
My question is, does ‘nvidia-smi -q’ show the effective memory clock speed or the actual memory clock speed?
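(If the advertised 12000 MHz is indeed an effective transfer-rate figure, the arithmetic would at least line up with nvidia-smi reporting half of it: 6001 MHz × 2 ≈ 12000 MHz. But I’d like confirmation of what nvidia-smi is actually reporting.)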