Successful Prime use, but with really poor performance

I have successfully set up PRIME render offload on my desktop system. This was despite using a distro I was not yet familiar with (Gentoo), and despite both cards being NVIDIA (so no modesetting driver is involved). A GT 1030 drives the monitors, and a GTX 1660 Ti is meant to handle Steam games and the like.

However, the performance is really not good. Using glxgears makes this clear.
The following is for the GT 1030.

$ __GL_SYNC_TO_VBLANK=0 glxgears
98089 frames in 5.0 seconds = 19617.654 FPS
98946 frames in 5.0 seconds = 19789.189 FPS
99055 frames in 5.0 seconds = 19810.928 FPS

And the following is for the GTX 1660 Ti.

$ __NV_PRIME_RENDER_OFFLOAD=1 __GL_SYNC_TO_VBLANK=0 glxgears
13003 frames in 5.0 seconds = 2600.484 FPS
12234 frames in 5.0 seconds = 2446.688 FPS
12359 frames in 5.0 seconds = 2471.798 FPS

My expectation was that the GTX 1660 Ti would produce a higher FPS.
I am aware that glxgears is not an appropriate benchmark, so I will add that the benchmark within Deus Ex: Mankind Divided averages 28 FPS at 1920x1200 resolution with Medium settings. I’m told it should be around 90 FPS.
nvidia-bug-report.log.gz (86.6 KB)

Can you try it like this?

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only

Try it with both the Steam game and glxgears.
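For the Steam game, the same variables can go into the game's launch options (right-click the game → Properties → Launch Options), with %command% standing in for the game's own command line, e.g.:

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only %command%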

Btw, I don't think a GT 1030 can pull 90 FPS on Deus Ex MD.

I just attempted the environment variables you suggested. The results are below.

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
7411 frames in 5.0 seconds = 1482.076 FPS
5240 frames in 5.0 seconds = 1047.898 FPS
5448 frames in 5.0 seconds = 1088.962 FPS

I also attempted with VBLANK disabled.

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia __VK_LAYER_NV_optimus=NVIDIA_only __GL_SYNC_TO_VBLANK=0 glxgears
6713 frames in 5.0 seconds = 1342.317 FPS
5970 frames in 5.0 seconds = 1193.964 FPS
6467 frames in 5.0 seconds = 1292.364 FPS

As you can see, there is no improvement.
And in regards to the Deus Ex FPS, I was running it on the GTX 1660 Ti, so noticeably higher FPS would be reasonable.

First of all, you can't use glxgears on a prime setup to show anything. The framerate will always be very low due to the frame-copy overhead at very high fps, so the results are completely useless.
On prime, always use a full-screen game or a Unigine demo.
Second, I don't know whether nvidia-to-nvidia offloading works at all, or whether additional settings have to be made to specify the offload target.
As a first measure, please install nvidia-prime, then use it to check whether glxgears/DeusEx is running on the 1660 at all.

Hmm, I should have read more carefully.

I really don't know whether there is a finer GPU selection method you can use on a dual-NVIDIA system.

One thing you can try is:

Forcing Deus Ex MD to run with Proton and using the DXVK device filter to actually offload it onto the GTX 1660.
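Roughly like this in the game's launch options when it runs through Proton; DXVK_FILTER_DEVICE_NAME matches against the Vulkan device name, so check the exact string with vulkaninfo first (the name below is just my guess):

DXVK_FILTER_DEVICE_NAME="GeForce GTX 1660 Ti" %command%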

Sorry, I meant please install nvidia-smi, not nvidia-prime.

These commands were run in two terminal windows, simultaneously on the same system.

__NV_PRIME_RENDER_OFFLOAD=1 __GL_SYNC_TO_VBLANK=0 glxgears
7064 frames in 5.0 seconds = 1412.620 FPS
6333 frames in 5.0 seconds = 1266.173 FPS
6760 frames in 5.0 seconds = 1351.766 FPS
6150 frames in 5.0 seconds = 1229.806 FPS
5758 frames in 5.0 seconds = 1150.886 FPS
6953 frames in 5.0 seconds = 1390.509 FPS
7261 frames in 5.0 seconds = 1451.825 FPS

$ nvidia-smi
Mon Apr 12 08:20:01 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 00000000:04:00.0  On |                  N/A |
| 35%   42C    P0    N/A /  30W |    147MiB /  1991MiB |     19%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:0C:00.0 Off |                  N/A |
|  0%   37C    P5    12W / 120W |     11MiB /  5944MiB |     26%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3484      G   /usr/bin/X                        104MiB |
|    1   N/A  N/A      3484      G   /usr/bin/X                          6MiB |
|    1   N/A  N/A     19333      G   glxgears                            2MiB |
+-----------------------------------------------------------------------------+

I feel that this establishes that prime is technically working. However, the performance continues to be poor.

And this was captured while the Deus Ex MD benchmark was running.

nvidia-smi
Mon Apr 12 08:33:07 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GT 1030     Off  | 00000000:04:00.0  On |                  N/A |
| 35%   39C    P0    N/A /  30W |    141MiB /  1991MiB |     21%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 166...  Off  | 00000000:0C:00.0 Off |                  N/A |
| 48%   55C    P0    68W / 120W |   1445MiB /  5944MiB |     47%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3484      G   /usr/bin/X                         98MiB |
|    1   N/A  N/A      3484      G   /usr/bin/X                         97MiB |
|    1   N/A  N/A      7795      G   ...e/Steam/ubuntu12_32/steam       33MiB |
|    1   N/A  N/A      7802      G   ./steamwebhelper                    2MiB |
|    1   N/A  N/A     18727    C+G   ...kind Divided/bin/DeusExMD     1305MiB |
+-----------------------------------------------------------------------------+

Which DE are you using?
nvidia-smi confirms that Deus Ex is running on the 1660, but GPU usage is quite low. Two things to try:

  1. Disconnect the rotated monitor; IIRC, this caused problems with prime at one point, and I don't know whether that has been fixed in the meantime.
  2. Disable the IOMMU, either in the BIOS or via the kernel parameter iommu=off (a bootloader sketch follows below).
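For the kernel parameter route, assuming you boot with GRUB (adjust for your bootloader otherwise), something like:

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... iommu=off"
# then regenerate the config
grub-mkconfig -o /boot/grub/grub.cfg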

Edit: please also check the PCIe TX/RX throughput using nvidia-smi -q
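If you want to watch the throughput live while the benchmark runs, nvidia-smi dmon should also work, assuming your driver build supports the throughput stat group:

$ nvidia-smi dmon -s t    # rxpci/txpci throughput columns, one row per GPU per sample

or simply grep the full query output:

$ nvidia-smi -q | grep -i throughput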

My DE is LXDE. I like having something lightweight.

I ran nvidia-smi -q during the Deus Ex MD benchmark, once before adding the "iommu=off" kernel parameter and once after. There was no significant difference.

Below are the TX and RX results from before using the “iommu=off” parameter. (slightly edited for readability)

sleep 60 && nvidia-smi -q
==============NVSMI LOG==============
Timestamp : Mon Apr 12 10:27:58 2021
Driver Version : 460.56
CUDA Version : 11.2
Attached GPUs : 2
GPU 00000000:04:00.0
Product Name : GeForce GT 1030
Product Brand : GeForce
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Disabled

PCI
Bus : 0x04
Device : 0x00
Domain : 0x0000
Device Id : 0x1D0110DE
Bus Id : 00000000:04:00.0
Sub System Id : 0x8C981462
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 4x
Current : 4x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 526000 KB/s

GPU 00000000:0C:00.0
Product Name : GeForce GTX 1660 Ti
Product Brand : GeForce RTX
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled

PCI
Bus : 0x0C
Device : 0x00
Domain : 0x0000
Device Id : 0x218210DE
Bus Id : 00000000:0C:00.0
Sub System Id : 0x3FBE1458
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 511000 KB/s
Rx Throughput : 404000 KB/s

LXDE uses Openbox as its WM, so that shouldn't have any effect on performance.
Just to rule out that this is Deus Ex-specific, please try an Unigine demo or another demanding game.
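If you use the standalone Heaven download from Unigine, it ships a launcher shell script, so the offload variables can be prepended the same way as for glxgears (the path here is just an example):

$ cd Unigine_Heaven-4.0
$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./heaven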

Installed the Unigine Heaven demo using Phoronix Test Suite and ran it. I also confirmed it was running on the 1660.

Here are the results.

Unigine Heaven 4.0:
pts/unigine-heaven-1.6.5 [Resolution: 1920 x 1200 - Mode: Fullscreen - Renderer: OpenGL]
Test 1 of 1
Estimated Trial Run Count: 3
Estimated Time To Completion: 17 Minutes [13:00 EDT]
Started Run 1 @ 12:44:06
Started Run 2 @ 12:48:43
Started Run 3 @ 12:53:18

Resolution: 1920 x 1200 - Mode: Fullscreen - Renderer: OpenGL:
    36.2928
    36.4576
    36.4384

Average: 36.3963 Frames Per Second
Deviation: 0.25%

Comparison to 3,618 OpenBenchmarking.org samples since 15 June 2018; median result: 56.87. This result: 36.3963 (31st percentile).
Reference points from the box plot: MSI NVIDIA GeForce GTX 960: 43.47 | NVIDIA GeForce GTX 1650: 51.18 | MSI AMD Radeon RX 470: 64 | Sapphire AMD Radeon RX 470: 68 | MSI NVIDIA GeForce GTX 1080: 133 | Gigabyte NVIDIA GeForce RTX 2070 SUPER: 184 | Gigabyte NVIDIA GeForce GTX 1080 Ti: 190 | NVIDIA GeForce GTX 1080 Ti: 198 | NVIDIA GeForce RTX 3080: 242

============================================

EDIT (4/12/2021 13:35 EDT)
What. The. Frick.
Out of curiosity I started investigating the PCIe lanes of my motherboard. And look what I found.

dmesg | grep PCIe
[ 0.777846] acpi PNP0A08:00: _OSC: platform does not support [PCIeHotplug SHPCHotplug PME LTR]
[ 0.778003] acpi PNP0A08:00: _OSC: OS now controls [AER PCIeCapability]
[ 0.781639] pci 0000:02:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 126.024 Gb/s with 16 GT/s x8 link)
[ 0.786586] pci 0000:04:00.0: 8.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x4 link at 0000:03:00.0 (capable of 31.504 Gb/s with 8 GT/s x4 link)
[ 0.788221] pci 0000:06:00.0: 31.506 Gb/s available PCIe bandwidth, limited by 16 GT/s x2 link at 0000:03:02.0 (capable of 63.012 Gb/s with 16 GT/s x4 link)
[ 0.790776] pci 0000:09:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.793545] pci 0000:0a:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.793983] pci 0000:0b:00.0: 63.012 Gb/s available PCIe bandwidth, limited by 16 GT/s x4 link at 0000:00:01.2 (capable of 252.048 Gb/s with 16 GT/s x16 link)
[ 0.794367] pci 0000:0c:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:03.1 (capable of 126.016 Gb/s with 8 GT/s x16 link)
[ 1.322708] igb 0000:07:00.0: eth0: (PCIe:2.5Gb/s:Width x1) 9c:5c:8e:bc:22:ba
[ 3.841312] nvidia: unknown parameter 'Vreg_EnablePCIeGen3' ignored

In case you don't see the issue: both of the slots hosting the GPUs are operating at PCIe 1.0 speeds. This MB is capable of 4.0 speeds. GGAAAHHH!!!

And to rub some salt in the wound, the nvidia driver did not recognize the PCIe 3.0 parameter.

This appears to be outside the context of NVIDIA and its products. So if I disappear for a while, it is because I am wrestling with either the MB vendor or figuring out more dark secrets of my OS.

You're misinterpreting the PCIe speed display. When the nvidia GPU throttles down its clocks, it also throttles down the PCIe link speed. That's why nvidia-smi also reports it. Under GPU load:
PCIe Generation
Max : 3
Current : 3
On idle:
Current : 1
So there's nothing wrong with your board.

I didn't recognize this parameter, and I don't have it on my system either. Google finds !zero! entries about it. This module parameter does not seem to exist.

It is apparently supposed to be NVreg_EnablePCIeGen3 and included in the /etc/modprobe.d/nvidia.conf file.
It is listed on the Gentoo Wiki page.
I am aware the page is likely outdated.
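Presumably it's meant to be a module option line, i.e. something like this in /etc/modprobe.d/nvidia.conf (just my reading of the wiki, untested):

options nvidia NVreg_EnablePCIeGen3=1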

Ah yes, a typo.
And now I found it, sorry just woke up :)

The description on the wiki page for that parameter is wrong.
Some early PCIe gen3 chipsets were not working properly, so the nvidia driver contains a blacklist for them. The parameter overrides that blacklist. So unless you're running this on a 10-year-old mainboard with a broken chipset, that parameter does nothing.

I have a new lead.
After fiddling with BIOS settings, kernel compiles, and running 'sleep 60 && nvidia-smi -q' during benchmarks, I spotted this for both the 1030 and the 1660.

Performance State                     : P0
Clocks Throttle Reasons
    Idle                              : Active
    Applications Clocks Setting       : Not Active
    SW Power Cap                      : Not Active
    HW Slowdown                       : Not Active
        HW Thermal Slowdown           : Not Active
        HW Power Brake Slowdown       : Not Active
    Sync Boost                        : Not Active
    SW Thermal Slowdown               : Not Active
    Display Clock Setting             : Not Active

I checked prior tests and it was the same before.

Now, if I understand performance states correctly, P0 means the card is at full power, which makes sense because this is in the middle of a benchmark. And the same printout shows the 1660 clocks at, or near, their maximum. But somehow the clock is still "throttled" because it is "Idle"? What determines whether the Idle clock throttle reason is active?

(BTW, somehow all that prior fiddling got me from 28 FPS to 32 FPS in the Deus Ex MD benchmark, so some good came from it.)

"Idle" throttling is always reported as active; it just means PowerMizer (adaptive clocking) is enabled rather than fixed clocks.
How many fps does glxgears report if you run it normally, i.e. locked to vsync?
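Also, if you want to rule PowerMizer out entirely, you could try forcing "Prefer Maximum Performance" for the 1660 and re-run the benchmark; I'm not sure this applies cleanly to a render-offload GPU, and the gpu index may differ on your system:

nvidia-settings -a [gpu:1]/GPUPowerMizerMode=1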

__NV_PRIME_RENDER_OFFLOAD=1 glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
12856 frames in 5.0 seconds = 2571.128 FPS
14342 frames in 5.0 seconds = 2868.393 FPS
32422 frames in 5.0 seconds = 6484.275 FPS

The final measurement was after the terminal window covered the glxgears window.

Edit (4/20/21)
Poking around I noticed something that may be wrong.

xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x1b8 cap: 0x1, Source Output crtcs: 2 outputs: 3 associated providers: 0 name:NVIDIA-0
Provider 1: id: 0x35f cap: 0x2, Sink Output crtcs: 4 outputs: 7 associated providers: 0 name:NVIDIA-G0

I have tried to change this but…

xrandr --setprovideroffloadsink 1 0
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 139 (RANDR)
Minor opcode of failed request: 34 (RRSetProviderOffloadSink)
Value in failed request: 0x35f
Serial number of failed request: 16
Current serial number in output stream: 17

Ok.
I had hoped my next post would be to report success. I have spent a lot of time working on this on my own. There have been many dead ends.

I need a question answered.
According to 'nvidia-smi -q', on the 1660 Ti, the maximum clock speed of the memory is 6001 MHz. But according to the advertised specs, the speed should be 12000 MHz for the GDDR6 memory.
This would fit with the GPU usage hovering in the 50% range during benchmarks.
Now I understand that the advertised specs are probably the “effective speed” because of GDDR quirks and marketing.
My question is, does ‘nvidia-smi -q’ show the effective memory clock speed or the actual memory clock speed?