Bad performance at high reported wattage usage on RTX A2000 in laptop with Ubuntu 22.04

The performance of the RTX A2000 in my notebook is sporadically quite bad (roughly >2x slower on machine learning applications using pytorch). This coincides with a high reported wattage usage (e.g. 61W / 35W) while the performance is as expected if the power consumption is only about 35W / 35W.

Once the reported ~60W appears, it stays at that level, even at idle. The fans are also not spinning the way I would expect given how hot the GPU should get at 60W.

The problem seems to only occur if an external display via USB-C is connected, though I am not confident that it doesn’t happen when connected via other means or without external display.

The external display also serves as the power connection via the USB-C connection.

I attached two nvidia-bug-report.sh logs. The old one was not recorded with startx -- -logverbose 6, but more can’t hurt I guess.

nvidia-bug-report.log.gz (430.1 KB)
nvidia-bug-report.log.old.gz (454.8 KB)

inxi -F:

System:
  Host: d2hxtt3 Kernel: 6.5.0-17-generic x86_64 bits: 64 Desktop: GNOME 42.9
    Distro: Ubuntu 22.04.3 LTS (Jammy Jellyfish)
Machine:
  Type: Laptop System: Dell product: Precision 5570 v: N/A
    serial: <superuser required>
  Mobo: Dell model: 03M8N5 v: A00 serial: <superuser required> UEFI: Dell
    v: 1.20.0 date: 12/19/2023
Battery:
  ID-1: BAT0 charge: 84.3 Wh (100.0%) condition: 84.3/84.3 Wh (100.0%)
CPU:
  Info: 14-core (6-mt/8-st) model: 12th Gen Intel Core i9-12900H bits: 64
    type: MST AMCP cache: L2: 11.5 MiB
  Speed (MHz): avg: 621 min/max: 400/4900:5000:3800 cores: 1: 818 2: 400
    3: 680 4: 400 5: 973 6: 400 7: 867 8: 766 9: 913 10: 400 11: 866 12: 924
    13: 766 14: 858 15: 400 16: 400 17: 400 18: 400 19: 400 20: 400
Graphics:
  Device-1: Intel Alder Lake-P Integrated Graphics driver: i915 v: kernel
  Device-2: NVIDIA driver: nvidia v: 535.154.05
  Device-3: Microdia Integrated_Webcam_HD type: USB driver: uvcvideo
  Display: x11 server: X.Org v: 1.21.1.4 driver: X:
    loaded: modesetting,nvidia unloaded: fbdev,nouveau,vesa gpu: i915
    resolution: 1: 3840x2160~60Hz 2: 3840x2400~60Hz
  OpenGL: renderer: Mesa Intel Graphics (ADL GT2)
    v: 4.6 Mesa 23.0.4-0ubuntu1~22.04.1
Audio:
  Device-1: Intel Alder Lake PCH-P High Definition Audio
    driver: snd_hda_intel
  Sound Server-1: ALSA v: k6.5.0-17-generic running: yes
  Sound Server-2: PulseAudio v: 15.99.1 running: yes
  Sound Server-3: PipeWire v: 0.3.48 running: yes
Network:
  Device-1: Intel Alder Lake-P PCH CNVi WiFi driver: iwlwifi
  IF: wlp0s20f3 state: up mac: 14:75:5b:60:85:b5
  Device-2: Realtek RTL8153 Gigabit Ethernet Adapter type: USB
    driver: r8152
  IF: enxcc96e5d7a0d3 state: up speed: 1000 Mbps duplex: full
    mac: cc:96:e5:d7:a0:d3
  IF-ID-1: docker0 state: down mac: 02:42:7f:5e:d1:d4
Bluetooth:
  Device-1: Intel type: USB driver: btusb
  Report: hciconfig ID: hci0 state: up address: 14:75:5B:60:85:B9
Drives:
  Local Storage: total: 1.86 TiB used: 623.34 GiB (32.7%)
  ID-1: /dev/nvme0n1 vendor: Western Digital
    model: PC SN810 NVMe WDC 1024GB size: 953.87 GiB
  ID-2: /dev/nvme1n1 vendor: Western Digital
    model: PC SN810 NVMe WDC 1024GB size: 953.87 GiB
Partition:
  ID-1: / size: 937.33 GiB used: 623.27 GiB (66.5%) fs: ext4
    dev: /dev/nvme0n1p2
  ID-2: /boot/efi size: 511 MiB used: 66.2 MiB (13.0%) fs: vfat
    dev: /dev/nvme0n1p1
Swap:
  ID-1: swap-1 type: file size: 16 GiB used: 0 KiB (0.0%) file: /swapfile
Sensors:
  System Temperatures: cpu: 43.0 C mobo: N/A
  Fan Speeds (RPM): N/A
Info:
  Processes: 479 Uptime: 23m Memory: 31.01 GiB used: 6.62 GiB (21.4%)
  Shell: fish inxi: 3.3.13

Welcome @jost.triller to the NVIDIA developer forums!

Looking through the logs there is nothing obvious to see.

I would recommend you use nvidia-smi dmon to monitor the GPU behavior while running your ML workloads.

And then look out for irregularities or faults.
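
If you would rather log the values than watch `nvidia-smi dmon` by hand, something like the following rough sketch could poll the driver and flag samples where the reported draw sits well above the limit. The 35 W nominal value is taken from your post, the 1.2 margin is an arbitrary assumption, and `parse_sample`, `read_sample`, `is_suspicious` and `monitor` are just illustrative names, not part of any NVIDIA tool.

```python
import shutil
import subprocess
import time

# Query fields understood by `nvidia-smi --query-gpu`.
FIELDS = ["timestamp", "pstate", "power.draw", "temperature.gpu", "clocks.sm"]
NUMERIC = ("power.draw", "temperature.gpu", "clocks.sm")


def parse_sample(line):
    """Turn one CSV line from nvidia-smi into a dict; non-numeric values such as '[N/A]' become None."""
    values = [v.strip() for v in line.split(",")]
    sample = dict(zip(FIELDS, values))
    for key in NUMERIC:
        try:
            sample[key] = float(sample.get(key))
        except (TypeError, ValueError):
            sample[key] = None
    return sample


def read_sample():
    """Ask nvidia-smi for one sample of the first GPU; returns None if the tool is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_sample(out.splitlines()[0])


def is_suspicious(sample, nominal_watts=35.0, margin=1.2):
    """True if the reported draw exceeds the nominal limit by more than the (assumed) margin."""
    draw = sample["power.draw"]
    return draw is not None and draw > nominal_watts * margin


def monitor(samples=10, interval=1.0):
    """Print a few samples and mark the ones in the odd high-wattage state."""
    for _ in range(samples):
        sample = read_sample()
        if sample is None:
            print("nvidia-smi not found")
            return
        flag = "  <-- above nominal limit" if is_suspicious(sample) else ""
        print(sample, flag)
        time.sleep(interval)
```

Calling `monitor()` in a second terminal while the training job runs should show whether the clocks or the P-state change at the moment the reported wattage jumps.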

Also make sure that pytorch is actually compiled to use the GPU and does use it as well.
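
A quick way to confirm that, as a minimal sketch assuming a standard PyTorch install (`cuda_report` is a made-up helper name; the `torch` calls themselves are regular PyTorch API):

```python
import importlib.util


def cuda_report():
    """Collect basic facts about the installed PyTorch build and its view of the GPU."""
    if importlib.util.find_spec("torch") is None:
        return {"torch_installed": False}

    import torch

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # None here means a CPU-only build of PyTorch.
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # Allocate a small tensor on the GPU to confirm it really lands there.
        x = torch.ones(8, device="cuda")
        report["tensor_device"] = str(x.device)
    return report


print(cuda_report())
```

If `built_with_cuda` is `None` or `cuda_available` is `False`, the workload is running on the CPU regardless of what the GPU reports.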
The fans do not necessarily start at 60W; `nvidia-smi` will also show temperatures. At 60W it might very well still be inside the fanless limit.

On the other hand if it is not, then too high temps could cause the GPU to throttle, explaining the lower performance.

The external display might be a red herring, but do you also see these spikes when connecting only an external power supply? Right now you describe it as if the Laptop runs on battery when observing the low power states, which would be completely normal. As soon as power is connected the power scheme will change and performance will be adjusted.

Another thing to try would be to use prime-select nvidia to force NVIDIA GPU usage or set it through the nvidia settings similar to this:
image

Pytorch is definitely using CUDA.

Thermal throttling can’t explain the performance issues, as it sometimes runs significantly faster for longer periods when all other things are equal, except that the apparent power draw is only 35 W when it’s faster and ~60 W when it’s slower.

It even runs faster just on battery than if the weird 60 W state appears. So low power alone is probably not the issue.

What seems to have helped is selecting the performance mode in the nvidia settings. I haven’t run into this issue since I enabled it.

So while the underlying issue is probably not solved, it’s good enough for me, thanks.

Unfortunately I have to revoke the solution: it looks like the performance mode leaves the GPU without enough free memory for a task I have to do.