Bad performance and clocks issue on 384.90, Quadro M620 Mobile / Dell 3520 with Lightworks

Hello.

I’m a Lightworks (NLE) user running Manjaro Linux with the 384.90 driver. I bought a mobile workstation to better organise my work, and I’m disappointed with the overall Nvidia GPU performance.

There is a problem with sleep and wakeup. After resuming from sleep, the GPU gets stuck at a low clock rate, which affects overall performance: PowerMizer switches to performance level 2, but the clock stays at ~135-235MHz. After a reboot, things return to normal.

The second issue is performance in general. We have a test project, a kind of benchmark, which shows how Lightworks behaves on different configurations and operating systems. On Linux, Nvidia GPUs have always performed worse than ATI/AMD. Only Windows+Nvidia is really good. If you’re curious, here are the two most recent forum threads:

  • Export time benchmarks: https://www.lwks.com/index.php?option=com_kunena&func=view&catid=217&id=156673&Itemid=81#157781
  • General GPU performance tests: https://www.lwks.com/index.php?option=com_kunena&func=view&catid=217&id=150643&Itemid=81 (long story)

Please note that these benchmarks aren’t scientific, but they show how things work in real usage.

Quick performance comparison

These results come from measuring the export time of the Lightworks test project, run on Manjaro Linux unless marked WINDOWS:

  • PC i7-2600K/HD6850: 45sec
  • Dell Inspiron i5-3337U/HD8730M: 1m10sec
  • Dell Vostro i7-7500U/GF940MX: 1m51sec
  • Dell Vostro i7-7500U/GF940MX: 1m19sec (WINDOWS)
  • Dell Precision i7-7700HQ/QuadroM620: 2m58sec (clock stuck issue)
  • Dell Precision i7-7700HQ/QuadroM620: 1m28sec (after reboot)
  • Dell Precision i7-7700HQ/QuadroM620: 32sec (WINDOWS)

As you can see, Nvidia GPU performance in this use case is very poor. We cannot determine why it is so slow. We only know that AMD/Radeon and Windows+Nvidia work far better than Linux+Nvidia.
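
To put numbers on the gap, here is a minimal Python sketch that converts the export times from the list above into seconds and computes the Linux/Windows ratio per machine. The helper name to_seconds and the dictionary keys are only illustrative; the times themselves are copied from this post.

```python
import re

def to_seconds(text):
    """Convert strings such as '45sec', '1m10sec' or '2m:58sec' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m:?)?(\d+)sec", text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + int(match.group(2))

# Export times copied from the list in this post (labels are illustrative).
linux = {
    "Vostro/GF940MX": "1m51sec",
    "Precision/QuadroM620 (clock stuck)": "2m:58sec",
    "Precision/QuadroM620 (after reboot)": "1m:28sec",
}
windows = {
    "Vostro/GF940MX": "1m19sec",
    "Precision/QuadroM620 (clock stuck)": "32sec",
    "Precision/QuadroM620 (after reboot)": "32sec",
}

def slowdown(machine):
    """Return how many times slower the Linux export is than the Windows one."""
    return to_seconds(linux[machine]) / to_seconds(windows[machine])

for machine in linux:
    print(f"{machine}: Linux is {slowdown(machine):.2f}x slower than Windows")
```

For the Precision after a reboot this gives 88 s against 32 s, so the Linux export takes 2.75 times as long on the same hardware.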

Is there anything we can do to improve performance on Linux? Why is it so slow? Why are ATI/AMD (open-source drivers) and older cards so much better (for our case) than powerful Nvidia chips?

Kind Regards,
Marcin

nvidia-bug-report-clock-stuck-plugged-after-wakeup.log.gz (147 KB)
nvidia-bug-report-reboot-no-ac.gz (135 KB)

There are currently driver bugs affecting mobile Quadros, see:
https://devtalk.nvidia.com/default/topic/1010612/linux/sluggish-performance-no-reclocking-ubuntu-17-04-kernel-4-12rc2-nvidia-quadro-m2200-driver-381-22-/
A workaround is to set nvidia-drm.modeset=1 as a kernel parameter.
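
To confirm that the parameter actually reached the running kernel, one option is to look at /proc/cmdline. The following is a small Python sketch, not an official tool; the function name kernel_param and the sample command line are made up for illustration. It treats dashes and underscores in the parameter name as equivalent, as the kernel does.

```python
from pathlib import Path

def kernel_param(cmdline, name):
    """Return the value of `name` in a kernel command line, or None if absent.

    Dashes and underscores in the name are treated as the same character.
    """
    wanted = name.replace("-", "_")
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key.replace("-", "_") == wanted:
            return value
    return None

# Hypothetical command line, only to show the expected shape of the input.
sample = "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 rw quiet nvidia-drm.modeset=1"
print("sample:", kernel_param(sample, "nvidia-drm.modeset"))

# On a real Linux system, read the command line of the running kernel.
proc_cmdline = Path("/proc/cmdline")
if proc_cmdline.exists():
    print("running kernel:", kernel_param(proc_cmdline.read_text(), "nvidia-drm.modeset"))
```

A result of None for the running kernel means the boot loader configuration did not pass the parameter, so the workaround was never in effect.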

Hi.

Thank you for the answer.
The kernel parameter did not help, but I’ll try playing with it a bit later.

    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    58     0     0     -     -  2505   254 <-- sleep/wakeup
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0     -    59     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
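
The table above has the column layout printed by nvidia-smi dmon (see the header rows in the middle of it). A rough Python sketch for spotting the moment the clock drops in such a log could look like this. It assumes mclk and pclk are the last two columns, as in that header, and the names parse_dmon and first_clock_drop are only illustrative.

```python
def parse_dmon(text):
    """Return a list of (mclk, pclk) tuples in MHz, one per sample row."""
    samples = []
    for line in text.splitlines():
        line = line.split("<--")[0].strip()  # drop hand-written annotations
        if not line or line.startswith("#"):  # skip blank lines and header rows
            continue
        fields = line.split()
        samples.append((int(fields[-2]), int(fields[-1])))
    return samples

def first_clock_drop(samples, factor=0.5):
    """Return the index of the first sample whose pclk is below factor * previous pclk."""
    previous = None
    for index, (_, pclk) in enumerate(samples):
        if previous is not None and pclk < previous * factor:
            return index
        previous = pclk
    return None

# A shortened copy of the log posted above.
log = """\
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    58     0     0     -     -  2505   254 <-- sleep/wakeup
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0     -    59     0     0     -     -  2505   254
"""

samples = parse_dmon(log)
drop = first_clock_drop(samples)
print("samples:", samples)
print("first drop at sample index:", drop)
```

In this log the memory clock stays at 2505 MHz the whole time, and only the graphics clock falls from 1019 MHz to 254 MHz after wakeup.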

I see that nvidia-drm is not listed in /proc/modules. This may be due to the Bumblebee setup (I’m using the Intel GPU for common tasks and run Lightworks via optirun/primusrun). Maybe that’s why your hint does not work for me.
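
For anyone who wants to repeat this check, here is a minimal Python sketch. Note that /proc/modules lists module names with underscores, so the entry to look for is nvidia_drm. The sample text below is made up for illustration (sizes and addresses are not real), and the function name loaded_modules is an assumption of this sketch.

```python
from pathlib import Path

def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }

# Hypothetical excerpt in the /proc/modules format (name is the first field).
sample = """\
nvidia 13000000 0 - Live 0x0000000000000000 (PO)
bbswitch 16384 0 - Live 0x0000000000000000 (O)
i915 1600000 10 - Live 0x0000000000000000
"""
print("nvidia_drm in sample:", "nvidia_drm" in loaded_modules(sample))

# On a real Linux system, check the modules of the running kernel.
proc_modules = Path("/proc/modules")
if proc_modules.exists():
    names = loaded_modules(proc_modules.read_text())
    print("nvidia_drm loaded:", "nvidia_drm" in names)
```

If nvidia_drm is not loaded, the nvidia-drm.modeset=1 setting has nothing to act on, which would match what I see here.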

Your logs look like you’re using PRIME, set up by Manjaro. This shouldn’t work with bumblebee.

At first I used Bumblebee and the issue was already there. I didn’t know about the bug you mentioned.
Then I reconfigured the setup manually to try PRIME. After that I went back to Bumblebee.

I’ll remove everything, reinstall and try again with your hint.

Ok, that makes sense.
Just as a note, the Quadro bug only affects performance in normal operation and is not connected to the suspend/resume issue you’re facing. One user has a similar problem on a GTX 850M, with no solution:
https://devtalk.nvidia.com/default/topic/1023161/linux/33mhz-stuck-problem-after-waking-up-from-suspend/

Ah, ok. Thank you for the explanation.

The slow performance with Lightworks is related to rendering (export), and this is probably something more exotic. Until the sleep/wakeup issue occurs, the GPU works pretty well (nvidia-smi reports max clock values): playback with complex GPU effects is far better than on the Intel GPU. However, rendering (encoding the final mp4 file) is slower than on other systems (1m28sec; Windows: 32sec; Radeon on an old and cheap laptop: 1m10sec).

So, summing up: export (render) time on Linux is about 3x slower than on Windows on the same machine. We know from the Lightworks devs that there are differences between the Linux (OpenGL) and Windows (Direct3D) ports, and that Linux will always be slower due to the implementation. Our test results confirm that. But 3x is too much, because the AMD drivers seem to work better in this particular case and the difference between Windows and Linux looks smaller for ATI/AMD. I’m wondering why…

I will try to apply your hint, of course, and check again. Thank you for your support.

Sorry to interrupt, but I run an i7-3770 with a GTX960 and I have really bad stuttering issues with 384.90. The previous version of 384 was doing a great job.

Nothing shows on the fps counter of the in-game benchmarks (Dirt Showdown, Dirt Rally), but it feels like frame ordering is sometimes messed up.

Tried Composition Pipeline on/off. Tried different UI and flavors (KDE, Unity, Gnome, Ubuntu, Kubuntu, KDE-Neon), but nothing fixes it. With 381.22 everything is as expected.

Thanks.

Edit: Maybe I shouldn’t have posted in this thread, but was looking for something related to 384.90. Sorry I realize I should have created a new thread.

@Mohandevir: Please run nvidia-bug-report.sh, open a new thread and attach the created tar.gz file to it.

Lightworks uses OpenCL for effects processing, so could my problem be related to OpenCL? Is there anybody who can help me find the cause? I’m not the only one: there are plenty of Linux+Nvidia+Lightworks users complaining about the performance. Or is it maybe related to missing PRIME offloading?

As I understand it, Lightworks uses pixel shaders i.e. DirectX on Windows and OpenGL on Linux/macOS for effects. De-/encoding is done on CPU. So it’s rather a question of bad OpenGL optimization/use on the Lightworks side.
E.g. the host-to-GPU test basically measures how many 1080p frames can be copied to GPU memory per second. Not that many, looking at the numbers.
PS: PRIME offload aka render offloading has nothing to do with performance but rather power saving/convenience.
GPU-Test: https://www.lwks.com/index.php?option=com_kunena&func=view&catid=21&id=15417&Itemid=81#23524
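
To give a feeling for what "1080p frames copied per second" means in bytes, here is a small back-of-the-envelope sketch in Python. It assumes 4 bytes per pixel (8-bit RGBA); the pixel format Lightworks really uses for uploads is not known to me, so treat the numbers as an illustration only.

```python
def frame_bytes(width=1920, height=1080, bytes_per_pixel=4):
    """Size of one uncompressed frame in bytes (assumes 8-bit RGBA by default)."""
    return width * height * bytes_per_pixel

def required_bandwidth_mb_s(fps, **frame_kwargs):
    """Transfer rate in MB/s (10^6 bytes) needed to copy `fps` frames per second."""
    return fps * frame_bytes(**frame_kwargs) / 1e6

def frames_per_second(bandwidth_mb_s, **frame_kwargs):
    """Frames per second that a given transfer rate in MB/s can sustain."""
    return bandwidth_mb_s * 1e6 / frame_bytes(**frame_kwargs)

print("bytes per 1080p frame:", frame_bytes())
print("MB/s needed for 60 fps:", required_bandwidth_mb_s(60))
print("fps at 1000 MB/s:", round(frames_per_second(1000), 1))
```

One such frame is about 8.3 MB, so even a few hundred frames per second in the host-to-GPU test already corresponds to several GB/s of transfer.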

Thanks for the clarification about PRIME.

We know from the Lightworks devs that the Linux port has some tricky parts due to multithreading issues, and they confirmed the worse performance (compared to Windows). But please look at my laptops’ results:

  • Dell Precision M3520 / i7 HQ / Quadro M620: 1m28s (384.90)
  • Dell Precision M3520 / i7 HQ / Intel GPU: 1m39s
  • Dell Vostro 5568 / i7U / GF940MX: 1m47s (375.82)
  • Dell Vostro 5568 / i7U / Intel GPU: 1m36s
  • Dell Inspiron / i5U / Radeon: 1m10s
  • Dell Inspiron / i5U / Intel GPU: 1m57s

The Precision is the most powerful, the Vostro is mid-range, and the Inspiron is the weakest, oldest and cheapest. As you can see, the weakest machine does best in this test. It has a Radeon HD 8730M installed, a GPU about 2-3 times slower than the Quadro.

I can’t understand such differences. I’d just expect a better result for the Precision on Linux, around 50 seconds.

Windows times (for reference):

  • Precision: 0m32sec
  • Vostro: 1m19sec
  • Inspiron: not tested

I agree that this is hard to understand but it is also hard to tell the reason.
When exporting a project, it boils down to
decode (cpu)->copy(GL?)->generate/apply effects(GL)->copy(GL?)->encode(cpu)
Question is, where is the missing time spent? Maybe LW is using GL commands that are better optimized in Mesa, maybe the de-/encoding is gpu assisted on Mesa. Only LW devs can tell that.
What driver was used with the AMD? fglrx, Mesa?
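
To narrow down where the time goes in the decode->copy->effects->copy->encode chain above, per-stage wall-clock totals are one possible starting point. Below is a minimal Python sketch of that idea. All stage functions are hypothetical placeholders, not Lightworks code. Also keep in mind that OpenGL calls are asynchronous, so in a real measurement the GL stages would need a sync point (e.g. glFinish) before stopping the timer, otherwise the cost shows up in a later stage.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals = defaultdict(float)  # stage name -> accumulated seconds

@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the with-block to stage_totals[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start

# Hypothetical placeholder stages; a real pipeline would do actual work here.
def decode(index):
    return bytes(16)

def upload(frame):
    return frame

def apply_effects(texture):
    return texture

def download(texture):
    return texture

def encode(frame):
    return len(frame)

def export(n_frames):
    """Run every frame through all five stages, timing each stage separately."""
    for index in range(n_frames):
        with timed("decode"):
            frame = decode(index)
        with timed("upload"):
            texture = upload(frame)
        with timed("effects"):
            texture = apply_effects(texture)
        with timed("download"):
            frame = download(texture)
        with timed("encode"):
            encode(frame)

export(100)
for stage, seconds in stage_totals.items():
    print(f"{stage:>8}: {seconds * 1000:.3f} ms")
```

Comparing such totals between the Nvidia, Intel and Radeon setups would at least show which stage differs, even if only the LW devs can add this kind of instrumentation.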

Some random thoughts on the numbers:
940MX vs. HD 8730M makes sense, the latter has twice the memory transfer rate.
The Dell Precision numbers are completely off, even for the Intel GPU compared to the Vostro. I don’t know what codecs you use for export: if you export to some raw format, this might be limited by disk transfer; if encoded, it should be faster due to more cores, unless the codec only uses one or two threads.
In the general Intel case, it’s a question of whether real copies occur or the buffers are simply mmapped, since it’s all system memory. That would save a lot of time.

Thank you, generix.

All configs have SSDs installed. Only the Vostro shows some slowdown a few seconds into the benchmark. I was curious and also tried the Inspiron with its original 5400 rpm HDD: there was no significant difference in the results. All machines have at least 8GB RAM.

The export codec was always the same: MP4 720p (Lightworks “YouTube 720p” preset).

About “copies vs mmapped buffers”: I have no idea. I’ll ask the devs. Thank you very much for the hint.

> When exporting a project, it boils down to
> decode (cpu)->copy(GL?)->generate/apply effects(GL)->copy(GL?)->encode(cpu)

AFAIK it is something like that, in multiple threads.

> Question is, where is the missing time spent? Maybe LW is using GL commands that are better optimized in Mesa, maybe the de-/encoding is gpu assisted on Mesa. Only LW devs can tell that.

I’ve been trying to find the possible bottleneck for weeks, and I’m still learning the differences between the drivers. As you said, we can’t pinpoint the one function causing the slowdown, but we can try to shorten the list of suspicious places. I’m writing here not for a complete solution, but to get information or confirmation that neither the Nvidia chip nor the driver is responsible for such bottlenecks, and to get some hints.

> What driver was used with the AMD? fglrx, Mesa?

xf86-video-ati 7.9.0 (“radeon” module)
mesa 17.1.8
glu 9.0.0

Pastebin with xorg.log: http://pastebin.com/raw/dTghJWYe

BR,
Marcin

A small update: I’ve repeated the tests with kernel 4.12.14-1. The export test takes 1m28s on both the Intel GPU and the Quadro M620. Playback during editing is smoother with the Quadro, though.