Bad performance and clocks issue on 384.90, Quadro M620 Mobile / Dell 3520 with Lightworks

gr00by · October 2, 2017, 7:47am

Hello.

I’m a Lightworks (NLE) user, running Manjaro Linux with the 384.90 drivers. I bought a mobile workstation to better organise my work, and I’m disappointed with the general Nvidia GPU performance.

There is a problem with sleep and wakeup. After sleep the GPU gets stuck at a low clock rate, which affects overall performance: PowerMizer switches to performance level 2, but the clock stays at ~135-235MHz. After a reboot, things get back to normal.

The second issue is performance in general. We have a test project, a kind of benchmark, which shows us how Lightworks behaves on different configurations and OSes. On Linux, Nvidia GPUs have always performed worse than ATI/AMD. Only Windows+Nvidia is really good. If you’re curious, here are the two newest forum threads:

Export time benchmarks: https://www.lwks.com/index.php?option=com_kunena&func=view&catid=217&id=156673&Itemid=81#157781
General GPU performance tests: https://www.lwks.com/index.php?option=com_kunena&func=view&catid=217&id=150643&Itemid=81 (long story)

Please note that these benchmarks aren’t scientific, but they show how things work in real usage.

Quick performance comparison

These results are from measuring the export time of the Lightworks test project. Unless marked WINDOWS, they were run on Manjaro Linux:

PC i7-2600K/HD6850: 45sec
Dell Inspiron i5-3337U/HD8730M: 1m10sec
Dell Vostro i7-7500U/GF940MX: 1m51sec
Dell Vostro i7-7500U/GF940MX: 1m19sec (WINDOWS)
Dell Precision i7-7700HQ/QuadroM620: 2m58sec (clock stuck issue)
Dell Precision i7-7700HQ/QuadroM620: 1m28sec (after reboot)
Dell Precision i7-7700HQ/QuadroM620: 32sec (WINDOWS)

As you can see, Nvidia GPU performance in this use case is very poor. We cannot determine why it is so slow. We only know that AMD/Radeon and Windows+Nvidia work much better than Linux+Nvidia.

Is there anything we can do to improve performance on Linux? Why is it so slow? Why are ATI/AMD (open-source drivers) and older cards so much better (in our case) than powerful Nvidia chips?

Kind Regards,
Marcin

nvidia-bug-report-clock-stuck-plugged-after-wakeup.log.gz (147 KB)
nvidia-bug-report-reboot-no-ac.gz (135 KB)

generix · October 2, 2017, 10:34am

There are currently driver bugs affecting mobile Quadros, see:
https://devtalk.nvidia.com/default/topic/1010612/linux/sluggish-performance-no-reclocking-ubuntu-17-04-kernel-4-12rc2-nvidia-quadro-m2200-driver-381-22-/
A workaround is to set nvidia-drm.modeset=1 as a kernel parameter.

gr00by · October 2, 2017, 10:58am

Hi.

Thank you for the answer.
The kernel parameter did not help, but I’ll try playing with it a bit later.

    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    58     0     0     -     -  2505   254 <-- sleep/wakeup
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0     -    59     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
    0     -    59     0     0     -     -  2505   254
    0     -    58     0     0     -     -  2505   254
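For anyone who wants to catch the moment the clock drops automatically, here is a minimal sketch in Python. It assumes the column layout shown in the header above (pclk is the last column, in MHz) and uses a shortened copy of the samples from this post; the 50% threshold is an arbitrary choice, not something taken from the driver.

```python
# Shortened copy of the monitoring output posted above.
SAMPLE = """\
    0     -    60     0     0     -     -  2505  1019
    0     -    60     0     0     -     -  2505  1019
    0     -    58     0     0     -     -  2505   254
# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0     -    59     0     0     -     -  2505   254
"""


def parse_pclk(text):
    """Return the pclk value (last column, MHz) of every data row."""
    clocks = []
    for line in text.splitlines():
        fields = line.split()
        # Skip empty lines and the repeated '#' header lines.
        if not fields or fields[0].startswith("#"):
            continue
        clocks.append(int(fields[-1]))
    return clocks


def first_drop(clocks, ratio=0.5):
    """Index of the first sample below ratio * highest clock seen so far, else None."""
    peak = 0
    for i, mhz in enumerate(clocks):
        peak = max(peak, mhz)
        if mhz < peak * ratio:
            return i
    return None


clocks = parse_pclk(SAMPLE)
print("pclk samples:", clocks)
print("first drop at sample:", first_drop(clocks))
```

With the samples above, the drop is reported at the third row, which is the one marked sleep/wakeup.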

gr00by · October 2, 2017, 11:22am

I see that nvidia-drm is not listed in /proc/modules. This may be due to the Bumblebee setup (I’m using the Intel GPU for common tasks and run Lightworks via optirun/primusrun). Maybe that’s why your hint does not work for me.
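The same check can be scripted. This is only a sketch: it reads /proc/modules, /proc/cmdline and /sys/module/nvidia_drm/parameters/modeset, and the helper names are made up for illustration. Note that /proc/modules lists the module with an underscore (nvidia_drm), and the kernel accepts the parameter with either a dash or an underscore.

```python
from pathlib import Path


def module_loaded(proc_modules_text, name="nvidia_drm"):
    """True if `name` is the first field of any line of /proc/modules content."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def modeset_requested(cmdline_text):
    """True if nvidia-drm.modeset=1 (dash or underscore) is on the kernel command line."""
    return any(
        token.replace("-", "_") == "nvidia_drm.modeset=1"
        for token in cmdline_text.split()
    )


def read_text(path):
    """Return the file content, or None if the file is missing or unreadable."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    modules = read_text("/proc/modules") or ""
    cmdline = read_text("/proc/cmdline") or ""
    modeset = read_text("/sys/module/nvidia_drm/parameters/modeset")
    print("nvidia_drm loaded:        ", module_loaded(modules))
    print("modeset=1 on command line:", modeset_requested(cmdline))
    print("modeset value in sysfs:   ", modeset.strip() if modeset else "n/a (module not loaded)")
```

If the parameter is on the command line but the module is never loaded, as can happen when the GPU is only powered on demand, the setting has nothing to act on.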

generix · October 2, 2017, 11:34am

Your logs look like you’re using PRIME, set up by Manjaro. This shouldn’t work with bumblebee.

gr00by · October 2, 2017, 11:46am

At first I was using Bumblebee and the issue was already there. I didn’t know about the bug you mentioned.
Then I reconfigured the setup manually to try PRIME. After that I went back to Bumblebee.

I’ll remove everything, reinstall and try again with your hint.

generix · October 2, 2017, 11:53am

Ok, that makes sense.
Just as a note, the Quadro bug only affects performance in normal operation and is not connected to the suspend/resume issue you’re facing. One user has a similar problem on a GTX 850M, with no solution:
https://devtalk.nvidia.com/default/topic/1023161/linux/33mhz-stuck-problem-after-waking-up-from-suspend/

gr00by · October 2, 2017, 12:40pm

Ah, OK. Thank you for the explanation.

As for the slow performance with Lightworks, it is related to rendering (export), and this is probably something more exotic. Until the sleep/wakeup issue occurs, the GPU works pretty well (nvidia-smi reports max clock values), and playback with complex GPU effects is far better than on the Intel GPU. However, rendering (encoding to a final MP4 file) is slower than on other systems (1m28sec; Windows: 32sec; Radeon on an old and cheap laptop: 1m10sec).

So, summing up: export (render) time on Linux is about 3x slower than on Windows on the same machine. We know from the Lightworks devs that there are differences between the Linux (OpenGL) and Windows (Direct3D) ports, and that Linux will always be slower due to the implementation. Our test results confirm that. But 3x is too much, because the AMD drivers seem to work better in this particular case, and the difference between Windows and Linux is smaller for ATI/AMD. I’m wondering why…

I will try to apply your hint, of course, and check again. Thank you for your support.

Mohandevir · October 2, 2017, 1:18pm

Sorry to interrupt, but I run an i7-3770 with a GTX960 and I have really bad stuttering issues with 384.90. The previous version of 384 was doing a great job.

Nothing shows on the FPS counter of the in-game benchmarks (Dirt Showdown, Dirt Rally), but it feels like frame ordering is sometimes messed up.

Tried Composition Pipeline on/off. Tried different UIs and flavors (KDE, Unity, Gnome, Ubuntu, Kubuntu, KDE-Neon), but nothing fixes it. With 381.22 everything is as expected.

Thanks.

Edit: Maybe I shouldn’t have posted in this thread, but was looking for something related to 384.90. Sorry I realize I should have created a new thread.

generix · October 2, 2017, 4:15pm

@Mohandevir: Please run nvidia-bug-report.sh, open a new thread and attach the created tar.gz file to it.

gr00by · October 5, 2017, 2:42pm

Lightworks uses OpenCL for effects processing, so could my problem be related to OpenCL? Is there anybody who can help me find the cause? I’m not the only one: there are plenty of Linux+Nvidia+Lightworks users complaining about the performance. Or is it maybe related to missing PRIME offloading?

generix · October 5, 2017, 5:53pm

As I understand it, Lightworks uses pixel shaders i.e. DirectX on Windows and OpenGL on Linux/macOS for effects. De-/encoding is done on CPU. So it’s rather a question of bad OpenGL optimization/use on the Lightworks side.
E.g. the host-to-GPU test is basically how many 1080p frames can be copied to GPU memory per second. Not that many, looking at the numbers.
PS: PRIME offload aka render offloading has nothing to do with performance but rather power saving/convenience.
GPU-Test: Lightworks - Easy to Use Pro Video Editing Software
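To put a frames-per-second figure from such a host-to-GPU test into bandwidth terms, a rough conversion can be sketched as follows. The 4 bytes per pixel (8-bit RGBA) and the 100 fps in the usage line are assumptions for illustration only; neither the pixel format Lightworks uploads nor the actual test numbers are stated in this thread.

```python
def frame_bytes(width=1920, height=1080, bytes_per_pixel=4):
    """Size of one uncompressed frame in bytes (assumed 8-bit RGBA by default)."""
    return width * height * bytes_per_pixel


def copy_bandwidth_mib_s(frames_per_second, **frame_kwargs):
    """Host-to-GPU throughput in MiB/s implied by a frames-per-second figure."""
    return frames_per_second * frame_bytes(**frame_kwargs) / 2**20


# Hypothetical figure, not a measured result from the linked test.
fps = 100
print(f"one 1080p frame: {frame_bytes() / 2**20:.2f} MiB")
print(f"{fps} frames/s -> {copy_bandwidth_mib_s(fps):.0f} MiB/s")
```

Comparing the resulting figure with what the bus can sustain gives a rough idea of whether the copy path or something else is the limit.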

gr00by · October 5, 2017, 8:50pm

Thanks for clarification about PRIME.

We know from the Lightworks devs that the Linux port has some tricky parts due to multithreading issues, and they confirmed worse performance (compared to Windows). But please look at my laptops’ results:

Dell Precision M3520 / i7 HQ / Quadro M620: 1m28s (384.90)
Dell Precision M3520 / i7 HQ / Intel GPU: 1m39s
Dell Vostro 5568 / i7U / GF940MX: 1m47s (375.82)
Dell Vostro 5568 / i7U / Intel GPU: 1m36s
Dell Inspiron / i5U / Radeon: 1m10s
Dell Inspiron / i5U / Intel GPU: 1m57s

The Precision is the most powerful, the Vostro is mid-range, and the Inspiron is the weakest, oldest and cheapest. As you can see, the weakest machine is the best in this test. It has a Radeon HD 8730M, a GPU about 2-3 times slower than the Quadro.

I can’t understand such differences. I would simply expect a better result for the Precision on Linux, around 50 seconds.

Windows times (for reference):

Precision: 0m32sec
Vostro: 1m19sec
Inspiron: not tested
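The Linux-to-Windows ratio for the two machines tested on both systems can be worked out directly from the times above. A minimal sketch (the `to_seconds` helper is just an illustrative name):

```python
import re


def to_seconds(stamp):
    """Convert strings such as '1m28s', '0m32sec' or '2m:58sec' to seconds."""
    match = re.fullmatch(r"(\d+)m:?(\d+)s(?:ec)?", stamp)
    if match is None:
        raise ValueError(f"unrecognised time: {stamp!r}")
    return int(match.group(1)) * 60 + int(match.group(2))


# (Linux with the Nvidia GPU, Windows) export times as posted above.
results = {
    "Precision": ("1m28s", "0m32sec"),
    "Vostro": ("1m47s", "1m19sec"),
}

for machine, (linux, windows) in results.items():
    ratio = to_seconds(linux) / to_seconds(windows)
    print(f"{machine}: Linux/Windows = {ratio:.2f}x")
```

This gives 2.75x for the Precision and about 1.35x for the Vostro, so the gap is much larger on the Quadro machine than on the 940MX one.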

generix · October 5, 2017, 9:13pm

I agree that this is hard to understand but it is also hard to tell the reason.
When exporting a project, it boils down to
decode (cpu)->copy(GL?)->generate/apply effects(GL)->copy(GL?)->encode(cpu)
Question is, where is the missing time spent? Maybe LW is using GL commands that are better optimized in Mesa, maybe the de-/encoding is gpu assisted on Mesa. Only LW devs can tell that.
What driver was used with the AMD? fglrx, Mesa?
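To answer the "where is the time spent" question, each stage of that chain would have to be timed separately. The sketch below only shows the bookkeeping: the stage bodies are stand-ins operating on plain byte buffers, and none of it reflects how Lightworks is actually implemented. In a real GL pipeline the GPU work is queued asynchronously, so meaningful numbers would additionally need a glFinish or timer queries around the GL stages.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)  # accumulated seconds per stage name


@contextmanager
def stage(name):
    """Add the wall-clock time spent inside the with-block to totals[name]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[name] += time.perf_counter() - start


def export_frame(frame):
    # Every stage body is a placeholder; a real exporter would call its
    # decoder, GL upload/readback and encoder here instead.
    with stage("decode (cpu)"):
        pixels = bytes(frame)
    with stage("copy to gpu"):
        uploaded = bytearray(pixels)
    with stage("effects (GL)"):
        processed = uploaded[::-1]
    with stage("copy from gpu"):
        readback = bytes(processed)
    with stage("encode (cpu)"):
        return len(readback)


for _ in range(10):
    export_frame(bytearray(1920 * 1080))

# Slowest stage first.
for name, seconds in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} {seconds * 1000:8.2f} ms")
```

Since Lightworks is closed source, only its developers could add this kind of instrumentation; from the outside, a GL tracing tool would be the closest substitute.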

generix · October 5, 2017, 10:00pm

Some random thoughts on the numbers:
940MX vs. HD 8730M makes sense, the latter has twice the memory transfer rate.
The Dell Precision numbers are completely off, even for the Intel compared to the Vostro. I don’t know what codecs you use for export, if to some raw format this might be limited by disk transfer, if encoded it should be faster due to more cores unless the codec only uses one or two threads.
In the general Intel case, it’s a question if real copies occur or if the buffers are simply mmapped since it’s all system memory. Would save a lot of time.

gr00by · October 6, 2017, 7:45am

Thank you, generix.

All configs have SSDs installed. Only the Vostro shows some slowdowns after a few seconds of benchmarking. Out of curiosity I also tried the Inspiron with its original 5400 rpm HDD, and there was no significant difference in results. All machines have at least 8GB RAM.

The export codec was always the same: MP4 720p (Lightworks “YouTube 720p” preset).

About “copies vs mmapped buffers”: I have no idea. I’ll ask the devs. Thank you very much for the hint.

> When exporting a project, it boils down to
> decode (cpu)->copy(GL?)->generate/apply effects(GL)->copy(GL?)->encode(cpu)

AFAIK it is something like that, in multiple threads.

> Question is, where is the missing time spent? Maybe LW is using GL commands that are better optimized in Mesa, maybe the de-/encoding is gpu assisted on Mesa. Only LW devs can tell that.

I’ve been trying to find the possible bottleneck for weeks, and I’m still learning the differences between the drivers. As you said, we can’t pinpoint the one function causing the slowdown, but we can try to shorten the list of suspects. I’m writing here not for a complete solution, but to get information or confirmation on whether the Nvidia chip or driver can be responsible for such bottlenecks, and to get some hints.

> What driver was used with the AMD? fglrx, Mesa?

xf86-video-ati 7.9.0 (“radeon” module)
mesa 17.1.8
glu 9.0.0

Pastebin with xorg.log: http://pastebin.com/raw/dTghJWYe

BR,
Marcin

gr00by · October 6, 2017, 9:14am

A small update: I’ve repeated the tests with kernel 4.12.14-1. Export test on Intel and on the Quadro M620: both 1m28s. Playback during editing is smoother with the Quadro, though.
