CUDA performance problems depending on the system

Hi, I have 2 machines that are exactly the same hardware-wise.
One runs Debian, and the other runs a custom Linux image built with Buildroot.

The problem I’m experiencing is that my application gets roughly 2x the performance on the Debian machine compared to the custom-image machine.
I’m trying to understand what could cause this difference.
The CUDA version is 7.5
The driver versions are:
Custom: 367.27

Some information that may be relevant:
If I run deviceQuery (from the CUDA samples) on both machines, the results are almost, but not exactly, the same.
The first difference:
Debian:
Total amount of global memory: 3069 MBytes
Custom:
Total amount of global memory: 3008 MBytes
(Though I doubt that this difference can cause the mentioned difference in performance.)
The second difference:
Debian:
Run time limit on kernels: Yes
Custom:
Run time limit on kernels: No
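For reference, both fields that differ can be read directly from the runtime API, without the full deviceQuery sample. This is a minimal sketch, assuming device 0 is the GPU in question:

```cuda
// Print the two deviceQuery fields that differed between the machines,
// using cudaGetDeviceProperties on device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Total amount of global memory: %.0f MBytes\n",
           prop.totalGlobalMem / 1048576.0);
    // kernelExecTimeoutEnabled is set when the GPU drives a display
    // (i.e., when X is configured to use it), which is why it differs.
    printf("Run time limit on kernels: %s\n",
           prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}
```

The `kernelExecTimeoutEnabled` flag tracks whether the GPU is driving a display, so the Yes/No difference by itself points at the X server configuration rather than at a performance problem.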

If I run the bandwidthTest sample, the results are more or less the same for Host to Device Bandwidth and Device to Host Bandwidth, but they can differ significantly for
Device to Device Bandwidth, 1 Device(s)
For that last entry, the value on Debian is in the area of 100,000 MB/s (though sometimes it drops to around 63,000).
On Custom it is consistently around 63,000 MB/s.
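Device-to-device bandwidth is sensitive to the GPU's clock state, so it can be useful to reproduce the measurement in isolation. Below is a minimal sketch in the spirit of the bandwidthTest sample; the 64 MiB buffer size and 100 iterations are arbitrary choices of mine, not the sample's exact parameters:

```cuda
// Time repeated device-to-device copies and report effective bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 64 * 1024 * 1024;  // 64 MiB per buffer (assumed size)
    const int iterations = 100;

    void *src = nullptr, *dst = nullptr;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iterations; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Each copy both reads and writes 'bytes', hence the factor of 2
    // (the same convention bandwidthTest uses for device-to-device).
    double mb = 2.0 * bytes * iterations / 1e6;
    printf("Device to Device Bandwidth: %.1f MB/s\n", mb / (ms / 1000.0));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dst);
    return 0;
}
```

If the number jumps between runs, it typically reflects the GPU clocking up and down between idle and boost states, which would also explain Debian sometimes dropping to the same ~63,000 MB/s figure as Custom.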

I would be glad if you could help me with advice on what I should investigate further and what the problem could be.
Thank you!

Based on this:

it seems like in one case the X-server is configured to use that GPU and in the other case it is not.

I don’t really know if that would explain anything, but it’s possible it could show a difference machine-to-machine in performance. However, I would expect the machine with X configured would be slower, not faster.

Yes, that would be my expectation as well. I’ve tried configuring the Custom system to use the nvidia driver for X, but encountered some problems: the GLX module can’t be loaded. Maybe it’s related…

No, it turns out that GLX not loading was an Xorg configuration problem.
Xorg still won’t start, but now for some other reason.

Interesting fact: if I start my application while X is in this half-started state, performance drops even more (more than a 4x decrease relative to Debian in total).


I’ve just tried updating to CUDA 8 on the custom machine; it hasn’t changed the described situation.
It shows a 2x or greater performance reduction on the Custom system for every type of operation it can measure
(and also almost 2x on device-to-device memory copy).

CUDA is really only tested to work correctly on the configurations listed in the install guide. It’s not expected to work correctly on every possible custom configuration.

Well, it’s not like there is something magical about those systems. If one can be configured to work correctly with CUDA, so can the other. The question is what key difference is causing the problems here.

OK, what finally helped was fixing the way the NVIDIA driver was installed (in the Buildroot .mk file)
and running the application after starting the X server.
Though this remains the same:
Run time limit on kernels: No