Strange GPU performance

Hey there,

Currently I am developing a PoC for analyzing a few video streams on one GPU.
Software: CUDA 10 with a compatible cuDNN, EmguCV 4.0.1, YOLOv3-Tiny (Darknet), and Windows 10 Enterprise (version 1903).
Hardware of the first PC: i7-9700, Quadro P2000
Hardware of the second PC: i7-9700, GTX 1660 Ti

The problem is that on the Quadro P2000 I can easily run 7-8 real-time streams at ~12 fps with a 416×416 network size, and each stream loads the GPU by only 4%.
When I try to run the same streams on the GTX 1660 Ti, it can process only 2 streams in real time, and each loads the GPU by 12-15%. This is a very disappointing result, because I bought this card as a cheaper and (judging by the specs) more powerful alternative to the Quadro P2000.
Note: the streams run as separate processes, not within one process.

Maybe someone can suggest what the problem could be?
Does the GTX series not support parallel computing, or is it just a driver problem, or something else?
Should I change the OS or the CUDA version?

Workload for the Quadro P2000 and GTX 1660 Ti:

Thank you for any help,

Maksym

Generally speaking, hardware vendors structure their product offerings so that increased capability commands a higher selling price; your observation is consistent with that principle.

I am not aware of a workaround to this issue, but maybe some other forum participant is.


Without knowing exactly what the workload is here, it looks like you are possibly suffering from the GTX card's lack of dual copy engines, something available only on Quadro (and Tesla?) cards.

See this PDF: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/Dual_copy_engines.pdf
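One quick way to check this on both machines (a minimal sketch, not from the thread; requires the CUDA toolkit to compile) is to query `cudaDeviceProp::asyncEngineCount` through the runtime API. A value of 2 indicates dual copy engines, i.e. host-to-device and device-to-host copies can overlap:

```cpp
// Print the copy-engine count reported by each CUDA device.
// asyncEngineCount == 2 means dual copy engines (H2D and D2H can overlap).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, asyncEngineCount = %d\n",
               i, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

Running this on the P2000 and the 1660 Ti would confirm or rule out the copy-engine difference directly.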

Regards,

Richard


Some more thoughts - apologies if you’re already aware of the following.

There is another potentially significant factor here, in addition to Quadro vs. GeForce: the architecture of the cards. The Quadro is Pascal (compute capability 6.1) while the GTX is Turing (7.5), and there are significant differences between the two in instruction throughput in some areas; BFE and BFI are one extreme.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

So, one relatively easy way to establish whether or not the Quadro has an inherent advantage over GeForce would be to repeat your test using a GeForce Pascal card. The GTX 1060 would be a good choice, as it is built on the same GP106 GPU as the Quadro P2000 and so is functionally almost identical. The 6GB version has GP106's full 10 SMs; the 3GB version has one SM disabled, leaving it close to (though not identical to) the P2000, which has two disabled.

Regardless, it would certainly be worthwhile profiling on both cards, as this should help pinpoint where the bottleneck is on the 1660 Ti.
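For example (a sketch; `your_app.exe` is a placeholder for one detector instance), nvprof ships with CUDA 10 and can show whether the time is going into kernels, memory copies, or gaps between them:

```shell
# Per-launch timeline: every kernel and memcpy with timestamps,
# so serialization and idle gaps become visible.
nvprof --print-gpu-trace your_app.exe

# Summary mode: total time per kernel and per memcpy direction.
nvprof your_app.exe
```

Comparing the two traces side by side should make the bottleneck on the 1660 Ti stand out.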

Regards,

Richard
