Strange GPU performance

Hey there,

Currently I am developing a PoC for analyzing a few video streams on one GPU.
Software: CUDA 10 with a compatible cuDNN, EmguCV 4.0.1, YOLOv3-Tiny (Darknet), and Windows 10 Enterprise (version 1903).
Hardware of the first PC: i7-9700, Quadro P2000
Hardware of the second PC: i7-9700, GTX 1660 Ti

The problem is that on the Quadro P2000 I can easily run 7-8 real-time streams at ~12 fps with a 416×416 network size, and each stream loads the GPU by only 4%.
When I try to run the same streams on the GTX 1660 Ti, it can process only 2 streams in real time, and each loads the GPU by 12-15%. This is a very disappointing result, because I bought this card as a cheaper and (judging by the specs) more powerful alternative to the Quadro P2000.
Note: the streams run as separate processes, not within one process.

Maybe someone can suggest what the problem could be?
Does the GTX series not support parallel computing, or is it just a driver problem, or something else?
Should I change the OS or the CUDA version?

Workload for the Quadro P2000 and GTX 1660 Ti:

Thank you for any help,

Maksym

Generally speaking, hardware vendors structure their product offerings so that increased capability commands a higher selling price; your observation is consistent with that principle.

I am not aware of a workaround to this issue, but maybe some other forum participant is.


Without knowing exactly what the workload is here, it looks like you are possibly suffering from the GTX card's lack of dual copy engines, something available only on Quadro (and Tesla?) cards.

See this PDF: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/quadro-product-literature/Dual_copy_engines.pdf
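One quick way to check this on both machines (a minimal sketch, not from the thread; requires the CUDA toolkit to compile) is to query `cudaDeviceProp::asyncEngineCount` through the runtime API. A value of 2 indicates dual copy engines, i.e. host-to-device and device-to-host copies can overlap:

```cpp
// Print the copy-engine count reported by each CUDA device.
// asyncEngineCount == 2 means dual copy engines (H2D and D2H can overlap).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s, asyncEngineCount = %d\n",
               i, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```

Running this on the P2000 and the 1660 Ti would confirm or rule out the copy-engine difference directly.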

Regards,

Richard


Some more thoughts - apologies if you’re already aware of the following.

There is another potentially significant factor here, in addition to Quadro vs. GeForce: the architecture of the cards. The Quadro is Pascal (compute capability 6.1) while the GTX is Turing (7.5), and there are significant differences between the two in instruction throughput in some areas; BFE and BFI are one extreme.

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

So, one relatively easy way to establish whether or not the Quadro has an inherent advantage over GeForce would be to repeat your test using a GeForce Pascal card. The GTX 1060 would be a good choice, as it is built on the same GP106 GPU as the Quadro P2000 and so is functionally almost identical. The 6GB version has GP106's full 10 SMs; the 3GB version has one SM disabled, leaving it close to (though not identical to) the P2000, which has two disabled.

Regardless, it would certainly be worthwhile profiling on both cards, as this should help pinpoint where the bottleneck is on the 1660 Ti.
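For example (a sketch; `your_app.exe` is a placeholder for one detector instance), nvprof ships with CUDA 10 and can show whether the time is going into kernels, memory copies, or gaps between them:

```shell
# Per-launch timeline: every kernel and memcpy with timestamps,
# so serialization and idle gaps become visible.
nvprof --print-gpu-trace your_app.exe

# Summary mode: total time per kernel and per memcpy direction.
nvprof your_app.exe
```

Comparing the two traces side by side should make the bottleneck on the 1660 Ti stand out.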

Regards,

Richard
