Cuda performance - Parallel computing

user102399 · October 23, 2022, 7:09am

This post was originally opened with NVIDIA Developer support, and I was asked to move it to the CUDA forum. The incident number is 221020-000029.
Response By Email (Vinoy) (10/20/2022 12:14 PM)

Hello, we are using an MSI GP66 Leopard laptop with an RTX 3080 to develop a 3D application for a tooth scanner. We use CUDA to perform parallel computing on the GPU. We can measure the performance of our application while it runs. The average time we measured for the CUDA Correlation step (the part of the application that runs on the GPU) is ~40 msec.
I then installed the NVIDIA development tools (Nsight Systems, Nsight Compute and Nsight Graphics) on this laptop. After that, the CUDA Correlation time dropped to 20 msec, which is a good thing for us.
Doing the same on another laptop with a GTX 1070 didn’t change the CUDA Correlation time at all. Any idea what might have cut the time in half, and why it had no such effect on the GTX 1070?

The RTX 3080 laptop is an MSI GP66 Leopard 11UH-032US-BB71180H16GXXDX10MA.
Device ID: 10DE249C12FB1462, Part no. 47350010, driver version: 522.30

The GTX 1070 laptop is a GE63VR 7RF(Raider)-075US-BB7770H16G1T0DX10MH; its driver version is 511.65.

njuffa · October 23, 2022, 10:39am

“Objection your honor, calls for speculation.”

There is too little information provided in the question to do anything but speculate wildly. One thing you could try is to make sure that both systems use exactly the same version of all NVIDIA software components, and that executables are built in exactly the same way on both systems, except for the GPU architecture specification during compilation (RTX 3080: sm_86, GTX 1070: sm_61). Ideally you would use the latest CUDA version (11.8) and the driver that comes with it on both systems.

Note that software may behave differently and have different performance characteristics depending on GPU architecture, so there are no guarantees that any performance-related observations would apply in the same way to two GPUs of different architecture.

user102399 · October 24, 2022, 11:28am

What more information do you need? Those 2 laptops are used for testing, not development; they get the same application. The RTX 3080 laptop runs Windows 11 and the GTX 1070 laptop runs Windows 10. The 3080 laptop has the CUDA 11.6 and 11.5 runtime and the CUDA 11.6 and 11.5 development components installed; the 1070 laptop doesn’t have any CUDA installed.
I understand that behaviour can differ depending on the GPU, but my question is what caused the run time to drop by half on the same laptop (after installing the CUDA suite), and why the second laptop (1070) didn’t behave the same way.

njuffa · October 24, 2022, 11:48am

There are two systems that differ in terms of both hardware and software, and the behavior of an unknown application differs between the two, as determined by an undisclosed measurement methodology. That is (1) not a well-controlled experiment and (2) calls into question the validity of the reported observation (for me at least).
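A measurement along the following lines would make the numbers easier to compare between machines. This is only a sketch: `time_step`, `summarize` and `fake_correlation_step` are hypothetical names, and the placeholder workload stands in for the application's CUDA correlation step, which has not been shown in this thread. Kernel launches are asynchronous, so the timed callable has to block until the GPU has finished (for example by synchronizing the device); otherwise only launch overhead is measured.

```python
import statistics
import time


def time_step(step, warmup=5, repeats=50):
    """Return per-call wall-clock times in milliseconds for a blocking callable."""
    for _ in range(warmup):
        # Discard the first calls: they can include context creation and JIT compilation.
        step()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        step()  # assumed not to return until the GPU work has finished
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def summarize(samples):
    """Report median and spread so that a single outlier does not decide the result."""
    return {
        "median_ms": statistics.median(samples),
        "min_ms": min(samples),
        "max_ms": max(samples),
    }


def fake_correlation_step():
    # Hypothetical placeholder; the real application would run its CUDA step here.
    time.sleep(0.002)


print(summarize(time_step(fake_correlation_step)))
```

Reporting the warm-up count, the repeat count and the median alongside any average makes it possible to tell a real change from one-time startup cost.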

I was merely suggesting to make the software configuration of both systems as similar as possible, by standardizing on the latest CUDA version and associated driver, and paying close attention to the GPU target architectures specified in the build. Then revisit the alleged performance issue. Make sure to run an identical binary on both systems and use a fat binary with SASS code for both of the relevant GPU architectures.
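As a rough sketch of what such a build could look like, the snippet below assembles an nvcc command line that embeds SASS for both sm_61 and sm_86 in one executable. The file names `correlation.cu` and `correlation.exe` and the helper `fat_binary_command` are hypothetical; only the `-gencode arch=compute_XX,code=sm_XX` flag syntax is nvcc's own. The extra PTX clause for the newest target is an optional addition, not something the build strictly needs for these two GPUs.

```python
def fat_binary_command(source, output, sm_archs):
    """Build an nvcc command that embeds SASS for every listed architecture."""
    cmd = ["nvcc", "-o", output]
    for sm in sm_archs:
        # One -gencode clause per target: compile for compute_XX, emit sm_XX machine code.
        cmd += ["-gencode", f"arch=compute_{sm},code=sm_{sm}"]
    newest = max(sm_archs)
    # Optionally embed PTX for the newest target so that later GPUs can JIT-compile it.
    cmd += ["-gencode", f"arch=compute_{newest},code=compute_{newest}"]
    cmd.append(source)
    return cmd


# GTX 1070 is sm_61 and RTX 3080 is sm_86, the architectures named earlier in the thread.
print(" ".join(fat_binary_command("correlation.cu", "correlation.exe", [61, 86])))
```

With SASS present for each GPU, neither test machine should need to JIT-compile the kernels at startup, which removes one variable from the comparison.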

The use of different GPU architectures could mean for example that different library routines are selected or that different compiler optimizations are applied, and the details thereof could further change with CUDA version. In other words, software is not necessarily orthogonal to hardware.

user102399 · October 25, 2022, 11:29am

Hi njuffa, thanks for your response. Maybe I didn’t explain myself clearly. Those 2 laptops are for testing the software, which is developed on a third machine, a desktop with an NVIDIA GTX 970, and compiled with VS2022 with CUDA 11.6 installed. Do you think that, to get the highest performance when running this software on the RTX 3080, I need to compile it on the RTX 3080?

njuffa · October 25, 2022, 5:12pm

It doesn’t matter what GPU is in the system on which you build. My point is: Make sure you build an executable once, then distribute that executable to all the other machines.

In general, the performance of CUDA-accelerated applications will tend to increase as one upgrades the toolchain and libraries to newer versions. However, it is natural that any such improvements may not apply to all GPU architectures equally. Some improvements to machine-specific optimizations may not apply at all to older GPU architectures that lack particular features, for example when exploiting Tensor Cores.

The suggestion to switch to CUDA 11.8 is driven by this general trend that performance improves over time, and to ensure that any findings are relevant at this time. In the hypothetical case that you were to report a performance bug to NVIDIA, the first thing they would likely suggest is trying the latest shipping version, as there would be little point in spending time on investigating a shortcoming that may have been addressed already.

Robert_Crovella · October 25, 2022, 8:45pm

I’m not optimistic that anyone can describe the behavior of code you haven’t shown.

Some possibilities that I can think of that might be relevant to your observation are:

You installed a new CUDA version or GPU driver version when you installed the tools.
When you installed the tools, you somehow switched from building a debug version of the code to a release version of the code.
When you installed the tools, you somehow switched from building a version of your code that required JIT compilation to run on the GPU in question, to one that didn’t.
Installation somehow changed the Windows hardware-accelerated GPU scheduling setting.

If none of those are applicable, I don’t have any further comments/ideas/suggestions. I also wouldn’t be able to respond any further to an incomplete test case.

user102399 · October 26, 2022, 7:35am

OK, thanks for your response. So you suggest installing CUDA 11.8 on the development machine as a first step in the investigation. Do I also need to install CUDA 11.8 on the test machines?

njuffa · October 26, 2022, 11:17am

You don’t need to install the CUDA toolkit on the test machines, but you should make sure all machines use the same driver package.
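One possible way to check this is to collect the driver version from each machine with `nvidia-smi` and compare the results. The sketch below assumes `nvidia-smi` is on the PATH of each machine; the helper names are hypothetical, and the GPU name strings in the example are illustrative rather than actual tool output. The two driver versions are the ones reported in the first post.

```python
import subprocess


def parse_gpu_report(csv_text):
    """Parse 'name, driver_version' lines as printed by nvidia-smi in CSV mode."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma only, in case a GPU name ever contains one.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        gpus.append({"name": name, "driver": driver})
    return gpus


def query_local_gpus():
    """Ask nvidia-smi on this machine for GPU name and driver version."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_report(out)


def same_driver(reports):
    """True if every machine's report lists one and the same driver version."""
    versions = {gpu["driver"] for report in reports for gpu in report}
    return len(versions) == 1


# Illustrative input: driver versions from the first post, made-up name strings.
rtx_laptop = parse_gpu_report("RTX 3080 Laptop, 522.30")
gtx_laptop = parse_gpu_report("GTX 1070, 511.65")
print(same_driver([rtx_laptop, gtx_laptop]))
```

On a real machine, `query_local_gpus()` would replace the hand-written strings.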

I’ll stop here.
