Cuda performance - Parallel computing

This post was originally opened with NVIDIA Developer support, and I was asked to move it to the CUDA forum. The incident number is 221020-000029.
Response By Email (Vinoy) (10/20/2022 12:14 PM)

Hello, we are using an MSI GP66 Leopard laptop with an RTX 3080 to develop a 3D application for a tooth scanner. We use CUDA to perform parallel computing on the GPU. We can measure the performance of our application while it runs. The average time we measured for the CUDA Correlation step (the part of the application that runs on the GPU) is ~40 msec.
I installed the NVIDIA development tools (Nsight Systems, Nsight Compute and Nsight Graphics) on this laptop. After that, the CUDA Correlation time dropped to 20 msec, which is a good thing for us.
Doing the same on another laptop with a GTX 1070 didn’t change the CUDA Correlation time at all. Any idea what might have cut the time in half, and why it had no such effect on the GTX 1070?

The RTX 3080 laptop is an MSI GP66 Leopard 11UH-032US-BB71180H16GXXDX10MA
Device ID: 10DE249C12FB1462, Part no. 47350010, driver version: 522.30

The GTX 1070 laptop is a GE63VR 7RF(Raider)-075US-BB7770H16G1T0DX10MH, driver version: 511.65

“Objection your honor, calls for speculation.”

There is too little information provided in the question to do anything but speculate wildly. One thing you could try is to make sure that both systems use exactly the same version of all NVIDIA software components, and that executables are built in exactly the same way on both systems, except for the GPU architecture specification during compilation (RTX 3080: sm_86, GTX 1070: sm_61). Ideally you would use the latest CUDA version (11.8) and the driver that comes with it on both systems.

Note that software may behave differently and have different performance characteristics depending on GPU architecture, so there are no guarantees that any performance-related observations would apply in the same way to two GPUs of different architecture.

What more information do you need? Those two laptops are used for testing, not development, and they get the same application. The RTX 3080 laptop runs Windows 11 and the GTX 1070 laptop runs Windows 10. The 3080 laptop has the CUDA 11.6 and 11.5 runtimes and the CUDA 11.6 and 11.5 development packages installed; the 1070 laptop doesn’t have any CUDA installed.
I understand that behaviour differs with the GPU, but my question is what caused the run time to drop by half on the same laptop (after installing the CUDA suites), and why the second laptop (1070) didn’t behave the same way.

There are two systems that differ in terms of both hardware and software, and the behavior of an unknown application differs between the two, as determined by an undisclosed measurement methodology. That is (1) not a well-controlled experiment and (2) calls into question the validity of the reported observation (for me at least).
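To make the point about measurement methodology concrete, here is a minimal sketch of a timing harness in plain C++. The name `median_ms` and its parameters are hypothetical, not taken from the application under discussion. It assumes the timed step blocks until all of its work is finished; in a CUDA application that means synchronizing with the GPU before the step returns.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: calls `step` `warmup` times without timing it, then
// times `runs` further calls and returns the median wall-clock time in ms.
double median_ms(const std::function<void()>& step, int warmup, int runs) {
    assert(runs > 0);

    // Warm-up calls absorb one-time costs (lazy initialization, JIT
    // compilation, cold caches) so they do not distort the samples.
    for (int i = 0; i < warmup; ++i) step();

    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(runs));
    for (int i = 0; i < runs; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        step();  // must not return before the work is actually complete
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }

    // The median is less sensitive to occasional outliers than the mean.
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}
```

Running the same harness, with the same warm-up and run counts, on both laptops would at least make the two numbers comparable.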

I was merely suggesting to make the software configuration of both systems as similar as possible, by standardizing on the latest CUDA version and associated driver, and paying close attention to the GPU target architectures specified in the build. Then revisit the alleged performance issue. Make sure to run an identical binary on both systems and use a fat binary with SASS code for both of the relevant GPU architectures.
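As a rough illustration of why SASS for both architectures matters, the sketch below models the general compatibility rules in plain C++: machine code (SASS) runs on a GPU with the same major compute capability and an equal or higher minor version, while PTX can be JIT-compiled at load time for any GPU of equal or higher compute capability. The names `Arch`, `LoadPath` and `load_path` are hypothetical, and the model ignores special cases; it does not query a real binary.

```cpp
#include <vector>

// Compute capability, e.g. {8, 6} for sm_86 / compute_86.
struct Arch {
    int cc_major;
    int cc_minor;
};

enum class LoadPath { NativeSass, JitFromPtx, CannotRun };

// Hypothetical model: given the device and the SASS and PTX targets embedded
// in a fat binary, report how the kernels would be loaded on that device.
LoadPath load_path(Arch device,
                   const std::vector<Arch>& sass,
                   const std::vector<Arch>& ptx) {
    // SASS is usable only within the same major version, and only if it was
    // built for an equal or lower minor version than the device.
    for (const Arch& a : sass) {
        if (a.cc_major == device.cc_major && a.cc_minor <= device.cc_minor)
            return LoadPath::NativeSass;
    }
    // PTX is forward compatible: the driver can JIT-compile it for any
    // device whose compute capability is equal or higher.
    for (const Arch& a : ptx) {
        if (a.cc_major < device.cc_major ||
            (a.cc_major == device.cc_major && a.cc_minor <= device.cc_minor))
            return LoadPath::JitFromPtx;
    }
    return LoadPath::CannotRun;
}
```

Under this model, a hypothetical build containing only sm_52 SASS and compute_52 PTX would take the JIT path on both an sm_61 and an sm_86 device, whereas a build made with `-gencode arch=compute_61,code=sm_61 -gencode arch=compute_86,code=sm_86` would load native SASS on both. What the project in question actually targets is not stated here.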

The use of different GPU architectures could mean for example that different library routines are selected or that different compiler optimizations are applied, and the details thereof could further change with CUDA version. In other words, software is not necessarily orthogonal to hardware.

Hi njuffa, thanks for your response. Maybe I didn’t explain myself well. Those two laptops are for testing the software, which is developed on a third machine, a desktop with an NVIDIA GTX 970, and compiled with VS2022 with CUDA 11.6 installed. Do you think that, to get the highest performance on the RTX 3080, I need to compile the software on a machine with an RTX 3080?

It doesn’t matter what GPU is in the system on which you build. My point is: Make sure you build an executable once, then distribute that executable to all the other machines.

In general, the performance of CUDA-accelerated applications will tend to increase as one upgrades the toolchain and libraries to newer versions. However, it is natural that any such improvements may not apply to all GPU architectures equally. Some improvements to machine-specific optimizations may not apply at all to older GPU architectures that lack particular features, for example when exploiting Tensor Cores.

The suggestion to switch to CUDA 11.8 is driven by this general trend that performance improves over time, and to ensure that any findings are relevant at this time. In the hypothetical case that you were to report a performance bug to NVIDIA, the first thing they would likely suggest is trying the latest shipping version, as there would be little point in spending time on investigating a shortcoming that may have been addressed already.

I’m not optimistic that anyone can describe the behavior of code you haven’t shown.

Some possibilities that I can think of that might be relevant to your observation are:

  • You installed a new CUDA version or GPU driver version when you installed the tools.
  • When you installed the tools, you somehow switched from building a debug version of the code to a release version of the code.
  • When you installed the tools, you somehow switched from building a version of your code that required JIT compilation to run on the GPU in question, to one that didn’t.
  • Installation somehow changed the Windows hardware-accelerated GPU scheduling setting.

If none of those are applicable, I don’t have any further comments/ideas/suggestions. I also wouldn’t be able to respond any further to an incomplete test case.

OK, thanks for your response. So you suggest installing CUDA 11.8 on the development machine as a first step in the investigation. Do I also need to install CUDA 11.8 on the test machines?

You don’t need to install the CUDA toolkit on the test machines, but you should make sure all machines use the same driver package.

I’ll stop here.