Hi, I have developed an application that performs some signal processing, implemented using CUDA kernels that I've written as well as CUDA library calls (e.g., cuFFT).
When I run the application on a GeForce RTX 2080 Ti machine, everything works fast and fine. But when I tried to run the same application on a GeForce RTX 3090 (the hardware is otherwise the same; I only replaced the GPU card), I ran into a performance problem: the application runs slower, does not meet the real-time requirements, and when profiled with Nsight Systems I see odd behavior in the RTX 3090 case.
Attached are screenshots from both the 2080 Ti and the 3090. As can be seen, a work batch on the 2080 Ti takes ~5 ms, while the same work on the 3090 takes more than 27 ms (with strange thread sleeps between the CUDA API calls and the executions).
I know that there is a relationship between the CUDA toolkit version, the NVIDIA driver version, and the GPU model,
but I couldn't find a combination that works in the RTX 3090 case.
Technical details:
RTX 2080ti
CUDA toolkit 10.1 (update 2)
Nvidia driver 456.71
works fine (satisfies real time requirements).
RTX 3090
I tested combinations of:
CUDA toolkit 10.1 / 11.2 / 11.3
and
Nvidia driver 456.71 / 466.47 / 462.59 (studio driver)
None of them worked (the application either could not make CUDA API calls and/or did not satisfy the real-time requirements).
More details:
Both systems have:
Windows 10, 64-bit
The GPU card is connected over PCIe Gen 3 x16 and is powered by two 8-pin PSU cables.
CPU is an AMD Ryzen Threadripper 2920X.
Motherboard is a Gigabyte X399 AORUS PRO-CF.
I have tried DDU to uninstall the previously installed driver(s), and it did not help.
Has anyone encountered this weird situation? What can I do to solve the problem?
For the RTX 3090 I recommend CUDA 11.1 or newer. That was the first CUDA version that officially supported cc 8.6 GPUs.
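If it helps, a quick standalone check along these lines (a minimal sketch, not tied to your application; device 0 is assumed) can confirm which compute capability, CUDA runtime, and driver version the application actually sees on each machine:

```cpp
// Minimal diagnostic sketch: print the compute capability of device 0 plus the
// CUDA runtime and driver versions, to verify the installed combination.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed\n");
        return 1;
    }

    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 11010 means CUDA 11.1
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the driver supports

    std::printf("Device: %s (cc %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("Runtime: %d, Driver: %d\n", runtimeVersion, driverVersion);
    return 0;
}
```

On the 3090 machine you should see compute capability 8.6 and, with CUDA 11.1 or newer, a runtime version of at least 11010.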
Since you are on Windows, I want to remind you not to evaluate performance with a debug build; switch to a release build. I don't know if that is applicable here, but it is a common mistake.
As I’ve mentioned - I tried both 11.2 and 11.3 toolkit versions.
Regarding the build type - I use CMake and build a release version using the -DCMAKE_BUILD_TYPE="RelWithDebInfo" option. Do you think I should use -DCMAKE_BUILD_TYPE="Release" instead?
I tried the Release build; it did not solve the problem. I'm still getting the strange thread sleeps (now labeled "Delayed execution" in the Nsight report), which do not happen in the 2080 Ti case.
I don’t have any further suggestions that are based strictly on the difference between 2080Ti and 3090. In fact if it were me, I would reverify the observation by shutting down the machine, pulling the 3090 out, popping the 2080Ti in, and without making any other changes of any kind, rerun the test for comparison.
Anything I say at this point you can easily throw a dart at with "but why doesn't it happen on the 2080 Ti?", so I can't explain it without more debug work. For me, non-interactive profiler screenshots are of only limited value. You might get more hints by asking questions about what you are seeing in the profiler on one of the profiler forums here.
If it were me, I would:
Reverify the observation as described above
Attempt to determine whether the problem is purely at the kernel execution level, purely at the activity/timeline level, or both.
If at the kernel level, I would seek to create a single-kernel launch reproducer (see the sketch after this list), then study that between the two cases. A profiler would help to explain why the same kernel launch is taking 1 ms on the 2080 Ti but 27 ms (or whatever) on the 3090.
If at the activity level, I would be suspicious of the overall thread usage for launching work. See whether a single-threaded work-launch test case behaves similarly or not.
Seek to create the simplest possible reproducer. This often provides clues.
Ask specific questions about the profiler on a profiler forum.
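As a rough sketch of the single-kernel reproducer idea above (the kernel, problem size, and launch configuration here are placeholders; substitute one of your own kernels and representative sizes), timing each launch with CUDA events on both cards will show whether a single launch already behaves differently:

```cpp
// Rough single-kernel reproducer sketch: time several launches of one kernel
// with CUDA events. The first launch includes one-time costs (context setup,
// possibly PTX JIT compilation); later launches should be steady state.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;                 // placeholder problem size
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 5; ++iter) {
        cudaEventRecord(start);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("launch %d: %.3f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the steady-state launch times are comparable on both cards but the full application is still slow on the 3090, that would point at the work-launch/activity side rather than at the kernels themselves.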
Also, others may chime in with better suggestions. This is a community.