RTX 3090 runs slower than RTX 2080 Ti

Hi, I have developed an application that performs some signal processing. It is implemented using CUDA kernels that I’ve written as well as CUDA API calls (e.g. FFT).

When I run the application on a GeForce RTX 2080 Ti machine, everything runs fast and without problems. When I ran the same application on a GeForce RTX 3090 (the rest of the hardware is the same; I only replaced the GPU card), I hit a performance problem: the application runs slower and does not meet the real-time requirements. When I profile it with Nsight Systems, I see odd behavior in the RTX 3090 case.

Attached are screenshots from both the 2080 Ti and the 3090. As can be seen, a work batch on the 2080 Ti takes ~5 ms, while the same work on the 3090 takes more than 27 ms (with strange thread sleeps between the CUDA API calls and the executions).

I know that the CUDA toolkit version, the NVIDIA driver version and the GPU model are related,
but I couldn’t find a combination that works well in the RTX 3090 case.
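
To illustrate that relation, here is a minimal C++ sketch (not code from the application) of the kind of check that seems relevant here. It assumes the RTX 2080 Ti is compute capability 7.5, first targeted natively by CUDA 10.0, and the RTX 3090 is compute capability 8.6, first targeted natively by CUDA 11.1. The helper names `cudaVersion`, `minToolkitFor` and `hasNativeSupport` are made up for this example; the version integers follow the `1000 * major + 10 * minor` encoding that `cudaRuntimeGetVersion` reports.

```cpp
// Version integers use the CUDA encoding 1000 * major + 10 * minor,
// e.g. 10010 for toolkit 10.1 and 11020 for toolkit 11.2.
constexpr int cudaVersion(int major, int minor) {
    return 1000 * major + 10 * minor;
}

// Earliest toolkit assumed to generate native code for a compute capability.
// Only the two GPUs from this thread are covered; anything else returns -1.
constexpr int minToolkitFor(int ccMajor, int ccMinor) {
    return (ccMajor == 7 && ccMinor == 5) ? cudaVersion(10, 0)   // RTX 2080 Ti
         : (ccMajor == 8 && ccMinor == 6) ? cudaVersion(11, 1)   // RTX 3090
         : -1;
}

// True when the given toolkit can build native code for the given GPU.
constexpr bool hasNativeSupport(int toolkit, int ccMajor, int ccMinor) {
    return minToolkitFor(ccMajor, ccMinor) > 0 &&
           toolkit >= minToolkitFor(ccMajor, ccMinor);
}

// Compile-time checks for the combinations listed in this post.
static_assert(hasNativeSupport(cudaVersion(10, 1), 7, 5),
              "toolkit 10.1 targets the RTX 2080 Ti natively");
static_assert(!hasNativeSupport(cudaVersion(10, 1), 8, 6),
              "toolkit 10.1 predates the RTX 3090 architecture");
```

Under these assumptions, `hasNativeSupport(cudaVersion(10, 1), 8, 6)` is false, so a build made with toolkit 10.1 would have to rely on the driver JIT-compiling embedded PTX on the RTX 3090, while builds made with 11.2 or 11.3 can include native code for it. This only narrows down which combinations are worth testing; it does not by itself explain the slowdown.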

Technical details:

  1. RTX 2080 Ti
    CUDA toolkit 10.1 (Update 2)
    NVIDIA driver 456.71

Works fine (satisfies the real-time requirements).

  2. RTX 3090
    I tested combinations of:
    CUDA toolkit 10.1 / 11.2 / 11.3
    and
    NVIDIA driver 456.71 / 466.47 / 462.59 (Studio driver)

None of them worked (the application could not make CUDA API calls, did not satisfy the real-time requirements, or both).

More details:
Both systems have:
Windows 10 64-bit
The GPU card is connected over PCIe Gen 3 x16 and to two 8-pin PSU cables.
CPU is AMD Ryzen Threadripper 2920X
Motherboard is X399 AORUS PRO-CF (Gigabyte)

I have tried using DDU to uninstall the previously installed driver(s), and it did not help.

Has anyone encountered this weird situation? What can I do to solve the problem?

Thanks


Hi lior4,
I don’t know the reason for the phenomenon you’re observing, but here are a couple of things to try that might shed more light on it or provide clues:

  1. Enable callstack collection and capture a trace on the RTX 3090. In the timeline view, hover the mouse over the thread ranges where the thread is in a blocking wait state. The callstack of the wait call should appear; see which call is performing the wait. To get more detail, you can right-click the report, select “Resolve symbols…”, and provide the URL of a symbol server such as Microsoft’s Symbol information as well as a local cache directory.
  2. Update Nsight Systems to the recently released 2021.3.1, enable GPU Metrics collection and capture a trace with the RTX 3090 setup. See if any of the GPU metrics provide a clue. Here is a short video that highlights this feature.

Good luck,

Doron