RTX 3090 runs slower than RTX 2080 Ti

Hi, I have developed an application that performs some signal processing, implemented using CUDA kernels that I wrote myself as well as CUDA library calls (e.g. FFT).

When I run the application on a GeForce RTX 2080 Ti machine, everything works fast and fine. But when I tried to run the same application on a GeForce RTX 3090 (the rest of the hardware is identical; I only replaced the GPU card), I ran into a performance problem: the application runs slower and does not meet the real-time requirements, and when profiled with Nsight Systems I see odd behavior in the RTX 3090 case.

Attached are screenshots from both the 2080 Ti and the 3090. As can be seen, a work batch on the 2080 Ti takes ~5 ms, while the same work on the 3090 takes more than 27 ms (with strange thread sleeps between the CUDA API calls and the kernel executions).
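For what it's worth, here is a minimal sketch of how I cross-check the batch time with CUDA events, outside of Nsight Systems. The kernel `processBatch` is a hypothetical placeholder for one work batch, not my actual code:

```cuda
// Sketch: time one "work batch" with CUDA events (host-side wall clock
// can be misleading when launches are asynchronous).
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for one batch of signal-processing work.
__global__ void processBatch(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record events around the batch, then synchronize on the stop event.
    cudaEventRecord(start);
    processBatch<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("batch took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

The event-based numbers agree with what Nsight Systems shows on both cards.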

I know there is a relationship between the CUDA toolkit version, the NVIDIA driver version, and the GPU model, but I couldn't find a combination that works correctly with the RTX 3090.

Technical details:

  1. RTX 2080 Ti
    CUDA toolkit 10.1 (update 2)
    NVIDIA driver 456.71

    Works fine (satisfies the real-time requirements).

  2. RTX 3090
    I tested combinations of:
    CUDA toolkit 10.1 / 11.2 / 11.3
    NVIDIA driver 456.71 / 466.47 / 462.59 (Studio driver)

    None of them worked (the application either failed on CUDA API calls or did not satisfy the real-time requirements).
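To confirm which toolkit/driver combination is actually active on each system, I use a small query sketch like the one below (standard CUDA runtime API calls; nothing here is specific to my application). One thing I noticed while testing: the RTX 3090 reports compute capability 8.6 (Ampere), and as far as I understand that architecture is only targeted natively starting with CUDA 11.1, so I assume toolkit 10.1 cannot generate native code for it:

```cuda
// Sketch: print the driver-supported CUDA version, the runtime version,
// and the device's compute capability.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVer = 0, runtimeVer = 0;
    cudaDriverGetVersion(&driverVer);    // max CUDA version the driver supports
    cudaRuntimeGetVersion(&runtimeVer);  // CUDA version of the linked runtime
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driverVer / 1000, (driverVer % 1000) / 10,
           runtimeVer / 1000, (runtimeVer % 1000) / 10);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // On the RTX 3090 this prints compute capability 8.6 (Ampere).
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}
```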

More details:
Both systems have:
Windows 10 64-bit
The GPU card is connected over PCIe Gen 3 x16 and powered by two 8-pin PSU cables.
CPU: AMD Ryzen Threadripper 2920X
Motherboard: Gigabyte X399 AORUS Pro-CF

I have tried using DDU to uninstall the previously installed driver(s), and it did not help.

Has anyone encountered this weird situation? What can I do to solve this problem?