Hi, I have developed an application that performs some signal processing, implemented using CUDA kernels that I've written as well as CUDA library calls (e.g., cuFFT).
When I run the application on a GeForce RTX 2080 Ti machine, everything works fast and fine. But when I tried to run the same application on a GeForce RTX 3090 (the hardware is otherwise the same; I only replaced the GPU card), I ran into a performance problem: the application runs slower, does not meet the real-time requirements, and when profiled with Nsight Systems I see odd behavior in the RTX 3090 case.
Attached are screenshots from both the 2080 Ti and the 3090. As can be seen, a work batch on the 2080 Ti takes ~5 ms, while the same work on the 3090 takes more than 27 ms (with strange thread sleeps between the CUDA API calls and the executions).
I know that there is a relationship between the CUDA toolkit version, the NVIDIA driver version, and the GPU model,
but I couldn't find a combination that works in the RTX 3090 case.
Technical details:
RTX 2080ti
CUDA toolkit 10.1 (update 2)
Nvidia driver 456.71
works fine (satisfies real time requirements).
RTX 3090
I tested combinations of:
CUDA toolkit 10.1 / 11.2 / 11.3
and
Nvidia driver 456.71 / 466.47 / 462.59 (studio driver)
None of them worked (the application either could not make CUDA API calls and/or did not satisfy the real-time requirements).
More details:
Both systems have:
Windows 10, 64-bit
The GPU card is connected over PCIe Gen 3 x16 and is powered by two 8-pin PSU cables.
CPU is an AMD Ryzen Threadripper 2920X.
Motherboard is a Gigabyte X399 AORUS PRO-CF.
I have tried DDU to uninstall the previously installed driver(s), and it did not help.
Has anyone encountered this weird situation? What can I do to solve the problem?
For the RTX 3090 I recommend CUDA 11.1 or newer. That was the first CUDA version that officially supported cc 8.6 GPUs.
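If it helps, a quick standalone check along these lines (a minimal sketch, not tied to your application; device 0 is assumed) can confirm which compute capability, CUDA runtime, and driver version the application actually sees on each machine:

```cpp
// Minimal diagnostic sketch: print the compute capability of device 0 plus the
// CUDA runtime and driver versions, to verify the installed combination.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed\n");
        return 1;
    }

    int runtimeVersion = 0, driverVersion = 0;
    cudaRuntimeGetVersion(&runtimeVersion);  // e.g. 11010 means CUDA 11.1
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version the driver supports

    std::printf("Device: %s (cc %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("Runtime: %d, Driver: %d\n", runtimeVersion, driverVersion);
    return 0;
}
```

On the 3090 machine you should see compute capability 8.6 and, with CUDA 11.1 or newer, a runtime version of at least 11010.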
Since you are on Windows, I want to remind you not to evaluate performance with a debug build; switch to a release build. I don't know if that is applicable here, but it is a common mistake.
As I’ve mentioned - I tried both 11.2 and 11.3 toolkit versions.
Regarding the build type - I use CMake and build a release version using the -DCMAKE_BUILD_TYPE="RelWithDebInfo" option. Do you think I should use -DCMAKE_BUILD_TYPE="Release" instead?
I tried the Release build; it did not solve the problem. I'm still getting the strange thread sleeps (now labeled "Delayed execution" in the Nsight report), which do not happen in the 2080 Ti case.
I don’t have any further suggestions that are based strictly on the difference between 2080Ti and 3090. In fact if it were me, I would reverify the observation by shutting down the machine, pulling the 3090 out, popping the 2080Ti in, and without making any other changes of any kind, rerun the test for comparison.
Anything I say at this point you can easily throw a dart at with "but why doesn't it happen on the 2080 Ti?", so I can't explain it without more debug work. For me, non-interactive profiler screenshots are of only limited value. You might get more hints by asking questions about what you are seeing in the profiler on one of the profiler forums here.
If it were me, I would:
Reverify the observation as described above
Attempt to determine whether the problem is purely at the kernel execution level, purely at the activity/timeline level, or both.
If at the kernel level, I would seek to create a single-kernel launch reproducer (see the sketch after this list), then study that between the two cases. A profiler would help to explain why the same kernel launch is taking 1 ms on the 2080 Ti but 27 ms (or whatever) on the 3090.
If at the activity level, I would be suspicious of the overall thread usage for launching work. See whether a single-threaded work-launch test case behaves similarly or not.
Seek to create the simplest possible reproducer. This often provides clues.
Ask specific questions about the profiler on a profiler forum.
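As a rough sketch of the single-kernel reproducer idea above (the kernel, problem size, and launch configuration here are placeholders; substitute one of your own kernels and representative sizes), timing each launch with CUDA events on both cards will show whether a single launch already behaves differently:

```cpp
// Rough single-kernel reproducer sketch: time several launches of one kernel
// with CUDA events. The first launch includes one-time costs (context setup,
// possibly PTX JIT compilation); later launches should be steady state.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 20;                 // placeholder problem size
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int iter = 0; iter < 5; ++iter) {
        cudaEventRecord(start);
        dummyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("launch %d: %.3f ms\n", iter, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```

If the steady-state launch times are comparable on both cards but the full application is still slow on the 3090, that would point at the work-launch/activity side rather than at the kernels themselves.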
Also, others may chime in with better suggestions. This is a community.