I have an algorithm running on CUDA (no AI, no tensors involved)
Its performance is currently limited under Windows by the WDDM mode of my RTX 3090.
I know that TCC mode is available on Tesla cards, but what I need most is CUDA cores and high memory bandwidth, not tensor cores.
I can’t make sense of NVIDIA’s commercial lineup and can’t identify by myself which product would fit best (considering that Quadro cards are orders of magnitude more expensive than GeForce).
I am sorry to post such a question on a technical forum, but who should I ask instead?
- Do you need integer, FP32, or FP64 arithmetic?
- Do you need a large L2 cache (you said memory bandwidth was important)?
- Do you need ECC memory or other reliability features (for 24h use)?
- Do you need a certain memory size?
- Do you have a maximum power consumption/heat budget?
- Should the work be handled by one GPU, or possibly more than one (e.g. can you independently calculate different parts of your algorithm)?
- Do you need data transfer to other PCIe cards, e.g. over the network, data grabbers, FPGAs, SSD drives, …?
What aspect of your performance does WDDM limit?
BTW: Many algorithms can be rewritten to make use of tensor cores, even if they do not use them now.
I need FP32, not especially FP64
I don’t really use NPP (I have coded almost all the CUDA kernels myself)
I don’t need reliability features
I don’t need a huge memory size (a few GB is enough)
my algorithm already supports multiple GPUs (and is able to route and sync work between them)
my multi-GPU mode is efficient (performance close to N× for N GPUs)
My bandwidth requirements are (see the sketch after this list):
- feeding the GPU with some data from host memory
- transferring data between GPUs
- getting the resulting data from the GPU back to host memory
(no need to transfer data elsewhere)
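For illustration, here is a minimal sketch of the three kinds of transfers I mean. This is not my actual code: the buffer names, sizes, and the two-GPU setup are placeholders, and it assumes devices 0 and 1 can use peer-to-peer access over PCIe.

```cpp
// Minimal sketch of the three transfer patterns, not the actual application code.
// Buffer names and sizes are placeholders; assumes two GPUs (0 and 1).
#include <cuda_runtime.h>
#include <cstdio>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) \
    printf("CUDA error: %s (line %d)\n", cudaGetErrorString(e), __LINE__); } while (0)

int main() {
    const size_t bytes = 16 << 20;               // 16 MB of example data
    float *h_in, *h_out, *d0, *d1;

    CHECK(cudaMallocHost((void**)&h_in,  bytes)); // pinned host memory for async copies
    CHECK(cudaMallocHost((void**)&h_out, bytes));

    CHECK(cudaSetDevice(1));
    CHECK(cudaMalloc((void**)&d1, bytes));

    CHECK(cudaSetDevice(0));
    CHECK(cudaMalloc((void**)&d0, bytes));
    CHECK(cudaDeviceEnablePeerAccess(1, 0));      // direct GPU0 -> GPU1 copies, if supported
    cudaStream_t s0;
    CHECK(cudaStreamCreate(&s0));

    // 1) feed the GPU with input data from host memory
    CHECK(cudaMemcpyAsync(d0, h_in, bytes, cudaMemcpyHostToDevice, s0));
    // ... kernels on device 0 would run here ...

    // 2) transfer intermediate data between GPUs
    CHECK(cudaMemcpyPeerAsync(d1, 1, d0, 0, bytes, s0));
    CHECK(cudaStreamSynchronize(s0));
    // ... kernels on device 1 would run here ...

    // 3) get resulting data from GPU memory back to host memory
    CHECK(cudaSetDevice(1));
    CHECK(cudaMemcpy(h_out, d1, bytes, cudaMemcpyDeviceToHost));

    CHECK(cudaFreeHost(h_in));  CHECK(cudaFreeHost(h_out));
    CHECK(cudaSetDevice(0));    CHECK(cudaFree(d0));
    CHECK(cudaSetDevice(1));    CHECK(cudaFree(d1));
    return 0;
}
```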
About the WDDM limitation:
After investigation, I identified that I was limited by the latency of launching kernels. I already use streams, and CUDA graphs won’t help with my algorithm’s structure.
With an NVIDIA T1000 (which supports TCC), I verified that switching from WDDM to TCC unleashes the performance. But the raw CUDA performance of my T1000 is still far below that of my RTX 3090.
This CUDA-based engine is part of a solution deployed on Windows only
Is PCIe enough or do you need NVLink? What about L2 cache size (do you access data in device memory multiple times or just once)? Are two half-speed cards as good as one full-speed card for your algorithm (since your performance scales linearly)? What is the limiting factor? Price?
Latency to launch kernels is more CPU-driven than GPU-driven.
Kernel execution, of course, is faster with more SMs.
How many graphics cards can your motherboard handle?
You can get 6 RTX 3060 or 6 RTX 4060 for the price of 1 RTX 3090.
Does the platform absolutely have to be Windows? Could you switch to Linux?
How important is the GPU bandwidth? There are not all that many GPUs in NVIDIA’s professional line that can meet or exceed the memory bandwidth of the RTX 3090 (936.2 GB/s). If this is a hard requirement, you would basically be looking at something like the RTX 6000 Ada (960.0 GB/s) and at that point you are really going to feel the financial hurt.
Have you considered acquiring a government surplus, 2nd-hand, or refurbished GPU?
Was the TCC vs WDDM comparison performed on the same host platform? I have a Windows machine where one GPU runs with TCC and the other with WDDM, and while the performance difference can be measured, it is not very pronounced: the average kernel launch overhead is similar, but with WDDM the overhead varies widely and can reach up to 10x the average in the worst case. This is an older Skylake-W based host system.
PCIe is OK
Not an easy answer. I would have to test in order to observe how L2 size really impacts the overall speed, but see below: I don’t think it matters a lot.
The idea is that the more GPUs the user can run, the better the speed on their host machine. I don’t target a particular speed: I just want to unleash the maximum speed for a given configuration.
I have a fully pipelined process (on a cudaStream). It runs on a live input data stream. In my reference test configuration, it runs at 950 executions/second with an RTX 3090 under Windows (WDDM mode). The limiting factor is how many times per second I can launch the process, which itself lasts less than 1 ms.
With a T1000, the same configuration runs at ~100 executions/second. But if I set the T1000 to TCC mode (with nvidia-smi), it goes much faster (I can’t remember exactly, but something like 600 executions/second).
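To give an idea of the structure, the loop looks roughly like the sketch below. This is not my actual code: the real kernels are replaced by a trivial dummy kernel, and kernelsPerIteration is a placeholder value; the point is only that the rate is bounded by how fast iterations can be issued and completed.

```cpp
// Minimal stand-in for the benchmark loop (placeholder kernel and counts).
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void dummyKernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 1.0001f;                          // trivial placeholder work
}

int main() {
    const int kernelsPerIteration = 20;          // placeholder value
    const int iterations = 10000;

    float* d_data;
    cudaMalloc((void**)&d_data, 256 * 256 * sizeof(float));
    cudaMemset(d_data, 0, 256 * 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    auto t0 = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        for (int k = 0; k < kernelsPerIteration; ++k)
            dummyKernel<<<256, 256, 0, stream>>>(d_data);
        cudaStreamSynchronize(stream);           // one "execution" of the pipeline
    }
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("%.0f executions/second\n", iterations / seconds);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    return 0;
}
```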
So I would like to get a GPU with at least the RTX 3090’s core performance, but not limited by the kernel launch time (the WDDM bottleneck). I had written another post about this that nobody could answer.
Since NVIDIA is continuously rebranding, using names like Tesla/Quadro that clash with RTX terminology, and does not explicitly mention TCC/WDDM in the board specifications, it’s really hard to know which GPU I should focus on. Data-center GPUs are really expensive, and it would be a pity if no other GPU could be TCC-capable the way the T1000 is.
Not to mention that TCC is marked as deprecated in that doc, and this is the only reference to such a deprecation.
In the short term: Windows only.
See my other answer in the same thread.
How sure are you of this result? I cannot think of any scenario or plausible mechanism where WDDM vs TCC would cause a 6x performance difference. I would suggest re-doing this experiment before pursuing the acquisition of a TCC-enabled GPU with performance equivalent to the RTX 3090.
What is the host platform being used? That will have an influence on host-side overhead in the WDDM driver. On a reasonably current x86-64 platform, you might observe something like 2.5 microsecond kernel launch overhead with TCC, 3.5 microsecond average kernel launch overhead with WDDM (with occasional peaks of up to 20 microsecond kernel launch overhead). With a kernel execution time of, say, 0.5 milliseconds = 500 microseconds, I would not expect very significant performance difference at application level.
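If you want to put a number on this on your own system, a minimal sketch like the one below estimates the per-launch cost with an empty kernel. The figure it reports includes the (negligible) execution time of the empty kernel, and launch batching under WDDM means it only gives a ballpark.

```cpp
// Rough estimate of per-launch cost: time many back-to-back launches of an
// empty kernel and divide. Results vary with driver model, driver version, host.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void emptyKernel() {}

int main() {
    const int launches = 100000;

    emptyKernel<<<1, 1>>>();                     // warm-up: context creation, module load
    cudaDeviceSynchronize();

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < launches; ++i)
        emptyKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    printf("average cost per launch: %.2f microseconds\n", us / launches);
    return 0;
}
```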
I am pretty confident that there was a huge gain, while the only difference was the driver mode.
But you’re right, I should give precise figures.
I’ll test again ASAP (but not before a few days/weeks : I don’t have access to the hardware for now)
That was Windows 11 up-to-date.
I was interested in the hardware configuration of the host system; I should have said so.
The overhead of the WDDM driver versus TCC is mostly layers and layers of software overhead (so the OS can be in complete control of the GPU). Code executing on the host. Slow host CPU → increased WDDM overhead. Not to the tune of 6x application-level performance difference, though.
Did you mean “the time to launch a kernel on the GPU” from within a running process? If so, that should be in the single-digit microseconds, meaning you can do that hundreds of thousands of times per second.
If you actually mean you are starting a new process for every single use of the GPU, the CUDA start-up overhead could be the performance killer. If so, I couldn’t offhand tell you how that start-up overhead differs between the TCC and WDDM drivers; I don’t recall ever measuring that.
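For what it is worth, a quick sketch like the following, built and run as its own executable, times just the context creation (using the common cudaFree(0) idiom to force CUDA initialization):

```cpp
// Times CUDA start-up cost in a fresh process.
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                                 // forces CUDA context initialization
    auto t1 = std::chrono::steady_clock::now();

    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("CUDA context initialization: %.1f ms\n", ms);
    return 0;
}
```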
In any event, here is a quick overview of potential candidates, with data taken from the TechPowerUp database (which I find to be quite reliable, but do double-check before purchasing a GPU):
RTX 3090 (Ampere, CC 8.6): 936.2 GB/s, 35.58 TFLOPS FP32  (reference)

Ampere (CC 8.6):
  RTX A5000     768.0 GB/s    27.77 TFLOPS FP32
  RTX A5500     768.0 GB/s    34.10 TFLOPS FP32
  RTX A6000     768.0 GB/s    38.71 TFLOPS FP32

Ada Lovelace (CC 8.9):
  RTX 4500 Ada  432.0 GB/s    39.63 TFLOPS FP32
  RTX 5000 Ada  576.0 GB/s    65.28 TFLOPS FP32
  RTX 5880 Ada  864.0 GB/s    69.27 TFLOPS FP32
  RTX 6000 Ada  960.0 GB/s    91.06 TFLOPS FP32
Then have a separate process that calls CUDA continuously running in the background (the other programs communicate with it and request computations), and try to optimize the Windows system for faster process start-up and multitasking between processes.
The speed probably has less to do with the GPU.
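A minimal sketch of that idea, not a finished design: a simple in-process queue stands in for whatever IPC mechanism (pipes, sockets, shared memory) the client programs would actually use, and placeholderKernel stands in for the real computation.

```cpp
// Persistent CUDA worker sketch: initialize CUDA once, stay alive, service requests.
#include <cuda_runtime.h>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <cstdio>

__global__ void placeholderKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

std::queue<int> requests;                        // request IDs from "clients"
std::mutex m;
std::condition_variable cv;
bool shuttingDown = false;

void workerLoop() {
    cudaFree(0);                                 // pay the CUDA start-up cost exactly once
    const int n = 1 << 20;
    float* d_data;
    cudaMalloc((void**)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    for (;;) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !requests.empty() || shuttingDown; });
        if (requests.empty() && shuttingDown) break;
        int id = requests.front();
        requests.pop();
        lock.unlock();

        placeholderKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaDeviceSynchronize();                 // request is done when the GPU work finishes
        printf("request %d done\n", id);
    }
    cudaFree(d_data);
}

int main() {
    std::thread worker(workerLoop);

    for (int i = 0; i < 5; ++i) {                // simulate a few incoming requests
        { std::lock_guard<std::mutex> lock(m); requests.push(i); }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lock(m); shuttingDown = true; }
    cv.notify_one();

    worker.join();
    return 0;
}
```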
I ran some quick experiments to see whether CUDA initialization is faster with the TCC driver compared to WDDM. The results are inconsistent and inconclusive. It seems that other factors dominate. Even considering that I am running on somewhat older system hardware, it would appear that a CUDA initialization time of significantly less than 100 milliseconds would generally be hard to achieve. In light of this, I am not sure what the OP’s “1 ms” time refers to.
A little misunderstanding here: I used the words “CUDA process” to mean “some code execution using CUDA”; it is in no way an OS process with a PID.
To be clear, I have a single OS process (my program) that calls my CUDA-based algorithm in a loop (~950 iterations/s on my RTX 3090).
Then see @njuffa’s post:
On a reasonably current x86-64 platform, you might observe something like 2.5 microsecond kernel launch overhead with TCC, 3.5 microsecond average kernel launch overhead with WDDM (with occasional peaks of up to 20 microsecond kernel launch overhead).
So the launch overhead with either TCC or WDDM should be less relevant in your case, unless your algorithm launches a lot of kernels on each iteration (at ~950 iterations/s each iteration lasts roughly 1 ms, so at a few microseconds per launch the overhead only becomes significant if each iteration issues dozens of kernels).
I have just performed a new test, and I confirm what I said.
Please find below:
- test 1: “v2 live frozen benchmark 1024x1024”, a test where processing is prominent (kernel launch overhead is irrelevant relative to processing time)
- test 2: “v2 live frozen benchmark 256x256”, a test where processing is very light (kernel launch overhead is not negligible at all)
- a CPU-Z report about the current hardware
Test 2 really shows a huge difference between WDDM and TCC mode.
(Please note this host machine runs Windows 10, while my initial observations were on Windows 11, but it does not really matter here.)
v2 live frozen benchmark 1024x1024:
- WDDM: ~98 process/s
- TCC: ~103 process/s

v2 live frozen benchmark 256x256:
- WDDM: ~950 process/s
- TCC: ~1530 process/s
CPU-Z export : HOST.zip (17.3 KB)
Not to be petty but a performance factor of 1.6x is not the same as a performance factor of 6x claimed earlier.
My consistent observations over many years are that kernel launch overhead is generally in the single digit microseconds. This includes the average kernel launch overhead with WDDM. To cause the relative overhead indicated by the second test run for the WDDM configuration, the kernel execution time would have to be extremely short, on the order of 10 microseconds or so. You could use the CUDA profiler to measure the kernel execution time and thus confirm or refute this hypothesis.
Based on the attached report from CPU-Z, your host system appears to be a very modern, high-frequency, low-latency system based on Intel Xeon w5-3425. I would therefore expect kernel launch overhead to be in the very low single-digit microsecond range on this system, similar to the example numbers I gave earlier.
I already provided a list of GPUs from NVIDIA’s professional line that are in the ballpark of the RTX 3090 performance-wise (by one metric or other) that you could research in more detail to determine the best GPU for your needs. Good luck!

Not to be petty but a performance factor of 1.6x is not the same as a performance factor of 6x claimed earlier.
Absolutely. At the time of writing the original post, I could not remember the exact figures produced by my myriad of different tests. But 1.6x is still very significant!

My consistent observations over many years are that kernel launch overhead is generally in the single digit microseconds (…) Based on the attached report from CPU-Z, your host system appears to be a very modern, high-frequency, low-latency system based on Intel Xeon w5-3425. I would therefore expect kernel launch overhead to be in the very low single-digit microsecond range on this system, similar to the example numbers I gave earlier.
Exactly, that’s why that limit of 950 process/s is really a problem for me.
But it proves that TCC mode makes a real difference in some real-life scenarios.
I already provided a list of GPUs from NVIDIA’s professional line that are in the ballpark of the RTX 3090 performance-wise (by one metric or other) that you could research in more detail to determine the best GPU for your needs. Good luck!
Yep, that’s on me now. But I hope that NVIDIA will work on MCDM and low-latency modes for GeForce…
Does your ‘process’ start one or several kernels?