Large standard deviation in CUDA kernel execution times on Windows vs. Linux

Context:

My project is an image processing pipeline with an API interface so that users can integrate it into a real-time system. Frames go in as input, several CUDA kernels get executed, and then our pipeline outputs back to the application. The problem below is happening across the board to all of our CUDA kernels on Windows, but only one example is shown below.

Problem:

Windows:
System:

  • Windows 11
  • CUDA 12.3
  • NVIDIA RTX 4090

On Windows 11 with CUDA 12.3 we are experiencing a large standard deviation, with a large gap between the Min and Max CUDA kernel execution times. In addition, on Windows, if we increase the number of inputs/the number of times the kernel is run, the standard deviation grows as the number of instances grows. More specifically, it’s just the “Max” time cost that grows on Windows as the number of instances grows.

Note: Perhaps we need to test with a higher magnitude of instances/executions, but the problem/slowdown seems to scale as we call the kernel more and more times. So I wasn’t quite sure whether this is just the “loss in performance” cited here: c - Pros and cons of CUDA on Linux vs Windows? - Stack Overflow

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 1.7% | 30.943 ms | 100 | 309.429 μs | 391.051 μs | 15.008 μs | 3.175 ms | 379.568 μs | foo(…) |

Linux:
System:

  • Linux 24.04
  • CUDA 12.8
  • NVIDIA RTX A6000

On Linux 24.04 with CUDA 12.8 we are experiencing no such behavior.

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 1.6% | 6.679 ms | 100 | 66.793 μs | 66.768 μs | 66.304 μs | 67.776 μs | 284 ns | foo(…) |

Questions:
Is it more likely that I have a bug, or is this a Windows-dependent behavior?

As a sanity check: for CUDA on Windows, what is a “statistically sound/safe” number of times a function should be run to get a reliable average performance figure? Mainly to help me deduce whether I have a bug or whether this is expected behavior.

If your GPU on Windows is in WDDM mode, there are a number of factors that will probably increase the statistical variance in behavior. The thing that surprised me the most is the difference between the average and the minimum number on Windows. I would expect WDDM interference to be able to make an observation substantially longer than “typical”, but not substantially shorter. That would make me wonder whether you have data-dependent work variation in your kernel. If this is the same function, it’s also not clear how the minimum on Windows could be 4x faster than on Linux, unless the GPUs in question are radically different.


Hi Robert,

Thank you for your reply.

WDDM seems to be product dependent. Our team is using NVIDIA RTX 4090s. After a quick Google search, it doesn’t seem like there is a workaround for handling WDDM in the context of the problem we are facing, correct?

The Linux GPU is an NVIDIA RTX A6000, sorry for leaving this information out. I will edit the post.

Correct. With an RTX 4090 on Windows, WDDM is the only driver model option.


I see, thank you again!

There is no way an RTX A6000 is either 4x faster or 4x slower than an RTX 4090. Therefore, I question the comparability of the data/test cases. It’s not logical that the minimum observation in one setting could be 1/4 of the minimum observation in the other setting unless the workloads are actually different or variable. If the workloads are variable, then all the data here is meaningless.

For the same kernel, workload, and GPU, the lower bound on kernel duration on Windows should be pretty close to the lower bound on kernel duration on Linux. Windows vs. Linux, even with WDDM, does not affect the speed of device code execution. Therefore the data or underlying assumptions look suspect to me.

You don’t have the same GPU here, but the difference between RTX 4090 and RTX A6000 does not suggest to me a plausible reason for a 4x difference in lower bound observation.

If you put the RTX 4090 in the Linux machine and the RTX A6000 in the Windows machine, I believe it should be possible to put the RTX A6000 in TCC mode, which should take WDDM “disruption” out of the picture. You would need another display path for the Windows machine in that case.


On a third machine with the following specs:

  • Windows 11
  • CUDA 12.3
  • NVIDIA RTX 4090

The third machine generates the following results for the same kernel function “foo(…)”:

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 0.90% | 58.997 ms | 100 | 589.966 µs | 654.874 µs | 55.170 µs | 1.427 ms | 227.179 µs | foo(…) |

I am unsure why the lower bound on the first Windows machine is so much faster.

We can do some double checking to make sure that the test cases are the same and that there are no bugs in the test case.

How are you measuring time?
What other processes are running on the GPU for the Windows systems and the Linux system?

An exceedingly high duration is often explained by one of the following:

  1. The timing method is not capturing the asynchronous execution time the kernel spent on the GPU (see the sketch after this list).
    a. Using a CPU high-precision timer while the CUDA code is not flushing the GPU work. On WDDM, work is enqueued in a command buffer but not necessarily submitted to the KMD (kernel-mode driver). cudaEventQuery(0) has historically been used to flush the work; cudaDeviceSynchronize()/cudaStreamSynchronize() will also flush.
    b. Using CUevent/cudaEvent to time the GPU, but the cudaEventRecord for start or stop is not executed in the same command buffer as the kernel execution. See (a) for how to flush the work.

  2. The time recording includes a GPU context switch. If the Windows system’s GPU has multiple active GPU contexts plus a display, there is a high likelihood of a GPU GR engine context switch between the start and end timestamp records. The duration is often calculated as end - start, which does not remove time spent on a different context. Nsight Systems has the ability to trace GPU GR engine context switches, but it does not remove time during which a different context is executing from the duration of a grid.

  3. WDDM allows for oversubscription of memory. Depending on the timing method, the measured time may include more than the kernel itself, such as time for WDDM to page the context’s memory into GPU memory.
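
To make (1) concrete, here is a minimal, self-contained sketch of cudaEvent-based timing in which both event records are issued on the same (default) stream as the kernel, and the stop event is synchronized (which also flushes the command buffer) before the elapsed time is read. The foo kernel and sizes below are placeholders, not the pipeline’s actual kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo(float* data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record start/stop on the same (default) stream as the kernel so that
    // both events and the kernel land in the same command buffer.
    cudaEventRecord(start);
    foo<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // Synchronizing on the stop event flushes the WDDM command buffer and
    // waits until the kernel has actually finished on the GPU.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("foo: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}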

I do not trust the measurements without seeing additional information. I would recommend using Nsight Compute or Nsight Systems (you may want to look at a Windows ETW kernel trace) to see if other contexts are perturbing your results.


We measure time two different ways.

The main way is that we use nsys profile --stats=true --cuda-memory-usage=true --show-output=true <.exe> so that we can analyze performance at the kernel level. The data reported earlier is from NVIDIA Nsight Systems, “Stats System View” → “CUDA GPU Kernel Summary” (see the figure below; this data corresponds to the Linux machine).

Not reported in this post, the second way is using std::chrono so that we can also measure the cost of the CPU operations and GPU operations from a more holistic approach.

auto start_launch_foo = std::chrono::steady_clock::now();
launch_foo(...);
auto end_launch_foo = std::chrono::steady_clock::now();
std::chrono::duration<double> launch_foo_time = end_launch_foo - start_launch_foo;
total_launch_foo_time += launch_foo_time.count();
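
(For reference, the same measurement with the GPU work flushed and completed before the end timestamp would look like the sketch below; launch_foo remains our placeholder wrapper around the kernel launch, and the added cudaDeviceSynchronize() is the only change.)

auto start_launch_foo = std::chrono::steady_clock::now();
launch_foo(...);
cudaDeviceSynchronize();   // flush the command buffer and wait for the GPU work to finish
auto end_launch_foo = std::chrono::steady_clock::now();
std::chrono::duration<double> launch_foo_time = end_launch_foo - start_launch_foo;
total_launch_foo_time += launch_foo_time.count();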

Note: ideally we would like to profile both the GPU and CPU portions of the code in the same profiler, but we have not found the right tool yet.

Update:

We’ve narrowed down the problem specifically to OpenCV (default installation) on Windows 11. When a call to OpenCV is made, especially cv::imwrite(…), we observe a slowdown across all CUDA kernels. Once the OpenCV calls are omitted, the standard deviation of the time cost decreases across all kernels, and the degradation in time cost proportional to the number of frames goes away (a sketch of this check is at the end of this post).

OpenCV on Linux doesn’t behave the same way, or the cost of cv::imwrite(…) is a lot smaller/does not have conflicting overhead.

I lack the knowledge, but something leads me to believe that OpenCV and WDDM are related, based on what Greg has described in 1-3.

Note, we only use OpenCV for file read/write.
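
For reference, a minimal sketch of the kind of A/B check described above: the same per-frame loop is timed with and without the cv::imwrite(…) call (process_frame here is a placeholder stub standing in for our kernel launches, not the real pipeline code):

#include <opencv2/imgcodecs.hpp>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder stub: the real pipeline launches several CUDA kernels here.
static void process_frame(const cv::Mat& frame) { (void)frame; }

static void run(const std::vector<cv::Mat>& frames, bool write_output)
{
    for (size_t i = 0; i < frames.size(); ++i) {
        auto t0 = std::chrono::steady_clock::now();
        process_frame(frames[i]);
        cudaDeviceSynchronize();   // include GPU completion in the per-frame time
        auto t1 = std::chrono::steady_clock::now();

        // The only difference between the two runs is this file write.
        if (write_output)
            cv::imwrite("out_" + std::to_string(i) + ".png", frames[i]);

        std::chrono::duration<double, std::milli> ms = t1 - t0;
        std::printf("frame %zu (%s imwrite): %.3f ms\n",
                    i, write_output ? "with" : "without", ms.count());
    }
}

int main()
{
    cv::Mat blank = cv::Mat::zeros(1080, 1920, CV_8UC3);
    std::vector<cv::Mat> frames(10, blank);
    run(frames, /*write_output=*/false);   // baseline
    run(frames, /*write_output=*/true);    // with cv::imwrite in the loop
    return 0;
}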

If the OpenCV build has CUDA enabled (which wouldn’t seem to be necessary if you are only using it for file read/write), then I think it might be possible that starting up OpenCV creates a default context on the GPU, and this could disturb other processing that is taking place on the GPU.

Likewise, OpenCV might be doing something with graphics when it starts up, and as Greg mentioned, the creation of a graphics context on the WDDM GPU could inject a disturbance into the compute context processing.
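
As a quick way to check that, a minimal sketch (cv::cuda::getCudaEnabledDeviceCount() returns 0 for OpenCV builds without CUDA support, and cv::getBuildInformation() prints the full build configuration):

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main()
{
    // 0 here means the OpenCV build has no CUDA support (or sees no usable device).
    std::cout << "CUDA-enabled devices seen by OpenCV: "
              << cv::cuda::getCudaEnabledDeviceCount() << "\n";

    // The full build configuration, including whether the CUDA modules were compiled in.
    std::cout << cv::getBuildInformation() << std::endl;
    return 0;
}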


For a controlled experiment (in which only one variable is changed), you could move the NVIDIA RTX A6000 to your Windows system and operate it with the TCC driver (in which case you will need a second GPU to serve the GUI, but even a low-end GPU will serve just fine for that).

If that leads to observed behavior more similar to that seen on Linux, it is a strong indication that WDDM is the root cause of the issue.

Generally speaking, I would advise against deploying consumer cards using the WDDM driver on Windows for professional applications with tight performance and/or performance-variability constraints, such as your soft real-time environment. Due to the lack of timing guarantees throughout their software stack (all work is performed “best effort”), GPUs are at present not suited to environments with hard real-time requirements.
