Large standard deviation in CUDA kernel execution times on Windows vs. Linux

Context:

My project is an image processing pipeline with an API interface so that users can integrate it into a real-time system. Frames go in as input, several CUDA kernels get executed, and then our pipeline outputs back to the application. The problem below is happening across the board to all of our CUDA kernels on Windows, but only one example is shown below.

Problem:

Windows:
System:

  • Windows 11
  • CUDA 12.3
  • NVIDIA RTX 4090

On Windows 11 with CUDA 12.3 we are experiencing a large standard deviation, with a large gap between the Min and Max CUDA kernel execution times. In addition, on Windows, if we increase the number of inputs/the number of times the kernel is run, the standard deviation grows as the number of instances grows. More specifically, it’s just the “Max” time cost that grows on Windows as the number of instances grows.

Note: Perhaps we need to test with a higher magnitude of instances/executions, but the problem/slowdown seems to scale as we call the kernel more and more times. So I wasn’t quite sure whether this is just the “loss in performance” cited here: c - Pros and cons of CUDA on Linux vs Windows? - Stack Overflow

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 1.7% | 30.943 ms | 100 | 309.429 μs | 391.051 μs | 15.008 μs | 3.175 ms | 379.568 μs | foo(…) |

Linux:
System:

  • Linux 24.04
  • CUDA 12.8
  • NVIDIA RTX A6000

On Linux 24.04 with CUDA 12.8 we are experiencing no such behavior.

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 1.6% | 6.679 ms | 100 | 66.793 μs | 66.768 μs | 66.304 μs | 67.776 μs | 284 ns | foo(…) |

Questions:
Is it more likely that I have a bug, or is this a Windows-dependent behavior?

As a sanity check: for CUDA on Windows, what is a “statistically sound/safe” number of times a function should be run to get a reliable average performance figure? Mainly to help me deduce whether I have a bug or whether this is expected behavior.

If your GPU on Windows is in WDDM mode, there are a number of factors that will probably increase the statistical variance in behavior. The thing that surprised me the most is the difference between the average and the minimum number on Windows. I would expect WDDM interference to be able to make an observation substantially longer than “typical”, but not substantially shorter. That would make me wonder whether you have data-dependent work variation in your kernel. If this is the same function, it’s also not clear how the minimum on Windows could be 4x faster than on Linux, unless the GPUs in question are radically different.


Hi Robert,

Thank you for your reply.

WDDM seems to be product dependent. Our team is using NVIDIA RTX 4090s. After a quick Google search, it doesn’t seem like there is a workaround for handling WDDM in the context of the problem we are facing, correct?

The Linux GPU is an NVIDIA RTX A6000, sorry for leaving this information out. I will edit the post.

Correct. With an RTX 4090 on Windows, WDDM is the only driver model option.


I see, thank you again!

There is no way an RTX A6000 is either 4x faster or 4x slower than an RTX 4090. Therefore, I question the comparability of the data/test cases. It’s not logical that the minimum observation in one setting could be 1/4 of the minimum observation in the other setting unless the workloads are actually different or variable. If the workloads are variable, then all the data here is meaningless.

For the same kernel, workload, and GPU, the lower bound on kernel duration on Windows should be pretty close to the lower bound on kernel duration on Linux. Windows vs. Linux, even with WDDM, does not affect the speed of device code execution. Therefore the data or underlying assumptions look suspect to me.

You don’t have the same GPU here, but the difference between RTX 4090 and RTX A6000 does not suggest to me a plausible reason for a 4x difference in lower bound observation.

If you put the RTX 4090 in the Linux machine and the RTX A6000 in the Windows machine, I believe it should be possible to put the RTX A6000 in TCC mode, which should take WDDM “disruption” out of the picture. You would need another display path for the Windows machine in that case.


On a third machine with the following specs:

  • Windows 11
  • CUDA 12.3
  • NVIDIA RTX 4090

The third machine generates the following results for the same kernel function “foo(…)”:

| Time | Total Time | Instances | Avg | Med | Min | Max | StdDev | Name |
|------|------------|-----------|-----|-----|-----|-----|--------|------|
| 0.90% | 58.997 ms | 100 | 589.966 µs | 654.874 µs | 55.170 µs | 1.427 ms | 227.179 µs | foo(…) |

I am unsure why the lower bound on the first Windows machine is so much faster.

We can do some double checking to make sure that the test cases are the same and that there are no bugs in the test case.

How are you measuring time?
What other processes are running on the GPU for the Windows systems and the Linux system?

An exceedingly high duration is often explained by one of the following:

  1. The timing method is not capturing the asynchronous execution time the kernel spent on the GPU (see the sketch after this list).
    a. Using a CPU high-precision timer while the CUDA code is not flushing the GPU work. On WDDM, work is enqueued in a command buffer but not necessarily submitted to the KMD (kernel-mode driver). cudaEventQuery(0) has historically been used to flush the work; cudaDeviceSynchronize()/cudaStreamSynchronize() will also flush.
    b. Using CUevent/cudaEvent to time the GPU, but the cudaEventRecord for start or stop is not executed in the same command buffer as the kernel execution. See (a) for how to flush the work.

  2. The time recording includes a GPU context switch. If the Windows system’s GPU has multiple active GPU contexts plus a display, there is a high likelihood of a GPU GR engine context switch between the start and end timestamp records. The duration is often calculated as end - start, which does not remove time spent on a different context. Nsight Systems has the ability to trace GPU GR engine context switches, but it does not remove time during which a different context is executing from the duration of a grid.

  3. WDDM allows for oversubscription of memory. Depending on the timing method, the measured time may include more than the kernel itself, such as time for WDDM to page the context’s memory into GPU memory.
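
To make (1) concrete, here is a minimal, self-contained sketch of cudaEvent-based timing in which both event records are issued on the same (default) stream as the kernel, and the stop event is synchronized (which also flushes the command buffer) before the elapsed time is read. The foo kernel and sizes below are placeholders, not the pipeline’s actual kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo(float* data, int n)   // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Record start/stop on the same (default) stream as the kernel so that
    // both events and the kernel land in the same command buffer.
    cudaEventRecord(start);
    foo<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // Synchronizing on the stop event flushes the WDDM command buffer and
    // waits until the kernel has actually finished on the GPU.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("foo: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}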

I do not trust the measurements without seeing additional information. I would recommend using Nsight Compute or Nsight Systems (you may want to look at a Windows ETW kernel trace) to see if other contexts are perturbing your results.


We measure time two different ways.

The main way is that we use nsys profile --stats=true --cuda-memory-usage=true --show-output=true <.exe> so that we can analyze performance at the kernel level. The data reported earlier is from NVIDIA Nsight Systems, “Stats System View” → “CUDA GPU Kernel Summary” (see the figure below; this data corresponds to the Linux machine).

Not reported in this post, the second way is using std::chrono so that we can also measure the cost of the CPU operations and GPU operations from a more holistic approach.

auto start_launch_foo = std::chrono::steady_clock::now();
launch_foo(...);
auto end_launch_foo = std::chrono::steady_clock::now();
std::chrono::duration<double> launch_foo_time = end_launch_foo - start_launch_foo;
total_launch_foo_time += launch_foo_time.count();
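
(For reference, the same measurement with the GPU work flushed and completed before the end timestamp would look like the sketch below; launch_foo remains our placeholder wrapper around the kernel launch, and the added cudaDeviceSynchronize() is the only change.)

auto start_launch_foo = std::chrono::steady_clock::now();
launch_foo(...);
cudaDeviceSynchronize();   // flush the command buffer and wait for the GPU work to finish
auto end_launch_foo = std::chrono::steady_clock::now();
std::chrono::duration<double> launch_foo_time = end_launch_foo - start_launch_foo;
total_launch_foo_time += launch_foo_time.count();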

Note: ideally we would like to profile both the GPU and CPU portions of the code in the same profiler, but we have not found the right tool yet.

Update:

We’ve narrowed down the problem specifically to OpenCV (default installation) on Windows 11. When a call to OpenCV is made, especially cv::imwrite(…), we observe a slowdown across all CUDA kernels. Once the OpenCV calls are omitted, the standard deviation of the time cost decreases across all kernels, and the degradation in time cost proportional to the number of frames goes away (a sketch of this check is at the end of this post).

OpenCV on Linux doesn’t behave the same way, or the cost of cv::imwrite(…) is a lot smaller/does not have conflicting overhead.

I lack the knowledge, but something leads me to believe that OpenCV and WDDM are related, based on what Greg has described in 1-3.

Note, we only use OpenCV for file read/write.
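
For reference, a minimal sketch of the kind of A/B check described above: the same per-frame loop is timed with and without the cv::imwrite(…) call (process_frame here is a placeholder stub standing in for our kernel launches, not the real pipeline code):

#include <opencv2/imgcodecs.hpp>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

// Placeholder stub: the real pipeline launches several CUDA kernels here.
static void process_frame(const cv::Mat& frame) { (void)frame; }

static void run(const std::vector<cv::Mat>& frames, bool write_output)
{
    for (size_t i = 0; i < frames.size(); ++i) {
        auto t0 = std::chrono::steady_clock::now();
        process_frame(frames[i]);
        cudaDeviceSynchronize();   // include GPU completion in the per-frame time
        auto t1 = std::chrono::steady_clock::now();

        // The only difference between the two runs is this file write.
        if (write_output)
            cv::imwrite("out_" + std::to_string(i) + ".png", frames[i]);

        std::chrono::duration<double, std::milli> ms = t1 - t0;
        std::printf("frame %zu (%s imwrite): %.3f ms\n",
                    i, write_output ? "with" : "without", ms.count());
    }
}

int main()
{
    cv::Mat blank = cv::Mat::zeros(1080, 1920, CV_8UC3);
    std::vector<cv::Mat> frames(10, blank);
    run(frames, /*write_output=*/false);   // baseline
    run(frames, /*write_output=*/true);    // with cv::imwrite in the loop
    return 0;
}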

If the OpenCV build has CUDA enabled (which wouldn’t seem to be necessary if you are only using it for file read/write), then I think it might be possible that starting up OpenCV creates a default context on the GPU, and this could disturb other processing that is taking place on the GPU.

Likewise, OpenCV might be doing something with graphics when it starts up, and as Greg mentioned, the creation of a graphics context on the WDDM GPU could inject a disturbance into the compute context processing.
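
As a quick way to check that, a minimal sketch (cv::cuda::getCudaEnabledDeviceCount() returns 0 for OpenCV builds without CUDA support, and cv::getBuildInformation() prints the full build configuration):

#include <opencv2/core.hpp>
#include <opencv2/core/cuda.hpp>
#include <iostream>

int main()
{
    // 0 here means the OpenCV build has no CUDA support (or sees no usable device).
    std::cout << "CUDA-enabled devices seen by OpenCV: "
              << cv::cuda::getCudaEnabledDeviceCount() << "\n";

    // The full build configuration, including whether the CUDA modules were compiled in.
    std::cout << cv::getBuildInformation() << std::endl;
    return 0;
}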


For a controlled experiment (in which only one variable is changed), you could move the NVIDIA RTX A6000 to your Windows system and operate it with the TCC driver (in which case you will need a second GPU to serve the GUI, but even a low-end GPU will serve just fine for that).

If that leads to observed behavior more similar to that seen on Linux, it is a strong indication that WDDM is the root cause of the issue.

Generally speaking, I would advise against deploying consumer cards using the WDDM driver on Windows for professional applications with tight performance and/or performance-variability constraints, such as your soft real-time environment. Due to the lack of timing guarantees throughout their software stack (all work is performed “best effort”), GPUs are at present not suited to environments with hard real-time requirements.
