Performance drop when turning off desktop GUI during CUDA kernel execution

Hello everyone,

I’m currently testing the performance of a CUDA kernel for regex matching. Interestingly, I observed the following:

  • When the desktop GUI is running (sudo systemctl start gdm), the performance is 8.5 MB/s.
  • After disabling the GUI (sudo systemctl stop gdm), the performance drops to 7.9 MB/s.

I’ve tried fixing the GPU frequency to 1695 MHz using the following commands, but the performance difference still persists:

export CUDA_VISIBLE_DEVICES=0
sudo nvidia-smi -pm 1
sudo nvidia-smi -i 0 -lgc 1695
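
The clock lock can be verified during a run by watching the reported SM clock live, e.g. with a query along these lines:

nvidia-smi -i 0 --query-gpu=clocks.sm --format=csv -l 1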

System setup:

  • GPU: NVIDIA RTX 3090
  • NVCC version: 12.0
  • Driver version: 550.135

Any ideas on what could be causing this discrepancy? Thanks!

Can you confirm that the execution time is truly bi-modal? That is, when you take repeated measurements (say, ten repetitions) while toggling the gdm status, do you always observe the performance figures stated above? For example:

sudo systemctl start gdm
[run code under test]
sudo systemctl stop gdm
[run code under test]
sudo systemctl start gdm
[run code under test]
sudo systemctl stop gdm
[run code under test]

sudo systemctl start gdm
[run code under test]
sudo systemctl stop gdm
[run code under test]

It is not clear how you measure performance (for example, which activities are included in or excluded from the timed portion) or what the nature of the code being executed is. There could be quite a bit of noise in the measurements.

For example, memory-intensive code usually exhibits more timing variability than compute-intensive code. A common “trick” to address that is to run the kernel ten times and record the fastest time.

It could also be that timing includes various initialization overhead that may be smaller when gdm is loaded.
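
For the run-N-and-keep-the-fastest approach, a rough sketch of such a harness (not your code; the kernel and launch configuration below are just placeholders) could look like this:

#include <cstdio>
#include <cfloat>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the code under test.
__global__ void kernel() { }

int main() {
    const int N = 10;              // number of timed repetitions
    dim3 grid(1), block(1);        // placeholder launch configuration

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warmup run, excluded from the timed measurements.
    kernel<<<grid, block>>>();
    cudaDeviceSynchronize();

    float best_ms = FLT_MAX;
    for (int i = 0; i < N; ++i) {
        cudaEventRecord(start, 0);
        kernel<<<grid, block>>>();
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) best_ms = ms;   // keep the fastest run
    }
    printf("fastest of %d runs: %.3f ms\n", N, best_ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}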

Hi, here are the results from repeating the kernel execution with and without gdm. The issue persists.

input: 600 MB, elapsed time: 72.2054 seconds, throughput = 8.30963 MB/s
> sudo systemctl stop gdm
input: 600 MB, elapsed time: 75.3617 seconds, throughput = 7.96161 MB/s
> sudo systemctl start gdm
input: 600 MB, elapsed time: 71.0216 seconds, throughput = 8.44813 MB/s
> sudo systemctl stop gdm
input: 600 MB, elapsed time: 75.928 seconds, throughput = 7.90222 MB/s
> sudo systemctl start gdm
input: 600 MB, elapsed time: 70.9346 seconds, throughput = 8.4585 MB/s
> sudo systemctl stop gdm
input: 600 MB, elapsed time: 76.5986 seconds, throughput = 7.83304 MB/s
> sudo systemctl start gdm
input: 600 MB, elapsed time: 71.1352 seconds, throughput = 8.43464 MB/s
> sudo systemctl stop gdm
input: 600 MB, elapsed time: 75.0668 seconds, throughput = 7.99288 MB/s
> sudo systemctl start gdm
input: 600 MB, elapsed time: 71.036 seconds, throughput = 8.44642 MB/s
> sudo systemctl stop gdm
input: 600 MB, elapsed time: 75.837 seconds, throughput = 7.91171 MB/s

Here are more details about the execution:

The program being tested is a GPU regex matching program with a single kernel. It is memory-intensive code and it can saturate the GPU’s SMs. The timing mechanism is as follows:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

// only the kernel launch is inside the timed region;
// host-device copies and other setup are not timed
kernel<<<grid, block>>>(args);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float milliseconds = 0.0f;
cudaEventElapsedTime(&milliseconds, start, stop);

The observed behavior might seem counterintuitive, as I initially believed that terminating all processes consuming GPU resources (such as those related to gdm) would be beneficial.

If the GUI just sits there statically while your CUDA code is running, the only negative impact it has is reducing the amount of GPU memory available to CUDA. The amount of memory bandwidth and computation bandwidth consumed by an idle GUI desktop is (to first order) zero.
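
How much memory the desktop actually occupies can be checked by comparing the reported usage with and without gdm, e.g.:

nvidia-smi --query-gpu=memory.used,memory.free --format=csv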

Given that your observation appears to be stable under repetition, you might want to check whether it is also reproducible under the profiler; if so, the profiler statistics should tell you what’s behind the performance difference.
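
For example (the binary name below is just a placeholder), capturing one kernel run in each configuration with Nsight Compute and comparing the two reports would show which metrics move:

ncu --set full -o with_gdm ./regex_match
ncu --set full -o without_gdm ./regex_match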

In any kind of multi-stage memory hierarchy with caches, TLBs, multiple memory channels, etc., it can happen that differences in the location of memory allocations alone have a noticeable impact on performance, due to differing amounts of constructive and destructive interference within the memory hierarchy.

Presumably the data used by your CUDA kernel lands on different physical addresses when running with and without gdm present. This is just a hypothesis. It may explain the entire performance difference, part of the difference, or have no impact whatsoever.

Thank you very much.

Have you included warmup calls of the kernel in your benchmark program (before the cudaEvent calls)?

Yes, I ran the kernel once for warmup.

I also timed the warmup run: since the kernel executes for over a minute, any initialization overhead has little impact, and the warmup takes nearly the same time as the subsequent timed execution.