[550.67] Nvidia Vulkan ICD wakes up dgpu on initialization and exit

jrelvas · April 1, 2024, 7:25pm

The Vulkan ICD provided by Nvidia’s driver always wakes up the dgpu on hybrid graphics systems, even if the GPU chosen by the program is not the Nvidia dgpu.

When running a Vulkan program, its launch will be delayed for several seconds, while the card is resumed. Once the program launches, the card goes back to sleep after a while:

Exiting the program also requires the card to wake up, resulting in another delay:

This affects both x11 and wayland programs:

The only workaround is to disable the nvidia ICD, as seen here:

(Note that the nouveau_icd was also disabled in other tests, so it’s not relevant.)
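For readers who want to try the workaround from code, here is a minimal sketch of one way to keep the loader away from the nvidia ICD: point it at a single ICD manifest through the `VK_DRIVER_FILES` environment variable (and the older, deprecated `VK_ICD_FILENAMES`). This is not necessarily how the ICD was disabled in the tests above. The function name `restrict_vulkan_icd` is hypothetical, and the manifest path you pass (for example a Mesa manifest under `/usr/share/vulkan/icd.d/`) depends on your distribution.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

/* Hypothetical helper: restrict the Vulkan loader to one ICD manifest.
 * Sets VK_DRIVER_FILES (newer loaders) and VK_ICD_FILENAMES (older,
 * deprecated name) so that no other ICD is loaded.
 * Returns 0 on success, -1 on a bad argument or if setenv fails. */
int restrict_vulkan_icd(const char *manifest_path)
{
    if (manifest_path == NULL || manifest_path[0] == '\0')
        return -1;
    if (setenv("VK_DRIVER_FILES", manifest_path, 1) != 0)
        return -1;
    if (setenv("VK_ICD_FILENAMES", manifest_path, 1) != 0)
        return -1;
    return 0;
}
```

The call has to happen before the first vkCreateInstance in the process, or before exec'ing the program you want to launch, since the loader reads these variables when it scans for drivers.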

Symptoms are very similar to those found in this egl-wayland issue, but it’s with the Vulkan implementation instead of EGL:

github.com/NVIDIA/egl-wayland

egl needs an early out to prevent waking the dGPU unnecessarily

opened 09:55AM - 29 Sep 23 UTC

flukejones

On the last two/three years of hybrid laptops, notably Nvidia RTX20xx++ onwards …these machines tend to have a better/deeper suspend function which puts the dgpu into a very low power state when unused. Combined with glvnd, this introduces a lag of 1-2 seconds while the dgpu wakes in response to queries, even if it remains unused and the iGPU is used instead. For example opening Nautilus file manager is delayed 1-2s while the dGPU wakes. For a lot of apps that use glvnd this ends up being a bad UX. A lot of folks are working around this with __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/50_mesa.json. I reported this [here](https://gitlab.freedesktop.org/glvnd/libglvnd/-/issues/240#note_2090101) some time ago

Here’s some additional system details:

Machine: Thinkpad P1 Gen 6
- CPU: Intel Core i7-13800H
- Graphics: Intel(R) Graphics (RPL-P), NVIDIA RTX 4000 Ada Generation Laptop GPU
OS: Fedora (Rawhide/41)
Nvidia Driver Version: 550.67 (open kernel module)
- Note: NVreg_EnableS0ixPowerManagement is set to 1
Mesa Version: 24.03 (Also tested on git main, commit fcb568a5d5a52db75fa2f6d04579bb404ca7f597)
vulkaninfo output: vulkaninfo.txt (193.4 KB)

I’ve recorded a few videos showing the symptoms in more detail - will share if needed.

jrelvas · May 4, 2024, 3:13pm

This issue is still present as of 550.78.

It currently isn’t that big of a deal, as most Vulkan programs want to run on the dgpu anyway…

However, it will be extremely important in the near future as Vulkan starts being adopted by programs and libraries which value power efficiency over performance.

For example, GTK4 has started using its Vulkan renderer by default with v4.15 (testing), in order to determine if it’s ready for the next production version, v4.16.

With this bug, all GTK4 programs wake up the nvidia dgpu during their initialization. This causes several seconds of delay before a window appears and reduces battery life on laptops.

GTK’s developers will likely not hold off on migrating to the Vulkan renderer because of nvidia driver bugs, so fixing this issue should be considered a priority.

jrelvas · May 5, 2024, 4:12pm

I’ve written a small program which reproduces the issue:

#include <stdio.h>
#include <unistd.h>
#include <time.h>

#include <vulkan/vulkan.h>

int main (void)
{
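  /* sType must be set for a valid VkInstanceCreateInfo; everything else stays zeroed. */
  const VkInstanceCreateInfo vkInstanceInfo = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };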
  VkInstance instance;

  struct timespec start, end;
  float delta_t;

  printf("calling vkCreateInstance...\n");

  clock_gettime(CLOCK_MONOTONIC_RAW, &start);
  vkCreateInstance(&vkInstanceInfo, NULL, &instance);
  clock_gettime(CLOCK_MONOTONIC_RAW, &end);

  /* Elapsed time in seconds: whole seconds plus the nanosecond difference. */
  delta_t = (end.tv_sec - start.tv_sec) + (float)(end.tv_nsec - start.tv_nsec) / (1000 * 1000000);
  printf("vkCreateInstance done in %.3fs\n", delta_t);

  sleep(30);

  printf("Exiting...\n");
  return 0;
}
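The elapsed-time arithmetic is easy to get wrong, because the nanosecond difference can be negative when the end time falls in a later second. A small helper makes it harder to mix up the fields. This is only a sketch; the name `timespec_elapsed_sec` is hypothetical and not part of the repro program.

```c
#include <time.h>

/* Hypothetical helper: seconds elapsed from *start to *end.
 * The nanosecond difference may be negative; signed subtraction
 * followed by the division handles that case correctly. */
double timespec_elapsed_sec(const struct timespec *start,
                            const struct timespec *end)
{
    double sec  = (double)(end->tv_sec - start->tv_sec);
    double nsec = (double)(end->tv_nsec - start->tv_nsec);
    return sec + nsec / 1e9;
}
```

In the program above it would be called as `timespec_elapsed_sec(&start, &end)` after the second clock_gettime.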

vkCreateInstance is the call which causes NVIDIA’s driver to wake up the gpu unnecessarily. vkEnumeratePhysicalDevices doesn’t appear to be affected.

Here’s an example of what the output looks like:

Running the program with nvidia_icd disabled works as expected. vkCreateInstance returns near-instantly.

Running the program normally adds a delay of around 2 seconds to vkCreateInstance. This is consistent with the time it takes for the dgpu to go into D0.

jrelvas · May 5, 2024, 4:30pm

Using gdb to debug reveals that the gpu wake-up happens inside a chain of functions called by nvidia’s vk_icdNegotiateLoaderICDInterfaceVersion.

Some of the IOCTLs sent by nvidia’s userspace appear to wake up the dgpu. The first three IOCTLs do not wake up the gpu; however, the fourth (and some after it) do.

jrelvas · May 22, 2024, 10:09pm

Still occurs as of 555.42.02.

jrelvas · June 27, 2024, 10:04am

Still occurs as of 555.52.04.

jrelvas · July 26, 2024, 10:49am

Issue persists as of 560.28.03.

amrits · July 26, 2024, 11:42am

Hi @jrelvas
Apologies for the late response. I have filed bug 4770124 internally for tracking purposes.

amrits · August 5, 2024, 11:57am

Hi @jrelvas
I performed tests on a couple of notebooks, but I could not reproduce the issue exactly locally.
Could you please share repro videos and an nvidia bug report for my reference?

jrelvas · August 5, 2024, 6:11pm

Of course!

Here’s a screen recording of the issue in question:

Left terminal shows the repro program’s output, while the right terminal shows the power state of the nvidia gpu. You’ll notice that the dgpu wakes up as soon as a Vulkan instance is created and that there’s an associated delay.

Make sure to perform this test when the GPU’s already at d3cold, otherwise you won’t be able to see it in effect.
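If you’d rather check the power state from code than watch a terminal, the kernel’s runtime PM status can be read from sysfs. The sketch below assumes the dgpu’s status file lives at a path like `/sys/bus/pci/devices/0000:01:00.0/power/runtime_status`; the PCI address is an assumption and will differ between machines. Both function names are hypothetical. Note that a `suspended` runtime status means the device is runtime-suspended, which is not by itself proof that it reached d3cold.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read the first line of a sysfs attribute into buf,
 * dropping the trailing newline. Returns 0 on success, -1 on failure. */
int read_sysfs_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    if (path == NULL || buf == NULL || len == 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Returns 1 if the runtime PM status reads "suspended",
 * 0 for any other status, and -1 if the file cannot be read. */
int gpu_is_runtime_suspended(const char *status_path)
{
    char status[32];

    if (read_sysfs_line(status_path, status, sizeof status) != 0)
        return -1;
    return strcmp(status, "suspended") == 0;
}
```

Calling `gpu_is_runtime_suspended` right before starting the repro program is a quick way to confirm the card was asleep when the test began.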

Here’s a tarball containing the source of the repro program:
nvidia-wakeup-repro.c.tar.gz (1.0 KB)

And here’s the nvidia bug report log you’ve asked for:
nvidia-bug-report.log.gz (424.2 KB)

amrits · August 6, 2024, 10:47am

Thanks @jrelvas for sharing the new source code. I was able to reproduce the reported issue internally.
The engineering team will now review it further.

amrits · August 7, 2024, 8:36am

Most Vulkan applications begin by enumerating all devices in the system and selecting one or more based on their capabilities. Currently, the NVIDIA driver must power on GPUs to discover their capabilities during this enumeration phase. The engineering team is investigating methods to perform these operations without powering on the GPU, but we cannot commit to an ETA for such a solution at this time.
