364.19 Linux/X11 - Presenting from more than 2 queues causes hangs/VK_ERROR_DEVICE_LOST.

HypnoGenX · May 20, 2016, 8:44pm

Using nVidia driver 364.19 for Linux with a GTX 780, the implementation exposes one queue family that supports graphics work, with 16 queues. Even when I do nothing but submit a simple image-clearing command buffer and present on that same queue (then wait for queue idle), problems start to occur as soon as more than two queues within that single queue family are used.

I ran into this while experimenting with multiple output windows (on a single monitor), with a single swapchain for each surface, and a dedicated queue per swapchain, but the problem also occurs using only one surface and one swapchain.

As soon as commands/present-calls are sent to more than two queues, performance grinds to a halt after precisely six frames, regardless of the number of queues used or the size of the swapchain. Depending on whether the swapchain has exactly two images or more than two, I either get an indefinite hang while waiting on the single submit’s fence, or VK_ERROR_DEVICE_LOST from the submit or present calls.

Below I’ve linked the smallest fail case I could come up with, with more details:

http://pastebin.com/XxKFGSpB

Am I doing something wrong? I’m still shaky on a lot of the details of Vulkan, and I’ve been looking at this problem for way too long to actually spot anything anymore, so I can’t rule out some dumb mistake on my part.

As far as I’ve been able to find, the Vulkan specification doesn’t specifically state whether or not presentation has to be limited to a single queue (just that the queue used has to be from a family compatible with the presentation surface). But whether this signifies an issue with the specification, or if it counts as a bug in the nVidia implementation, is above my paygrade.

HypnoGenX · May 21, 2016, 12:51pm

I’ve done some more experimentation around this issue and discovered that the fail case can be shifted from three queues up to four queues by replacing the call to vkQueueWaitIdle() with a fence-only vkQueueSubmit() call (zero command buffers), immediately followed by a vkWaitForFences() busy loop.

Without the use of vkQueueWaitIdle(), the program runs fine for any number of swapchain images presented across three queues, and only fails when a fourth queue is introduced.
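
For readers who want to try the fence-based wait described above without risking an indefinite hang, here is a minimal sketch of a bounded polling loop. It is not taken from the pastebin code. To keep it compilable without the Vulkan SDK, the fence check is hidden behind a hypothetical callback (`poll_fn`); every name in the block (`bounded_busy_wait`, `countdown_poll`, the enums) is made up for illustration. In a real program the callback would presumably wrap vkWaitForFences() with a timeout of zero (or vkGetFenceStatus()), reporting VK_SUCCESS as ready, VK_TIMEOUT or VK_NOT_READY as not ready, and anything else, such as VK_ERROR_DEVICE_LOST, as an error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Outcome of one non-blocking status check. */
typedef enum { POLL_READY, POLL_NOT_READY, POLL_ERROR } poll_status;

/* Hypothetical callback: performs ONE non-blocking check of the fence.
 * Assumption: a Vulkan version would call vkWaitForFences(..., 0) here. */
typedef poll_status (*poll_fn)(void *ctx);

/* Result of the whole wait. */
typedef enum { WAIT_OK, WAIT_GAVE_UP, WAIT_FAILED } wait_result;

/* Poll until the check reports ready or error, or until max_polls
 * attempts have been used up, whichever comes first. */
wait_result bounded_busy_wait(poll_fn poll, void *ctx, uint32_t max_polls)
{
    assert(poll != NULL);
    for (uint32_t i = 0; i < max_polls; ++i) {
        switch (poll(ctx)) {
        case POLL_READY:     return WAIT_OK;
        case POLL_ERROR:     return WAIT_FAILED;
        case POLL_NOT_READY: break; /* keep spinning */
        }
    }
    return WAIT_GAVE_UP; /* fence never signalled within the budget */
}

/* Stand-in for a real fence, for demonstration only: reports ready
 * once the counter pointed to by ctx has been decremented to zero. */
poll_status countdown_poll(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0)
        return POLL_READY;
    --*remaining;
    return POLL_NOT_READY;
}
```

Calling `bounded_busy_wait(countdown_poll, &n, limit)` with a counter `n` smaller than `limit` returns `WAIT_OK`; with a counter that never reaches zero within `limit` polls it returns `WAIT_GAVE_UP`. A cap like this would not fix the driver problem, but it should turn the hang into something the test program can report.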

I also experimented with using different sets of consecutive queue indices, and random selections of non-consecutive queues within the queue family, but that hasn’t had any bearing on the issue.

Updated version that replaces vkQueueWaitIdle() with vkQueueSubmit() and vkWaitForFences():

http://pastebin.com/8C8P5eBs

Edit update: Intentionally sleeping between frames (for anywhere from 1ms to 1000ms) has no influence on the issue either, so it doesn’t seem to be sensitive to timing.

HypnoGenX · June 13, 2016, 12:10pm

It’s been three weeks without a response or a fix, and the issue persists. I’ve since tested this on Windows 10 with driver version 365.10, and the problem never manifested. Presenting to any number of queues works just fine there.

While I was at it I ran into some odd disparities between the capabilities reported by Vulkan on the two platforms, on the same machine with a GTX 780 graphics card:

On Windows:

- Queue families supported: 1
    Queue family 0
      - Supports: Graphics Compute Transfer SparseBinding
      - Queues: 16
- Device has 3 available presentation mode(s):
    VK_PRESENT_MODE_FIFO_KHR
    VK_PRESENT_MODE_FIFO_RELAXED_KHR
    VK_PRESENT_MODE_MAILBOX_KHR
- Surface Capabilities:
    - Minimum swapchain image count: 1

On Linux under X11:

- Queue families supported: 2
    Queue family 0
      - Supports: Graphics Compute Transfer SparseBinding
      - Queues: 16
    Queue family 1
      - Supports: Transfer 
      - Queues: 1
- Device has 3 available presentation mode(s):
    VK_PRESENT_MODE_FIFO_KHR
    VK_PRESENT_MODE_FIFO_RELAXED_KHR
    VK_PRESENT_MODE_IMMEDIATE_KHR
- Surface Capabilities:
    - Minimum swapchain image count: 2

The change from MAILBOX to IMMEDIATE mode support I would assume could be chalked up to platform differences. The minimum swapchain size I’m less sure about, and the difference in queue families I find downright weird.
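
For anyone reproducing the listings above: the "Supports" lines correspond to bits in the queueFlags field of VkQueueFamilyProperties. Below is a minimal sketch of how such a line could be produced. It is not the code that generated the listings. The bit values are the ones defined for VkQueueFlagBits in the Vulkan headers, repeated under local names so the block compiles without the SDK; `describe_queue_flags` and the `QUEUE_*` macros are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same values as VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT,
 * VK_QUEUE_TRANSFER_BIT and VK_QUEUE_SPARSE_BINDING_BIT; local copies
 * so this sketch does not need <vulkan/vulkan.h>. */
#define QUEUE_GRAPHICS_BIT       0x00000001u
#define QUEUE_COMPUTE_BIT        0x00000002u
#define QUEUE_TRANSFER_BIT       0x00000004u
#define QUEUE_SPARSE_BINDING_BIT 0x00000008u

/* Write a space-separated list of capability names for `flags` into
 * `out` (always NUL-terminated). Names that do not fit are dropped.
 * Returns `out` for convenience. */
const char *describe_queue_flags(uint32_t flags, char *out, size_t out_size)
{
    static const struct { uint32_t bit; const char *name; } names[] = {
        { QUEUE_GRAPHICS_BIT,       "Graphics"      },
        { QUEUE_COMPUTE_BIT,        "Compute"       },
        { QUEUE_TRANSFER_BIT,       "Transfer"      },
        { QUEUE_SPARSE_BINDING_BIT, "SparseBinding" },
    };
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    for (size_t i = 0; i < sizeof names / sizeof names[0]; ++i) {
        if (!(flags & names[i].bit))
            continue;
        size_t len = strlen(names[i].name);
        size_t sep = used ? 1 : 0;           /* space before all but the first */
        if (used + sep + len + 1 > out_size) /* +1 for the terminator */
            break;
        if (sep)
            out[used++] = ' ';
        memcpy(out + used, names[i].name, len);
        used += len;
        out[used] = '\0';
    }
    return out;
}
```

With a 64-byte buffer, flags of 0xF give the string shown for queue family 0 in both listings, and 0x4 gives the "Transfer" of the extra Linux-only family 1.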

HypnoGenX · June 16, 2016, 9:11am

nVidia driver 367.27, Linux kernel 4.6.2: instead of fixing the problem, this new driver makes it impossible to use more than a single queue without getting a VK_ERROR_DEVICE_LOST.

On top of that, vkAllocateMemory() now segmentation faults where it didn’t before, and there seems to be an undocumented new extension called ‘VK_NV_dedicated_allocation’.

Edit update: vkAllocateMemory()'s crash was a case of PEBKAC. I was relying on broken behavior that apparently has been fixed.

ManuDev · June 23, 2016, 8:54am

From where did you get 367.27 for Linux? The nVidia download page is still at 367.18.

Thanks!

Edit: I was only looking at the Vulkan driver page, not realizing that the official Linux driver page seems to have a more recent version of the driver. Am I right that 367.27 is a replacement for 367.18?

HypnoGenX · June 24, 2016, 12:14pm

I rely on Gentoo’s package maintainers to keep my driver up to date, so I never had to download the drivers directly myself. As such I also have no idea whether nVidia maintains different branches or not, but I trust the number increment to mean ‘improved’.

As an aside, I never got a notification that there was a reply to this thread. Between never hearing a peep back about my issue, and the apparently broken notifications, these forums are feeling kind of neglected by nVidia. :-\

Cord · August 3, 2016, 4:08pm

HypnoGenX, I’m seeing the same issue in my code. I’m not using multiple queues, because the Quadro K600 I’m using supports graphics and present from queue family 0, but I am running Ubuntu 14.04.1 with kernel 3.13.0-92-generic and nVidia driver 367.35.

I’m happy to boot into gentoo (I’m a gentoo fan) and chase this issue down.

Can you contact me via email? I PMed you my email address. Or, would you open an issue at KhronosGroup/Vulkan-LoaderAndValidationLayers on GitHub? I follow that pretty closely.

tl;dr maybe the same issue, I’m only submitting on 1 queue, getting VK_ERROR_DEVICE_LOST.

Cord · August 3, 2016, 7:37pm

Ok, it sounds like this is not the same issue.

HypnoGenX · August 17, 2016, 10:02am

Linux kernel 4.7.1, new nVidia drivers 370.23.

The bug seems to be fixed. I can now present any number of swapchain images on any number of queues without issue.

And it looks like the release notes confirm it:

https://devtalk.nvidia.com/default/topic/957782/unix-graphics-announcements-and-news/linux-solaris-and-freebsd-driver-370-23-beta-/
