Artifact / Dota 2 (with vulkan) segfault on startup inside libnvidia-glcore

Dota 2 works fine with the OpenGL renderer, though.
Vulkaninfo: https://gist.github.com/theli-ua/f59cb2d5a2ea371a564ffa0ff0f40fc0

I get the same result with every driver version that builds against the 4.19 kernel (drivers in the 410 series, previous versions in the 415 series, etc.)

ERROR! VK call failed! result = VK_ERROR_OUT_OF_HOST_MEMORY ( vkCreateFence( VulkanDevice(), &fenceCreateInfo, NULL, pFence->GetPtr() ) )
ERROR! VK call failed! result = VK_ERROR_DEVICE_LOST ( vkQueueSubmit( pQueue, 1, &submitInfo, pFence->Get() ) )
ERROR! VK call failed! result = VK_ERROR_OUT_OF_HOST_MEMORY ( vkCreateFence( VulkanDevice(), &fenceCreateInfo, NULL, &presentFence.m_pFence ) )
ERROR! VK call failed! result = VK_ERROR_DEVICE_LOST ( vkQueueSubmit( VulkanQueue(), 1, &submitInfo, presentFence.m_pFence ) )

Thread 14 "VKRenderThread" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f653f54a700 (LWP 31254)]
0x00007f654cd1acb4 in ?? () from /usr/lib64/libnvidia-glcore.so.415.22
Argument list to give program being debugged when it is started is "-vulkan".
(gdb) bt
#0  0x00007f654cd1acb4 in ?? () from /usr/lib64/libnvidia-glcore.so.415.22
#1  0x00007f654260b542 in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/librendersystemvulkan.so
#2  0x00007f654260e010 in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/librendersystemvulkan.so
#3  0x00007f654258999b in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/librendersystemvulkan.so
#4  0x00007f654258a070 in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/librendersystemvulkan.so
#5  0x00007f65425d4aa9 in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/librendersystemvulkan.so
#6  0x00007f6548532e4e in ?? () from /home/theli/.local/share/Steam/steamapps/common/Artifact/game/bin/linuxsteamrt64/libtier0.so
#7  0x00007f654e88d0da in start_thread (arg=0x7f653f54a700) at pthread_create.c:486
#8  0x00007f654e5bfd1f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
[ 2595.167017] NVRM: Xid (PCI:0000:01:00): 31, Ch 00000036, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_14700000. Fault is of type FAULT_PDE ACCESS_TYPE_WRITE
[ 2599.327966] dota2[6673]: segfault at 60 ip 00007f0c78f97581 sp 00007ffda0b0e8b0 error 4 in libnvidia-glcore.so.415.22[7f0c77e38000+11de000]
[ 2599.327972] Code: c7 83 d0 01 00 00 06 19 00 00 48 8b 85 80 00 00 00 48 8b 80 c8 00 00 00 4a 8b 94 20 10 03 00 00 48 b8 ff ff ff ff ff ff ff 3f <48> 23 42 60 48 03 83 d8 02 00 00 48 89 03 f6 85 9c 00 00 00 04 0f
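
For anyone who wants to look up the faulting instruction, the kernel line above already contains enough to compute the offset inside the library. A rough Python sketch (the regex and the `library_offset` helper are just mine for illustration, not part of any tool); it assumes the bracketed values are the mapping base and size, so the offset is relative to that mapping:

```python
import re

# Kernel segfault line copied from the dmesg output above (timestamp dropped).
LINE = ("dota2[6673]: segfault at 60 ip 00007f0c78f97581 sp 00007ffda0b0e8b0 "
        "error 4 in libnvidia-glcore.so.415.22[7f0c77e38000+11de000]")

SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) .* in "
    r"(?P<lib>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def library_offset(line):
    """Return (library, offset of ip inside the mapping, faulting address)."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    size = int(m["size"], 16)
    offset = ip - base
    # Sanity check: the instruction pointer must lie inside the reported mapping.
    if not 0 <= offset < size:
        raise ValueError("ip is outside the reported mapping")
    return m["lib"], offset, int(m["addr"], 16)

lib, offset, fault_addr = library_offset(LINE)
print(f"{lib}+{offset:#x}, faulting address {fault_addr:#x}")
```

The faulting address is only 0x60, and the marked bytes in the Code: line (`<48> 23 42 60`) decode to a 64-bit AND with a memory operand at [rdx+0x60], so this looks like a read through a NULL pointer. That is my reading of the dump, not something the driver reports.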

Other Vulkan applications do work for me (e.g. DOOM (2016) with the Vulkan renderer in Wine, and some DX11 games with DXVK).

The weird thing is that if I capture a trace of Dota startup with vktrace, I can replay it with vkreplay just fine (it only gets as far as displaying the logo fullscreen, though). It just complains about unexpected successes:

vkreplay error: Return value VK_SUCCESS from API call (vkQueueWaitIdle) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 38, with global_packet_index 36472.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 37226.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 37, with global_packet_index 37230.
vkreplay error: Return value VK_SUCCESS from API call (vkDeviceWaitIdle) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 39, with global_packet_index 37231.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 38540.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 37, with global_packet_index 38544.
vkreplay error: Return value VK_SUCCESS from API call (vkDeviceWaitIdle) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 39, with global_packet_index 38545.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueWaitIdle) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 38, with global_packet_index 38547.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 44067.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 37, with global_packet_index 44126.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 48107.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 37, with global_packet_index 48152.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 50768.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 37, with global_packet_index 50784.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 50788.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
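
Since the vkreplay output is long and repetitive, here is a small Python sketch that groups the mismatches by API call and by the result recorded in the trace. The regex is written against the message format shown above, and `summarize` and `SAMPLE` are names I made up for illustration; SAMPLE holds only the first few lines of the output.

```python
import re
from collections import Counter

# Matches the "Return value X from API call (Y) does not match ... Z" lines.
MISMATCH_RE = re.compile(
    r"Return value (?P<replayed>\w+) from API call \((?P<call>\w+)\) "
    r"does not match return value from trace file (?P<traced>\w+)"
)

def summarize(log_text):
    """Count mismatches per (API call, traced result, replayed result)."""
    counts = Counter()
    for m in MISMATCH_RE.finditer(log_text):
        counts[(m["call"], m["traced"], m["replayed"])] += 1
    return counts

# First few lines of the vkreplay output above, used as sample input.
SAMPLE = """\
vkreplay error: Return value VK_SUCCESS from API call (vkQueueWaitIdle) does not match return value from trace file VK_ERROR_DEVICE_LOST.
vkreplay error: Failed to replay packet_id 38, with global_packet_index 36472.
vkreplay error: Return value VK_SUCCESS from API call (vkCreateFence) does not match return value from trace file VK_ERROR_OUT_OF_HOST_MEMORY.
vkreplay error: Failed to replay packet_id 54, with global_packet_index 37226.
vkreplay error: Return value VK_SUCCESS from API call (vkQueueSubmit) does not match return value from trace file VK_ERROR_DEVICE_LOST.
"""

counts = summarize(SAMPLE)
for (call, traced, replayed), n in sorted(counts.items()):
    print(f"{call}: traced {traced}, replayed {replayed} ({n}x)")
```

Run against the full output, this should show that every mismatch is a call that failed in the game but succeeded on replay.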

nvidia-bug-report.log.gz (524 KB)

I dumped a full API log interleaved with validation layer messages.

At some point the render thread ends a command buffer, frees a bunch of memory, destroys objects, destroys a fence,
then calls vkQueueWaitIdle, and this is where things go south:

Thread 0, Frame 1:
vkQueueWaitIdle(queue) returns VkResult VK_ERROR_DEVICE_LOST (-4):
    queue:                          VkQueue = 0x55750a1122d0

VUID-vkDestroySampler-sampler-01082(ERROR / SPEC): msgNum: 0 - Cannot call vkDestroySampler on Sampler 0x195c that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to sampler must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroySampler-sampler-01082)
    Objects: 1
       [0] 0x195c, type: 21, name: (null)
Validation(ERROR): msg_code: 0:  [ VUID-vkDestroySampler-sampler-01082 ] Object: 0x195c (Type = 21) | Cannot call vkDestroySampler on Sampler 0x195c that is currently in use by a command buffer. The Vulkan spec states: All submitted commands that refer to sampler must have completed execution (https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#VUID-vkDestroySampler-sampler-01082)
Thread 0, Frame 1:
vkDestroySampler(device, sampler, pAllocator) returns void:
    device:                         VkDevice = 0x55750a2c3750
    sampler:                        VkSampler = 0x195c
    pAllocator:                     const VkAllocationCallbacks* = NULL

api_dump_.log.gz (1.06 MB)

Another weird thing is that pipelineCacheUUID is returned as all 0xFF bytes:

vkGetPhysicalDeviceProperties(physicalDevice, pProperties) returns void:
    physicalDevice:                 VkPhysicalDevice = 0x55b2a9698060
    pProperties:                    VkPhysicalDeviceProperties* = 0x7ffdcefd1ee0:
        apiVersion:                     uint32_t = 4198484
        driverVersion:                  uint32_t = 1741012992
        vendorID:                       uint32_t = 4318
        deviceID:                       uint32_t = 5080
        deviceType:                     VkPhysicalDeviceType = VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU (2)
        deviceName:                     char[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE] = "GeForce GTX 970M"
        pipelineCacheUUID:              uint8_t[VK_UUID_SIZE] = 0x7ffdcefd1ff4
            pipelineCacheUUID[0]:           uint8_t = 255
            pipelineCacheUUID[1]:           uint8_t = 255
            pipelineCacheUUID[2]:           uint8_t = 255
            pipelineCacheUUID[3]:           uint8_t = 255
            pipelineCacheUUID[4]:           uint8_t = 255
            pipelineCacheUUID[5]:           uint8_t = 255
            pipelineCacheUUID[6]:           uint8_t = 255
            pipelineCacheUUID[7]:           uint8_t = 255
            pipelineCacheUUID[8]:           uint8_t = 255
            pipelineCacheUUID[9]:           uint8_t = 255
            pipelineCacheUUID[10]:          uint8_t = 255
            pipelineCacheUUID[11]:          uint8_t = 255
            pipelineCacheUUID[12]:          uint8_t = 255
            pipelineCacheUUID[13]:          uint8_t = 255
            pipelineCacheUUID[14]:          uint8_t = 255
            pipelineCacheUUID[15]:          uint8_t = 255

while normally it is:

Thread 0, Frame 0:
vkGetPhysicalDeviceProperties(physicalDevice, pProperties) returns void:
    physicalDevice:                 VkPhysicalDevice = 0x561b0f96d7c0
    pProperties:                    VkPhysicalDeviceProperties* = 0x7ffd588756f0:
        apiVersion:                     uint32_t = 4198484
        driverVersion:                  uint32_t = 1741012992
        vendorID:                       uint32_t = 4318
        deviceID:                       uint32_t = 5080
        deviceType:                     VkPhysicalDeviceType = VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU (2)
        deviceName:                     char[VK_MAX_PHYSICAL_DEVICE_NAME_SIZE] = "GeForce GTX 970M"
        pipelineCacheUUID:              uint8_t[VK_UUID_SIZE] = 0x7ffd58875804
            pipelineCacheUUID[0]:           uint8_t = 165
            pipelineCacheUUID[1]:           uint8_t = 149
            pipelineCacheUUID[2]:           uint8_t = 15
            pipelineCacheUUID[3]:           uint8_t = 57
            pipelineCacheUUID[4]:           uint8_t = 100
            pipelineCacheUUID[5]:           uint8_t = 233
            pipelineCacheUUID[6]:           uint8_t = 154
            pipelineCacheUUID[7]:           uint8_t = 48
            pipelineCacheUUID[8]:           uint8_t = 235
            pipelineCacheUUID[9]:           uint8_t = 161
            pipelineCacheUUID[10]:          uint8_t = 242
            pipelineCacheUUID[11]:          uint8_t = 208
            pipelineCacheUUID[12]:          uint8_t = 202
            pipelineCacheUUID[13]:          uint8_t = 105
            pipelineCacheUUID[14]:          uint8_t = 48
            pipelineCacheUUID[15]:          uint8_t = 191
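
To make the two dumps easier to compare, here is a rough Python sketch that decodes the numeric fields and flags the suspicious UUID. The apiVersion split follows the standard Vulkan version packing (major in the top bits, 10-bit minor, 12-bit patch). The helper names are my own, and treating an all-0xFF UUID as "looks uninitialized" is only my assumption, not something the spec says.

```python
def decode_api_version(version):
    """Split a packed Vulkan apiVersion into (major, minor, patch)."""
    return (version >> 22) & 0x7F, (version >> 12) & 0x3FF, version & 0xFFF

def looks_uninitialized(uuid_bytes):
    """Heuristic (my assumption): every byte of the UUID is 0xFF."""
    return all(b == 0xFF for b in uuid_bytes)

# Values copied from the two vkGetPhysicalDeviceProperties dumps above.
API_VERSION = 4198484
VENDOR_ID = 4318
DEVICE_ID = 5080
BAD_UUID = [255] * 16
GOOD_UUID = [165, 149, 15, 57, 100, 233, 154, 48,
             235, 161, 242, 208, 202, 105, 48, 191]

major, minor, patch = decode_api_version(API_VERSION)
print(f"apiVersion {major}.{minor}.{patch}")
print(f"vendorID {VENDOR_ID:#06x}, deviceID {DEVICE_ID:#06x}")
print("bad dump uninitialized?", looks_uninitialized(BAD_UUID))
print("good dump uninitialized?", looks_uninitialized(GOOD_UUID))
```

Everything apart from the UUID is identical between the two dumps, and 0x10de is NVIDIA's PCI vendor ID, so both calls do reach the NVIDIA driver.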

The OUT_OF_HOST_MEMORY error seems similar to https://devtalk.nvidia.com/default/topic/1026059/vulkan/vkcreateswapchainkhr-returns-vk_error_out_of_host_memory-on-950m-with-387-22/ and mine is also an Optimus laptop.

With a little more debugging I found that the Xid errors, e.g.

[ 6628.113290] NVRM: Xid (PCI:0000:01:00): 31, Ch 0000003e, intr 10000000. MMU Fault: ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_10e20000. Fault is of type FAULT_PDE ACCESS_TYPE_WRITE

are triggered by one of the vkGetFenceStatus calls: the driver prints that message to dmesg and the application receives VK_NOT_READY. After that everything is pretty much broken, ending with DEVICE_LOST on vkQueueWaitIdle.
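
To compare the two Xid messages in this thread, here is a small Python sketch that pulls the fields out of the NVRM lines. The regex is fitted to these two messages only, `parse_xid` is a name I made up, and reading "0x1_10e20000" as one address with the underscore as a separator is my assumption.

```python
import re

XID_RE = re.compile(
    r"NVRM: Xid \((?P<pci>[^)]+)\): (?P<xid>\d+), Ch (?P<channel>[0-9a-f]+)"
    r".*?ENGINE (?P<engine>\w+) (?P<client>\w+) faulted @ (?P<addr>0x[0-9a-f_]+)\. "
    r"Fault is of type (?P<fault>\w+) (?P<access>\w+)"
)

def parse_xid(line):
    """Extract the interesting fields from an NVRM Xid MMU fault line."""
    m = XID_RE.search(line)
    if m is None:
        raise ValueError("not an NVRM Xid MMU fault line")
    fields = m.groupdict()
    fields["xid"] = int(fields["xid"])
    # Assumption: the underscore only separates the high and low address parts.
    fields["addr"] = int(fields["addr"].replace("_", ""), 16)
    return fields

# The two Xid lines quoted in this thread (timestamps dropped).
LINES = [
    "NVRM: Xid (PCI:0000:01:00): 31, Ch 00000036, intr 10000000. MMU Fault: "
    "ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_14700000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_WRITE",
    "NVRM: Xid (PCI:0000:01:00): 31, Ch 0000003e, intr 10000000. MMU Fault: "
    "ENGINE GRAPHICS GPCCLIENT_PROP_0 faulted @ 0x1_10e20000. "
    "Fault is of type FAULT_PDE ACCESS_TYPE_WRITE",
]

parsed = [parse_xid(line) for line in LINES]
for p in parsed:
    print(p["xid"], p["channel"], p["client"], hex(p["addr"]), p["fault"], p["access"])
```

Both faults are Xid 31 with the same client, fault type and access type. Only the channel and the address differ.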

Downgrading all the way back to 396.54 stopped the “NVRM: Xid … MMU Fault” messages, but the segfaults are still there :(

I have solved this issue by removing the non-GLVND OpenGL libraries from my nvidia-drivers install.