VK_DEVICE_LOST only on RTX devices

Hi

I’m fighting a random Vulkan crash.
I get VK_ERROR_DEVICE_LOST after a random amount of time (seconds to minutes) when I run my program, while doing nothing but waiting (no mouse movement, no keyboard input, etc.).
I use core 1.2 features only, with no RTX-specific extensions.

It seems to appear only on RTX cards.

  • my RTX 2080, Win10, latest drivers: crashes, mostly in < 2 min

  • friend's RTX 2070, Win10, latest drivers: crashes, mostly in < 2 min

  • client's RTX 2060, Win10, latest drivers: crashes, mostly in < 2 min

  • my GTX 980, Win10, latest drivers: runs for hours

  • my GTX 970M, Win10, latest drivers: runs for hours

  • friend's GTX 1080, Win10, latest drivers: runs for hours

  • my Core i7 IGP, Win10, latest drivers: runs for hours (but slowly :D)

My program runs like this:

  • update 2 buffers (Time, transform matrix)
  • draw cascaded shadows (multi view)
  • draw deferred
  • lighting and fx passes (many compute shaders)
  • render on screen
  • HUD (fps counter)

I tried reducing the GPU work, using only a few low-poly objects:

  • update 2 buffers (Time, transform matrix)
  • draw cascaded shadows (multi view)
  • no deferred draw
  • no compute shaders
  • no render on screen (yes, nothing is visible, not even a cleared screen)
  • HUD (fps counter) (it crashes with or without it)

It crashes randomly, reporting that the device is lost. In debug mode, my callback's debug break fires from all over my code: mostly in vkWaitForFences, and a few times in vkQueueSubmit. The error location is not always the same; it can happen anywhere from the first to the last instruction of a frame, and that seems to be random too.
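To make "which call reported the loss first" easier to pin down, a result check along these lines can help. This is only a minimal sketch, not the engine's actual code: `Result` is a stand-in for `VkResult` (with the same numeric values) so the snippet compiles without the Vulkan headers, and `FirstFailure`, `checkResult` and `VK_CHECK` are hypothetical names. In real code the macro would wrap calls such as `vkQueueSubmit` or `vkWaitForFences`.

```cpp
#include <cstdio>
#include <string>

// Stand-in for VkResult (assumption: used only to keep the sketch
// self-contained). Numeric values match the Vulkan headers.
enum Result : int {
    Success         = 0,   // VK_SUCCESS
    NotReady        = 1,   // VK_NOT_READY
    Timeout         = 2,   // VK_TIMEOUT
    ErrorDeviceLost = -4   // VK_ERROR_DEVICE_LOST
};

// Hypothetical record of the first call that returned an error code.
struct FirstFailure {
    bool        seen = false;
    Result      code = Success;
    std::string call;      // stringified call expression
    std::string file;
    int         line = 0;
};

static FirstFailure g_firstFailure;

// Negative codes are errors; positive codes (Timeout, NotReady) are
// status values and are passed through untouched.
// Only the first error is recorded, because once the device is lost
// later calls tend to fail as well and would hide the origin.
inline Result checkResult(Result r, const char* call, const char* file, int line) {
    if (r < 0 && !g_firstFailure.seen) {
        g_firstFailure.seen = true;
        g_firstFailure.code = r;
        g_firstFailure.call = call;
        g_firstFailure.file = file;
        g_firstFailure.line = line;
        std::fprintf(stderr, "%s:%d: %s returned %d\n",
                     file, line, call, static_cast<int>(r));
    }
    return r;
}

// Hypothetical macro: wraps a call and records where it was made.
#define VK_CHECK(expr) checkResult((expr), #expr, __FILE__, __LINE__)
```

Note that the call that reports the loss is not necessarily the one that caused it: the GPU fault happens asynchronously, and the host only learns about it at the next call that can return an error.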

The first frames do the initialisation; after that, nothing changes: the rendering is always exactly the same, frame after frame.

I tried:

  • several recent driver versions: no difference
  • the 2 most recent Vulkan SDKs: no difference

I have spent many days trying to find the problem, and I can’t keep working without resolving it. I’m really asking for help!

Demo:

  1. Run “RUN ME.exe” or “Engine/PortEye V5.exe”
    (never try to resize the window, it doesn’t work)

  2. See that nothing is drawn on screen (except the FPS counter)
    RV - 2021-08-17 17H18m20

  3. Wait (sometimes < 1 minute, sometimes 30 minutes… :-'( !!)
    I usually read things on the internet to kill time… The window can crash with or without having the focus

  4. Crash (press “Annuler” (Cancel) 4 times to quit properly)

Nsight Aftermath doesn’t give me any lead on how to resolve the crash.

So I suspect an RTX-specific driver issue.

I tried removing all command buffer fence synchronisation and calling vkDeviceWaitIdle before and after every CommandBuffer call instead; it still crashes randomly.
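The fully serialised test described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual engine code: `DeviceOps`, `Stage`, `SerializedOutcome` and `submitSerialized` are hypothetical names, `waitIdle` stands in for `vkDeviceWaitIdle`, `submit` stands in for a `vkQueueSubmit` of one batch, and "CommandBuffer call" is assumed to mean a submission. `Result` mirrors the numeric values of `VkResult` so the snippet compiles without the Vulkan headers.

```cpp
#include <functional>

// Stand-in for VkResult (assumption: keeps the sketch self-contained).
enum Result : int {
    Success         = 0,   // VK_SUCCESS
    ErrorDeviceLost = -4   // VK_ERROR_DEVICE_LOST
};

// Hypothetical thin interface over the two calls involved.
struct DeviceOps {
    std::function<Result()>    waitIdle;  // stands in for vkDeviceWaitIdle
    std::function<Result(int)> submit;    // stands in for vkQueueSubmit of batch i
};

enum class Stage { None, WaitBefore, Submit, WaitAfter };

struct SerializedOutcome {
    Result result = Success;
    int    batch  = -1;          // batch being processed when the error appeared
    Stage  stage  = Stage::None; // which of the three calls returned it
};

// Submits the batches one at a time with a full device wait on both
// sides of every submit, and stops at the first error code.
inline SerializedOutcome submitSerialized(const DeviceOps& dev, int batchCount) {
    SerializedOutcome out;
    for (int i = 0; i < batchCount; ++i) {
        out.batch = i;

        out.stage  = Stage::WaitBefore;
        out.result = dev.waitIdle();
        if (out.result < 0) return out;

        out.stage  = Stage::Submit;
        out.result = dev.submit(i);
        if (out.result < 0) return out;

        out.stage  = Stage::WaitAfter;
        out.result = dev.waitIdle();
        if (out.result < 0) return out;
    }
    // Every batch completed without an error code.
    out.batch = -1;
    out.stage = Stage::None;
    return out;
}
```

With this ordering the GPU is idle before each submit, so if the loss first shows up in the wait after batch `i`, that batch is in principle the prime suspect. It is only a debugging aid: it removes all CPU/GPU overlap and is far too slow for normal rendering.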

Capture from my client’s machine (RTX 2060):
RV - 2021-08-17 17H40m25

(@dwoods: I’m continuing the story from my previous thread here (Vulkan on nSight : 0xDCDCDCDC pop everywhere))