Vulkan program crashes on GTX 1080

I’m developing a Vulkan renderer for Minecraft. You can see all my code at https://github.com/NovaMods/nova-renderer/tree/vulkan (note that the Vulkan code is on the vulkan branch). I ran into an issue that’s held me up for over a month and I’m at my wit’s end here.

The observed behavior is this:

  • I have some GPU timestamp queries to time each of my renderpasses. I set the timestamp retrieval code to wait for the timestamps to become available. Some waiting is expected, but I’ve observed my app waiting for multiple minutes
  • Maybe the timestamps are just taking that long, but I need my code to run. I commented out all the GPU queries and ran again… and my program crashed, on the vkWaitForFences call when I was waiting for a command buffer that copied data from a staging buffer to a texture. I tried commenting that code out…
  • …and the crash moved to the vkWaitForFences call when I was waiting for my main command buffer. Something is weird here
  • I added a watchdog in a separate thread. When I submit a command buffer, I also tell the watchdog to poll the status of its fence. The watchdog polls super often - 1000 times a second - and prints out the fence’s status every time it polls, until the fence gets signaled (this is a bad idea for production code, but I’m debugging). The watchdog shows that the fence gets added, waits a little bit until the fence gets signaled, then reports the fence as signaled… and that’s the last thing in my logs.
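
For reference, here’s a stripped-down sketch of what that watchdog does. The names are placeholders rather than my actual code, and it assumes a valid VkDevice handle:

```cpp
#include <vulkan/vulkan.h>
#include <chrono>
#include <cstdio>
#include <thread>

// Debug-only watchdog: polls a fence ~1000 times a second and logs its
// status until it becomes signaled. Not something you'd ship.
void watch_fence(VkDevice device, VkFence fence, const char* label) {
    std::thread([device, fence, label] {
        for (;;) {
            const VkResult status = vkGetFenceStatus(device, fence);
            std::printf("[watchdog] fence %s is %s\n", label,
                        status == VK_SUCCESS ? "signaled" : "unsignaled");
            if (status == VK_SUCCESS) {
                break;  // fence signaled; stop polling
            }
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    }).detach();
}
```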

As near as I can tell, what happens is this: the command buffer finishes, its fence gets signaled, and then whatever code in Nvidia’s implementation of vkWaitForFences executes once the fence is signaled segfaults (I’ve seen exit code 3, which Google tells me comes from the C library’s abort function).

However, I don’t get this crash immediately. My program can render Minecraft’s GUI just fine, and it can render a world for about a second before this error occurs. It seems to happen when there are 50-60 chunks loaded. I have no idea if that’s actually relevant.

I’ve observed this behavior on Windows 10 64-bit, build 1803. I have a GTX 1080 with driver 398.82 and an Intel i7 6700 CPU.

Even though my project is early in development from a features standpoint, a few people other than me have downloaded it and gotten it to run. Some people get a crash like I do, but some are able to run it without issue. The people who can run it seem to be on Linux or using an AMD GPU - although there’s at least one person who had no issues on a GTX 1080 on 64-bit Windows.

I do not have a minimal reproduction case. If I did, that would imply that I have some idea of which line of code is causing the problem.

I do not know the set of hardware/software versions that my program runs on

Any help that anyone can give me would be appreciated. I’ve reached the limits of my debugging skills here

I’m not familiar with Minecraft modding, but I browsed the C++ code a bit. Here are some things I noticed right off the bat:

  1. You seem to be using VK_EXT_debug_report; this extension is deprecated in favor of VK_EXT_debug_utils. I’m not sure if the validation layers care, but that is the first thing I would change, just to make sure you get the latest, most up-to-date info from the validation layers.
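
For what it’s worth, here is a minimal sketch of setting up a debug-utils messenger, assuming VK_EXT_DEBUG_UTILS_EXTENSION_NAME was enabled at instance creation (the callback and function names are examples, not from your repo):

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

static VKAPI_ATTR VkBool32 VKAPI_CALL debug_callback(
        VkDebugUtilsMessageSeverityFlagBitsEXT severity,
        VkDebugUtilsMessageTypeFlagsEXT type,
        const VkDebugUtilsMessengerCallbackDataEXT* data,
        void* user_data) {
    std::fprintf(stderr, "[vulkan] %s\n", data->pMessage);
    return VK_FALSE;  // never abort the call that triggered the message
}

VkDebugUtilsMessengerEXT create_messenger(VkInstance instance) {
    VkDebugUtilsMessengerCreateInfoEXT info = {};
    info.sType = VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT;
    info.messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT |
                           VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT;
    info.messageType = VK_DEBUG_UTILS_MESSAGE_TYPE_GENERAL_BIT_EXT |
                       VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT |
                       VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT;
    info.pfnUserCallback = debug_callback;

    // Extension entry points must be loaded dynamically.
    auto create_fn = reinterpret_cast<PFN_vkCreateDebugUtilsMessengerEXT>(
        vkGetInstanceProcAddr(instance, "vkCreateDebugUtilsMessengerEXT"));
    VkDebugUtilsMessengerEXT messenger = VK_NULL_HANDLE;
    if (create_fn != nullptr) {
        create_fn(instance, &info, nullptr, &messenger);
    }
    return messenger;
}
```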

  2. Your main rendering loop seems to be single buffered. That is, you acquire from the swapchain, write your cb, submit with the semaphores, wait on the fence, and then reset your cb. This is not going to be fast, ever: you are serializing your CPU and GPU work. You want the CPU and GPU to run in parallel, so you need at least double buffering on the CPU side. This means at least: 2 command pools, 2 command buffers, 2 buffer regions for matrices etc., most likely 2 copies of each descriptor set, and 2 fences. That is in addition to fixed cbs and fences for each image in the swapchain. If you use double buffering, you need to prepare the first cb before you start rendering, and then run a loop kinda like this (see the sketch after this list):

  1. submit frame i%2
  2. wait for frame fence (i+1)%2
  3. reset pool (i+1)%2 and record frame into (i+1)%2
  4. acquire swapchain to index j
  5. wait on swapchain fence j
  6. submit the static swapchain cb j that blits the render result from your final render target into swapchain image j. This submit waits on the acquire semaphore and signals the present semaphore
  7. Present index j, ++i

To test that you got the render loop right, just have a dummy record-frame function that clears the screen to green or something, so you can verify that the fences etc. work.
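
To make the ordering concrete, here is a rough sketch of that loop. It is a sketch under assumptions, not your actual code: FrameSlot and record_frame are hypothetical, the fences are assumed to be created with VK_FENCE_CREATE_SIGNALED_BIT, and the swapchain steps 4-7 are elided:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>

// Hypothetical per-frame resources: two slots, so the CPU can record frame
// (i+1)%2 while the GPU executes frame i%2.
struct FrameSlot {
    VkCommandPool pool;
    VkCommandBuffer cb;
    VkFence fence;  // assumed created with VK_FENCE_CREATE_SIGNALED_BIT
    // ...plus per-frame uniform buffers, descriptor sets, and so on
};

void record_frame(FrameSlot& slot);  // hypothetical; records into slot.cb

void render_loop(VkDevice device, VkQueue queue, FrameSlot slots[2]) {
    record_frame(slots[0]);  // prepare the first cb before rendering starts
    for (uint32_t i = 0;; ++i) {
        FrameSlot& cur  = slots[i % 2];
        FrameSlot& next = slots[(i + 1) % 2];

        // 1. Submit frame i%2, signaling its fence when the GPU finishes.
        vkResetFences(device, 1, &cur.fence);
        VkSubmitInfo submit = {};
        submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        submit.commandBufferCount = 1;
        submit.pCommandBuffers = &cur.cb;
        vkQueueSubmit(queue, 1, &submit, cur.fence);

        // 2. Wait for frame fence (i+1)%2, so the other slot's pool and
        //    per-frame resources are safe to reuse.
        vkWaitForFences(device, 1, &next.fence, VK_TRUE, UINT64_MAX);

        // 3. Reset pool (i+1)%2 and record the next frame into it.
        vkResetCommandPool(device, next.pool, 0);
        record_frame(next);

        // 4.-7. Acquire a swapchain image, wait on that image's fence,
        // submit the static blit cb for it (waiting on the acquire
        // semaphore, signaling the present semaphore), then present.
        // Elided to keep the sketch focused on the frame fences.
    }
}
```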

  3. Your resource/texture upload code is blocking. I think Minecraft stages all resources up front, so if you are fine with freezing the loading screen until you are done, I guess it works. First, command_buffer::begin_as_single_commend should be begin_as_single_command (I’m sorry, it was really hard to ignore). Second, NUM_FRAME_DATA is a macro, but you use a vector in the constructor (you could use std::array) and hardcode indices 0 and 1 in the destructor (that one’s bad). The command pools seem to be per-thread, but you seem to have one cmd buffer per texture to upload. I’ll just flat out say that if you have any multi-threading (and it seems like you at least WANT that, with the cmd pools keyed by thread IDs), you will run into a giant pit of hell with this design. You cannot use a queue from 2 threads in parallel, so your entire staging code, even for 2 different textures, is not thread-safe. You will need a massive redesign if you ever want that (one way to at least serialize submits is sketched below).
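
For illustration, one way to serialize submits across threads (a sketch, not code from your repo; each thread still records into its own pool):

```cpp
#include <vulkan/vulkan.h>
#include <mutex>

// VkQueue access must be externally synchronized: even if every thread has
// its own command pool, vkQueueSubmit calls on the same queue must not run
// concurrently, so funnel them all through one lock.
class SynchronizedQueue {
public:
    explicit SynchronizedQueue(VkQueue queue) : queue_(queue) {}

    VkResult submit(uint32_t submit_count, const VkSubmitInfo* submits,
                    VkFence fence) {
        std::lock_guard<std::mutex> lock(mutex_);
        return vkQueueSubmit(queue_, submit_count, submits, fence);
    }

private:
    VkQueue queue_;
    std::mutex mutex_;
};
```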

According to the wiki, compiling your code needs MinGW, so I cannot test it (I use MSVC 2017 on Win7 x64 with a GTX 970). But I have never observed any crash with fences in my own code, even with multiple threads and queues, so I don’t think your problem is a driver bug. Just remove code from your command buffers until they draw nothing but a green screen, then add stuff back in until it crashes (a dummy clear recording is sketched below).
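
As a concrete starting point for that bisection, a dummy recording that only clears an image to green could look roughly like this (the image handle, layouts, and lack of error handling are assumptions about your setup, not your actual code):

```cpp
#include <vulkan/vulkan.h>

// Hypothetical "draw nothing" recording: transition the image for transfer,
// clear it to green, then transition it for presentation.
void record_clear_to_green(VkCommandBuffer cb, VkImage image) {
    VkCommandBufferBeginInfo begin = {};
    begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    vkBeginCommandBuffer(cb, &begin);

    VkImageSubresourceRange range = {};
    range.aspectMask = VK_IMAGE_ASPECT_COLOR_BIT;
    range.levelCount = 1;
    range.layerCount = 1;

    VkImageMemoryBarrier to_dst = {};
    to_dst.sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    to_dst.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    to_dst.oldLayout = VK_IMAGE_LAYOUT_UNDEFINED;
    to_dst.newLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    to_dst.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    to_dst.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    to_dst.image = image;
    to_dst.subresourceRange = range;
    vkCmdPipelineBarrier(cb, VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT,
                         VK_PIPELINE_STAGE_TRANSFER_BIT, 0,
                         0, nullptr, 0, nullptr, 1, &to_dst);

    const VkClearColorValue green = {{0.0f, 1.0f, 0.0f, 1.0f}};
    vkCmdClearColorImage(cb, image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                         &green, 1, &range);

    VkImageMemoryBarrier to_present = to_dst;
    to_present.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    to_present.dstAccessMask = 0;
    to_present.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    to_present.newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR;
    vkCmdPipelineBarrier(cb, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0,
                         0, nullptr, 0, nullptr, 1, &to_present);

    vkEndCommandBuffer(cb);
}
```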

Regards