Driver hang on exit due to timing issue with CUDA driver API, OptiX, and Vulkan external memory

This could be a difficult one to nail down, but I’d really like to understand what’s going on here.

I have an application that uses OptiX to render into a Vulkan buffer shared via external memory (OPAQUE_FD). Vulkan then copies that buffer to a VkImage and samples it onto a fullscreen triangle for presentation. All Vulkan work happens on the main thread, including calling close() on the external memory FD, and all CUDA/OptiX work (including cuInit(), creating the context, and destroying it) happens on a second thread, which is joined before shutdown, so I know all of that completes successfully.
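For reference, the export/import path looks roughly like this. This is a minimal sketch, not my exact code: device, bufferMemory, and bufferSize are placeholders, error handling is omitted, and it assumes VK_KHR_external_memory_fd is enabled and the allocation was created with VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT:

// Vulkan side (main thread): export an FD for the shared allocation.
// vkGetMemoryFdKHR is an extension entry point (load via vkGetDeviceProcAddr).
VkMemoryGetFdInfoKHR getFdInfo = {};
getFdInfo.sType      = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
getFdInfo.memory     = bufferMemory;  // VkDeviceMemory backing the shared buffer
getFdInfo.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
int fd = -1;
vkGetMemoryFdKHR(device, &getFdInfo, &fd);

// CUDA side (second thread): import the FD and map a device pointer.
CUDA_EXTERNAL_MEMORY_HANDLE_DESC memDesc = {};
memDesc.type      = CU_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD;
memDesc.handle.fd = fd;
memDesc.size      = bufferSize;
CUexternalMemory extMem = nullptr;
cuImportExternalMemory(&extMem, &memDesc);

CUDA_EXTERNAL_MEMORY_BUFFER_DESC bufDesc = {};
bufDesc.offset = 0;
bufDesc.size   = bufferSize;
CUdeviceptr devPtr = 0;
cuExternalMemoryGetMappedBuffer(&devPtr, extMem, &bufDesc);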

Everything works as expected, except that if I import the external memory on the CUDA side before the first call to vkQueuePresentKHR(), the application hangs on exit:

0x00007ffff24bed2d in __GI___pthread_timedjoin_ex (threadid=140735360579328, thread_return=0x0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
89	pthread_join_common.c: No such file or directory.
(gdb) bt
#0  0x00007ffff24bed2d in __GI___pthread_timedjoin_ex (threadid=140735360579328, thread_return=0x0, abstime=0x0, block=<optimized out>) at pthread_join_common.c:89
#1  0x00007fffed113c3b in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.91.03
#2  0x00007fffed14e14a in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.460.91.03
#3  0x00007fffee4fa66c in ?? () from /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#4  0x00007fffee4fae4f in ?? () from /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#5  0x00007ffffffebfe0 in ?? ()
#6  0x00005555579f99b0 in ?? ()
#7  0x00007ffffffec100 in ?? ()
#8  0x00007fffee584009 in ?? () from /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0
#9  0x00007ffffffec100 in ?? ()
#10 0x00007ffff7de8e40 in _dl_close_worker (map=<optimized out>, force=<optimized out>) at dl-close.c:293
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I’m on Ubuntu 18.04, CUDA 10.1, OptiX 7.2, driver 460.91.03, with an RTX 2070 Max-Q.

If I delay the call to cuImportExternalMemory() with a short sleep() just before it, the application exits successfully. Note that this is completely reliable, not a heisenbug: as far as I can tell after a fair amount of experimentation, it is 100% dependent on whether cuImportExternalMemory() is called first on the second thread (hang on exit) or vkQueuePresentKHR() is called first on the main thread (clean exit).

It also seems to be only the first vkQueuePresentKHR() that matters: in the case where it hangs on exit, application behaviour while running is completely normal. I can resize the window, destroying and reallocating the VkImage and VkBuffer and reimporting the buffer on the CUDA side, with no validation errors and no errors reported by CUDA (I check every single CUresult).
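For completeness, the checking is just a thin wrapper along these lines; CU_CHECK is my placeholder name here, using cuGetErrorString() from the driver API:

#include <cuda.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a driver API call fails.
#define CU_CHECK(call)                                             \
    do {                                                           \
        CUresult res_ = (call);                                    \
        if (res_ != CUDA_SUCCESS) {                                \
            const char* msg = nullptr;                             \
            cuGetErrorString(res_, &msg);                          \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",       \
                    (int)res_, msg ? msg : "unknown",              \
                    __FILE__, __LINE__);                           \
            abort();                                               \
        }                                                          \
    } while (0)

// Usage: CU_CHECK(cuImportExternalMemory(&extMem, &memDesc));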

Assuming I’m correct about the cause here, I can work around this by stalling the OptiX thread until I’ve done the first queue present on the main thread, but I’d really like to understand what’s happening so I don’t inadvertently trigger the bug again.
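The workaround is just a gate at the top of the OptiX thread. A minimal sketch, with placeholder names (firstPresentDone etc.) for illustration:

#include <condition_variable>
#include <mutex>

std::mutex presentMutex;
std::condition_variable presentCv;
bool firstPresentDone = false;

// Main thread: call right after the first successful vkQueuePresentKHR().
void signalFirstPresent() {
    {
        std::lock_guard<std::mutex> lock(presentMutex);
        firstPresentDone = true;
    }
    presentCv.notify_all();
}

// OptiX thread: call just before the first cuImportExternalMemory().
void waitForFirstPresent() {
    std::unique_lock<std::mutex> lock(presentMutex);
    presentCv.wait(lock, [] { return firstPresentDone; });
}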