cuCtxDestroy causes access violation when using Intel TBB

We are using CUDA 8 GA2 and the Intel TBB 2019 memory allocator together in our 64-bit C++ application for Windows 10, using the Visual C++ compiler of Visual Studio 2017 (version 15.8.5).

We found the combination of CUDA and TBB malloc to be problematic: the application crashes with an access violation in TBB malloc whenever we destroy CUDA contexts. If we do not explicitly destroy the contexts by calling cuCtxDestroy but let the driver do the cleanup after our application has terminated, everything works smoothly. Likewise, if we do not link TBB malloc and destroy the contexts, the application shuts down without errors. Only the combination of the two triggers the access violation, which TBB malloc reports as a bad allocation.

This was not a problem with Visual Studio 2013, where we used CUDA 8 and TBB malloc together without issues. Does anybody know what the cause of this access violation could be? As already mentioned, the issue only appears when calling cuCtxDestroy. Since we only destroy CUDA contexts during shutdown of our application, we simply omit this call for the time being and hope that the driver will clean up after us … however, it would be nice to have a proper application shutdown again in the long run.

It seems like we could fix this issue ourselves by enabling the Windows debug heap allocator in Visual Studio: Tools | Options | Debugging | General | Enable Windows debug heap allocator (Native only).

Doesn’t the debug heap come with some performance and memory overhead?

The differences in the heap data structures are one of the reasons one should never mix debug and release DLLs and object files in the same binary.

It may be worth checking whether your runtime library flags (/MT or /MD vs. /MTd or /MDd) are consistent throughout your entire project and its dependencies. The same applies to the STL iterator debug level setting (_ITERATOR_DEBUG_LEVEL).
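If it helps, here is a minimal sketch of how such a check could look for modules you build yourself. It only records the MSVC predefined macros _DEBUG, _DLL and _ITERATOR_DEBUG_LEVEL of the translation unit it is compiled into. The names BuildFlavor, current_build_flavor, describe and same_flavor are made up for this example and are not part of CUDA, TBB or the CRT.

```cpp
#include <string>

// Build settings of the translation unit that includes this code.
// The macros are MSVC-specific; on other compilers the fields fall back to
// false / -1 and carry no meaning.
struct BuildFlavor {
    bool debug_crt;            // _DEBUG is defined for /MTd and /MDd
    bool dll_crt;              // _DLL is defined for /MD and /MDd
    int  iterator_debug_level; // _ITERATOR_DEBUG_LEVEL, or -1 if undefined
};

// Hypothetical helper: captures the settings at compile time.
inline BuildFlavor current_build_flavor() {
    BuildFlavor f{};
#ifdef _DEBUG
    f.debug_crt = true;
#else
    f.debug_crt = false;
#endif
#ifdef _DLL
    f.dll_crt = true;
#else
    f.dll_crt = false;
#endif
#ifdef _ITERATOR_DEBUG_LEVEL
    f.iterator_debug_level = _ITERATOR_DEBUG_LEVEL;
#else
    f.iterator_debug_level = -1;
#endif
    return f;
}

// Hypothetical helper: renders the settings as the matching compiler flag.
inline std::string describe(const BuildFlavor& f) {
    std::string s = f.dll_crt ? "/MD" : "/MT";
    if (f.debug_crt) s += "d";
    s += ", _ITERATOR_DEBUG_LEVEL=";
    s += (f.iterator_debug_level < 0) ? std::string("n/a")
                                      : std::to_string(f.iterator_debug_level);
    return s;
}

// Hypothetical helper: true if two modules were built with the same settings.
inline bool same_flavor(const BuildFlavor& a, const BuildFlavor& b) {
    return a.debug_crt == b.debug_crt &&
           a.dll_crt == b.dll_crt &&
           a.iterator_debug_level == b.iterator_debug_level;
}
```

One way to use this: each of your own DLLs exports a small function that returns current_build_flavor(), and the host executable compares the results with same_flavor at startup and logs describe() on a mismatch. This does not work for prebuilt binaries such as the TBB malloc or CUDA DLLs; for those, dumpbin /dependents shows whether a module imports the release or the debug CRT DLLs (for example vcruntime140.dll vs. vcruntime140d.dll).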

I forgot to mention that the access violation only appeared when running the application under the debugger. Any additional penalty of the debug heap is quite tolerable, as the application is much slower during debugging anyway.