CUDA blocks all threads when doing a device-to-host cudaMemcpyAsync to a pageable host memory location

The CUDA docs say that async memory copies to a pageable host location behave synchronously, which means that the host thread calling the memcpy will block. However, I'm seeing that all other host threads are also blocked and cannot complete whatever CUDA API call they were in the middle of. For example, a host thread gets stuck in cudaEventDestroy (checked with gdb) while another thread is in a cudaMemcpyAsync to a pageable location. Shouldn’t other threads be able to continue with their “interactions” with the CUDA driver?

As indicated in comments on your cross posting, that expectation is incorrect.

Threads do not have independent, unfettered access to the CUDA API in all cases. From the CUDA documentation:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources. Such behavior is subject to change and undocumented behavior should not be relied upon.

I believe the term “synchronous behaviour” is also not uniformly defined across CUDA APIs. When a thread calls cudaStreamSynchronize, other threads are able to continue with their respective CUDA operations, such as cudaEventDestroy. However, the same is not observed when cudaMemcpyAsync behaves like a synchronous API (due to pageable copies).