The description of the “cudaFree” API contains the following note:
Note - This API will not perform any implicit synchronization when the pointer was allocated with cudaMallocAsync or cudaMallocFromPoolAsync. Callers must ensure that all accesses to the pointer have completed before invoking cudaFree. For best performance and memory reuse, users should use cudaFreeAsync to free memory allocated via the stream ordered memory allocator.
The note only says that no implicit synchronization is performed for pointers allocated with cudaMallocAsync or cudaMallocFromPoolAsync; it says nothing explicit about pointers allocated with other APIs such as cudaMalloc. It is also unclear whether the requirement in the next sentence (that all accesses to the pointer must have completed before cudaFree is invoked) applies to every cudaFree call or only to calls on pointers allocated with cudaMallocAsync or cudaMallocFromPoolAsync.
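Until the wording is clarified, one conservative pattern that should be safe under either reading is to synchronize explicitly before calling cudaFree, and to pair cudaMallocAsync with cudaFreeAsync as the note recommends. The following is only a minimal sketch of that pattern, not something taken from the documentation: the kernel `touch`, the helper `run_both_patterns`, and the `CHECK` macro are hypothetical names made up for illustration, and it assumes CUDA 11.2 or later on a device that supports the stream-ordered allocator.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper macro: report a CUDA error and return 1 from the
// enclosing function. Resources are not released on the error path, which
// is acceptable for a short sketch but not for production code.
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return 1;                                             \
        }                                                         \
    } while (0)

// Hypothetical kernel, present only so that something accesses the buffers.
__global__ void touch(int *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = i;
}

// Returns 0 on success, 1 on any CUDA error.
int run_both_patterns(int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Case 1: cudaMalloc + cudaFree. The explicit synchronization makes the
    // code independent of whether cudaFree synchronizes implicitly.
    int *a = nullptr;
    CHECK(cudaMalloc((void **)&a, n * sizeof(int)));
    touch<<<blocks, threads, 0, stream>>>(a, n);
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));  // all accesses to a are complete
    CHECK(cudaFree(a));

    // Case 2: cudaMallocAsync + cudaFreeAsync. The free is ordered in the
    // stream after the kernel, so no host-side wait is needed before it.
    int *b = nullptr;
    CHECK(cudaMallocAsync((void **)&b, n * sizeof(int), stream));
    touch<<<blocks, threads, 0, stream>>>(b, n);
    CHECK(cudaGetLastError());
    CHECK(cudaFreeAsync(b, stream));

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}

int main() {
    return run_both_patterns(1 << 20);
}
```

The explicit cudaStreamSynchronize in case 1 may be redundant if cudaFree does synchronize for cudaMalloc pointers, but it costs little and does not rely on behavior the note leaves unstated.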
A simple test case suggests that cudaFree does synchronize implicitly for pointers allocated with cudaMalloc. Even so, it would be much better for the documentation to describe the API's behavior precisely, especially for such a frequently used API.
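For anyone who wants to try this themselves, here is one way such a test could look. It is a sketch under assumptions, not necessarily the test used above: the kernel `spin` and the helpers `time_free_ms` and `time_sync_ms` are hypothetical names, and the cycle count is an arbitrary value chosen so that the kernel runs for a noticeable time. The idea is to launch a long-running kernel that writes to a cudaMalloc buffer, call cudaFree immediately afterwards, and compare how long the host blocks in cudaFree with how long it blocks in cudaDeviceSynchronize after the same kernel.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: busy-wait for roughly `cycles` GPU clock cycles,
// then write to the buffer that the host is about to free.
__global__ void spin(int *p, long long cycles) {
    long long start = clock64();
    while (clock64() - start < cycles) { }
    p[0] = 1;
}

// Milliseconds elapsed on the host since t0.
static double elapsed_ms(std::chrono::steady_clock::time_point t0) {
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Time cudaFree called right after the kernel launch.
// Returns the elapsed host time in ms, or -1.0 on a CUDA error.
double time_free_ms(long long cycles) {
    int *p = nullptr;
    if (cudaMalloc((void **)&p, sizeof(int)) != cudaSuccess) return -1.0;
    spin<<<1, 1>>>(p, cycles);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(p); return -1.0; }
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(p);  // no explicit synchronization before this
    double ms = elapsed_ms(t0);
    return err == cudaSuccess ? ms : -1.0;
}

// Baseline: time cudaDeviceSynchronize after the same kernel.
// Returns the elapsed host time in ms, or -1.0 on a CUDA error.
double time_sync_ms(long long cycles) {
    int *p = nullptr;
    if (cudaMalloc((void **)&p, sizeof(int)) != cudaSuccess) return -1.0;
    spin<<<1, 1>>>(p, cycles);
    if (cudaGetLastError() != cudaSuccess) { cudaFree(p); return -1.0; }
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaDeviceSynchronize();
    double ms = elapsed_ms(t0);
    cudaFree(p);
    return err == cudaSuccess ? ms : -1.0;
}

int main() {
    const long long cycles = 500000000LL;  // arbitrary; long enough to measure
    time_free_ms(1000);                    // warm-up: context and module load
    double sync_ms = time_sync_ms(cycles);
    double free_ms = time_free_ms(cycles);
    printf("cudaDeviceSynchronize blocked for %.2f ms\n", sync_ms);
    printf("cudaFree blocked for              %.2f ms\n", free_ms);
    return (sync_ms < 0.0 || free_ms < 0.0) ? 1 : 0;
}
```

If cudaFree blocks for about as long as cudaDeviceSynchronize, it waited for the kernel; if it returns almost immediately, it did not. Host-side timing like this is only indirect evidence of the behavior on one driver and GPU, which is exactly why an explicit statement in the documentation would be preferable.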