Why does cudaMemcpyAsync need to acquire a lock?

My CUDA version is 12.4. I have two host threads, each working in its own CUDA stream.
The first thread is stuck acquiring a lock while calling cudaMemcpyAsync.
The second thread is running an LLM forward pass. Part of each thread's call stack in gdb is shown below.

It appears to me that these two threads are independent of each other and should be able to run in parallel, because they are working with different memory addresses at this point. What is this lock used for?

Please do not post pictures of code, console output, etc.; paste the text directly.

The first thread appears to be in cudaLaunchKernel, not cudaMemcpyAsync.

The simple answer is given by the CUDA documentation:

Any CUDA API call may block or synchronize for various reasons such as contention for or unavailability of internal resources.
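
If you want to check how much this affects your own setup, here is a minimal sketch (my own, not taken from the original poster's code) of the scenario described: one host thread issues cudaMemcpyAsync on its own stream while a second host thread launches kernels on a different stream. It only records the worst host-side latency of a cudaMemcpyAsync call. The names copy_worker, launch_worker and add_one_kernel are hypothetical, and the iteration counts and buffer sizes are arbitrary assumptions. A long worst-case latency shows that the call blocked on the host, but it does not tell you which internal resource was contended.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <cuda_runtime.h>

// Abort on any CUDA runtime error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(1);                                            \
        }                                                            \
    } while (0)

// Trivial kernel standing in for the second thread's real work.
__global__ void add_one_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Hypothetical worker: issues async host-to-device copies on its own stream
// and returns the worst host-side duration of a single call, in microseconds.
double copy_worker(int iters, size_t bytes) {
    cudaStream_t stream;
    void *host = nullptr, *dev = nullptr;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocHost(&host, bytes));  // pinned memory, so the copy can be asynchronous
    CHECK(cudaMalloc(&dev, bytes));
    std::memset(host, 0, bytes);

    double worst_us = 0.0;
    for (int i = 0; i < iters; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream));
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > worst_us) worst_us = us;
    }

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    CHECK(cudaStreamDestroy(stream));
    return worst_us;
}

// Hypothetical worker: launches kernels back to back on a separate stream.
void launch_worker(int iters, int n) {
    cudaStream_t stream;
    float *dev = nullptr;
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CHECK(cudaMemsetAsync(dev, 0, n * sizeof(float), stream));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int i = 0; i < iters; ++i) {
        add_one_kernel<<<blocks, threads, 0, stream>>>(dev, n);
    }
    CHECK(cudaGetLastError());

    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaFree(dev));
    CHECK(cudaStreamDestroy(stream));
}

int main() {
    CHECK(cudaFree(0));  // create the context before the threads start

    double worst_us = 0.0;
    std::thread copier([&] { worst_us = copy_worker(1000, 1 << 20); });
    std::thread launcher([] { launch_worker(1000, 1 << 20); });
    copier.join();
    launcher.join();

    std::printf("worst host-side cudaMemcpyAsync latency: %.1f us\n", worst_us);
    return 0;
}
```

Build with something like nvcc -o two_streams two_streams.cu (the file name is just an example). Running it once with the launcher thread and once without gives a rough idea of whether concurrent kernel launches from another thread lengthen the host-side cost of cudaMemcpyAsync on your system.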