My CUDA version is 12.4. I have two host threads, each working in its own CUDA stream.
The first thread is stuck acquiring a lock while calling cudaMemcpyAsync.
The second thread is running an LLM forward pass. Part of the call stack of both threads, as seen in gdb, is shown below.
It appears to me that these two threads are independent of each other and should be able to run in parallel, because they are working on different memory addresses at this point. What is this lock used for?