cuMemcpy2DAsync function block in FFmpeg with multiple gpus

I use FFmpeg5.1 in a linux server with four A10 gpus.
My machine config as follows:
image
without nvlink

Docker bases on nvidia/cuda:11.3.1-cudnn8-devel-ubuntu18.04

My work flow is to use one gpu (GPU0) for AI inference and then transfer to other gpu (GPU1 and GPU2) for post process and hardware encode. I copy the result of GPU0 to host and GPU1/2 copy the memory from host. Four threads work on GPU1 and four threads work on GPU2

Here is an occasional thread block in FFmpeg when I transfer memory from host to GPU2. The blocking stacks :

My problem is:

  1. Is it possible that the block is caused by the input param of cuMemcpy2DAsync
  2. Form the stack , it seems like one GPU2 thread calls sched_yield function but forget to release a rw pthread lock , so other GPU2 threads block as well. Dose the driver 470.57.02 in A10 has the problem in some corner case?
  3. Do you have tools to help me finding the root cause of the block?

In another test,i find more information to proof dead lock