Why is cudaMemcpyAsync waiting?

Dear All,

I am transferring data (images) via RDMA into pinned CPU memory and then copying it to the GPU with cudaMemcpyAsync.
I can successfully transfer 1000 images and more, but occasionally, at random, there are issues.
nvvp shows a strange, unexpected cudaMemcpyAsync lock and even overlapping Memcpy (HtoD) operations. Is this an artefact (of the huge log file), or am I missing something?
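
For context, the transfer pattern described above (data landing in pinned host memory, then cudaMemcpyAsync to the GPU) can be sketched roughly as follows. This is only a minimal illustration, not the actual application code: the image size, the number of buffers, the names (`kImageBytes`, `kNumBuffers`, `hostBuf`, `devBuf`), and the use of `memset` as a stand-in for the RDMA write are all assumptions.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper: abort main() on the first CUDA error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

int main() {
    const size_t kImageBytes = 1920 * 1080;  // assumed 8-bit image size
    const int kNumBuffers = 2;               // assumed double buffering
    const int kNumImages = 1000;

    unsigned char* hostBuf[kNumBuffers];
    unsigned char* devBuf[kNumBuffers];
    cudaStream_t stream[kNumBuffers];
    cudaEvent_t copied[kNumBuffers];

    for (int i = 0; i < kNumBuffers; ++i) {
        // Page-locked host memory: required for the copy to be truly async.
        CUDA_CHECK(cudaHostAlloc((void**)&hostBuf[i], kImageBytes,
                                 cudaHostAllocDefault));
        CUDA_CHECK(cudaMalloc((void**)&devBuf[i], kImageBytes));
        // Non-default streams, so copies do not synchronize with stream 0.
        CUDA_CHECK(cudaStreamCreateWithFlags(&stream[i],
                                             cudaStreamNonBlocking));
        CUDA_CHECK(cudaEventCreateWithFlags(&copied[i],
                                            cudaEventDisableTiming));
    }

    for (int img = 0; img < kNumImages; ++img) {
        const int b = img % kNumBuffers;

        // Do not overwrite a host buffer whose previous copy is still in
        // flight. Returns immediately if the event was never recorded.
        CUDA_CHECK(cudaEventSynchronize(copied[b]));

        // Placeholder for the RDMA write into the pinned buffer.
        memset(hostBuf[b], img & 0xFF, kImageBytes);

        CUDA_CHECK(cudaMemcpyAsync(devBuf[b], hostBuf[b], kImageBytes,
                                   cudaMemcpyHostToDevice, stream[b]));
        CUDA_CHECK(cudaEventRecord(copied[b], stream[b]));
    }

    CUDA_CHECK(cudaDeviceSynchronize());

    for (int i = 0; i < kNumBuffers; ++i) {
        CUDA_CHECK(cudaEventDestroy(copied[i]));
        CUDA_CHECK(cudaStreamDestroy(stream[i]));
        CUDA_CHECK(cudaFree(devBuf[i]));
        CUDA_CHECK(cudaFreeHost(hostBuf[i]));
    }
    return 0;
}
```

In this sketch each buffer has its own stream and event, so a copy from one buffer can be queued while the other buffer is being refilled. If the real pinned region is allocated by the RDMA stack rather than by CUDA, it would presumably need to be registered with `cudaHostRegister` instead of being allocated with `cudaHostAlloc`.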

Is there any tool to inspect what is happening in the driver/GPU?

See the nvvp images (PNG) below.