Decoding with FFmpeg + CUDA post-processing

Our application does video processing using CUDA.

I am trying to use the h264_cuvid codec for decoding. When I receive a decoded frame, I call cuMemcpyAsync to start a device-to-device transfer. The source memory was allocated by the h264_cuvid codec in FFmpeg's internal CUDA context; the destination memory was allocated in a CUDA context that I created in my application.
The data seems to be transferred through the host rather than directly from device to device (see the attached Nsight timeline report).
The timeline shows the 900 KB buffer being transferred to the host in Context 3 and then transferred to the device in Context 2.

It seems that CUDA provides cuMemcpyPeerAsync for copying memory between different contexts. However, I can’t find a way to get the internal cuvid context that was used to allocate the memory.

How can I avoid this host transfer?

I have since tested cuMemcpyPeerAsync, and the transfer still goes through an intermediate CPU buffer.
Is it not possible to copy data directly between two contexts?