Performance when using decode and encode together

I am seeing odd behavior. I am running two processes, one decoding and one encoding, and running them together makes decoding take much longer: with decode alone it takes around 7 ms, but when both run together it can take 30 ms.
To make it more interesting, I am running these processes on several machines. Some behave as described, but others behave well (decode and encode run together without any problems).
Do you have an explanation for that?

Edit: I did some research, and most of the time is spent on the copy from host to device (function cuMemcpy2DAsync). Do you have suggestions for how to improve the performance?

Use only hardware platforms with the fastest interconnect currently available, which is at present PCIe gen 4 x16. If you can control the partitioning of data transfers, move the data in as few transfers as possible, because each transfer across PCIe carries a fixed overhead. In other words, sending 8 transfers of 256 KB each is going to be less efficient than sending 2 transfers of 1 MB each.
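To illustrate the fixed-overhead argument, here is a rough sketch of a simple cost model in which each transfer costs a constant overhead plus size divided by bandwidth. The function name, the 10 microsecond overhead, and the 25 GB/s sustained rate are placeholder assumptions for illustration, not measurements; substitute numbers from your own system.

```python
KIB = 1024
MIB = 1024 * KIB

def transfer_time_s(num_transfers, bytes_per_transfer,
                    overhead_s=10e-6, bandwidth_bytes_per_s=25e9):
    """Estimated total time for a batch of equally sized transfers.

    overhead_s and bandwidth_bytes_per_s are hypothetical placeholders,
    not measured values.
    """
    per_transfer = overhead_s + bytes_per_transfer / bandwidth_bytes_per_s
    return num_transfers * per_transfer

# Same 2 MiB payload, partitioned two different ways.
many_small = transfer_time_s(8, 256 * KIB)
few_large = transfer_time_s(2, 1 * MIB)

print(f"8 x 256 KiB: {many_small * 1e6:.1f} us")
print(f"2 x 1 MiB:   {few_large * 1e6:.1f} us")
```

Under this model the payload term is identical in both cases, so the whole difference is the six extra per-transfer overheads.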

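Copy from pinned (page-locked) host memory, and do the pinning beforehand, for example by allocating the buffer with cuMemHostAlloc or registering an existing one with cuMemHostRegister once at startup rather than per transfer.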

Compare the achieved data rate to the theoretical maximum rate of PCIe. (If you are far below, something was not configured/optimized correctly.)
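As a rough sketch of that comparison: the theoretical per-direction rate follows from the per-lane signaling rate, the lane count, and the 128b/130b line encoding used by PCIe gen 3 and gen 4. The function names, the frame size, and the 1 ms copy time below are hypothetical placeholders; plug in your own measured copy time for cuMemcpy2DAsync. Real transfers will not reach the raw link rate because of packet overhead, so expect the achievable peak to be somewhat lower.

```python
def pcie_theoretical_bytes_per_s(gt_per_s, lanes, encoding=128 / 130):
    """Raw per-direction link rate in bytes/s.

    gt_per_s: per-lane signaling rate in GT/s (8 for gen 3, 16 for gen 4).
    encoding: line-code efficiency (128b/130b for gen 3 and gen 4).
    """
    return gt_per_s * 1e9 * encoding / 8 * lanes

def achieved_fraction(bytes_copied, seconds, peak_bytes_per_s):
    """Fraction of the theoretical peak that a measured copy achieved."""
    return (bytes_copied / seconds) / peak_bytes_per_s

gen4_x16 = pcie_theoretical_bytes_per_s(16, 16)
gen3_x16 = pcie_theoretical_bytes_per_s(8, 16)

# Hypothetical example: one 1920x1080 NV12 frame (1.5 bytes per pixel)
# copied in a placeholder 1 ms. Replace both with your own values.
frame_bytes = 1920 * 1080 * 3 // 2
copy_seconds = 1e-3

print(f"gen 4 x16 peak: {gen4_x16 / 1e9:.2f} GB/s")
print(f"gen 3 x16 peak: {gen3_x16 / 1e9:.2f} GB/s")
print(f"achieved: {achieved_fraction(frame_bytes, copy_seconds, gen4_x16):.1%} of gen 4 x16")
```

If the machines that behave well and the ones that do not report very different fractions here, that difference is a reasonable place to start looking (for example, link generation or width).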