Drive PX 2: Improve the performance of cudaMemcpy HtoD

Hi All,
I am currently stuck on a host-to-device memory transfer performance bottleneck in a computer vision (CV) algorithm.

My scenario:
The CV algorithm works on a single-channel image of double data type with a resolution of 1152x640. A considerable part of the algorithm has been moved to CUDA. Kernel performance is satisfactory, but the transfer of the image (a 2D double array) from host to device for the kernel operations is the second biggest hotspot in the algorithm. A trimmed nvprof profile report is as follows:

Start - Duration - Size - Throughput - Device - Context - Stream - Name
777.36ms - 5.5329ms - 5.1534MB - 931.41MB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
778.48ms - 5.4373ms - 5.0000MB - 919.57MB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
865.49ms - 3.2149ms - 5.0000MB - 1.5188GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
886.60ms - 3.3863ms - 5.1534MB - 1.4862GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.11558s - 3.2151ms - 5.0000MB - 1.5187GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.13646s - 3.3834ms - 5.1534MB - 1.4875GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]
1.27230s - 3.2139ms - 5.0000MB - 1.5193GB/s - GP106 (0) - 1 - 17 - [CUDA memcpy HtoD]
1.29281s - 3.3815ms - 5.1534MB - 1.4883GB/s - GP106 (0) - 1 - 14 - [CUDA memcpy HtoD]

I am copying around 5 MB of data from host to device for every frame processed, at a bandwidth of about 1.5 GB/s.
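For reference, the per-copy bandwidth can also be checked outside the application with a small standalone test along these lines (a minimal sketch, not taken from my code; the buffer size and variable names are only illustrative):

    // Minimal standalone sketch: time one pinned ~5 MB HtoD copy with CUDA events.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 5 * 1024 * 1024;   // ~5 MB, roughly one frame

        double *h_buf = nullptr, *d_buf = nullptr;
        cudaMallocHost((void **)&h_buf, bytes); // pinned host memory
        cudaMalloc((void **)&d_buf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Time a single HtoD copy
        cudaEventRecord(start);
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("HtoD: %.3f ms, %.2f GB/s\n", ms, bytes / (ms * 1.0e6));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_buf);
        cudaFreeHost(h_buf);
        return 0;
    }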

Hardware details:
Board Series : Drive PX 2
Board Configuration : AutoChauffeur
GPU type used for CUDA : Discrete GPU
Device used for CUDA : GP106 (id: 0)

Code implementation details (a minimal sketch of this setup follows the list):
Device memory allocation : cudaMalloc()
Host memory allocation : cudaMallocHost()
Copy mechanism : cudaMemcpy2DAsync() with a non-default stream
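To make the setup concrete, here is a stripped-down sketch of the allocation and copy path described above (the error-check macro and variable names are illustrative; the real application does the CV work around this):

    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                                      \
        do {                                                                 \
            cudaError_t err_ = (call);                                       \
            if (err_ != cudaSuccess) {                                       \
                fprintf(stderr, "CUDA error %s at %s:%d\n",                  \
                        cudaGetErrorString(err_), __FILE__, __LINE__);       \
                exit(EXIT_FAILURE);                                          \
            }                                                                \
        } while (0)

    int main() {
        const int width = 1152, height = 640;
        const size_t rowBytes = width * sizeof(double);

        // Host frame buffer: pinned allocation via cudaMallocHost()
        double *h_img = nullptr;
        CHECK(cudaMallocHost((void **)&h_img, rowBytes * height));

        // Device frame buffer: cudaMalloc(), so source and destination
        // pitch are both equal to the row width in bytes
        double *d_img = nullptr;
        CHECK(cudaMalloc((void **)&d_img, rowBytes * height));

        // Non-default stream used for the asynchronous 2D copy
        cudaStream_t stream;
        CHECK(cudaStreamCreate(&stream));

        // Per-frame HtoD transfer, as seen in the nvprof trace
        CHECK(cudaMemcpy2DAsync(d_img, rowBytes,   // dst, dst pitch
                                h_img, rowBytes,   // src, src pitch
                                rowBytes, height,  // row width in bytes, rows
                                cudaMemcpyHostToDevice, stream));
        CHECK(cudaStreamSynchronize(stream));      // kernels would run here instead

        CHECK(cudaStreamDestroy(stream));
        CHECK(cudaFree(d_img));
        CHECK(cudaFreeHost(h_img));
        return 0;
    }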

I would like some insights on how to improve the bandwidth and the overall performance of this copy.
Thanks in advance.