TensorRT: Why is execute slow when cudaHostRegister is used?

Hello.

I am using TRT to classify MNIST images. I built a network for 256x256x3 input. (The network works and gives correct predictions, but this question is not about the network; it is about inference.)

My workflow is: Get image → (only once) allocate GPU memory (cudaMallocManaged) → preprocess (copying the image, etc.) → copy to GPU (cudaMemcpy) → run execute on TRTExecutionContext.

The time for this execute is (please note that only _trtExecutionContext->execute(1, aInputOutputBuffer.data()) is measured):

Total time of execution was: 25170.8ms Time per image was: 1.67828ms FPS: 595.85

I then tried to optimize this workflow so that I would not allocate memory for every graphics operation and would avoid copying that memory to the GPU. So I did this:

My workflow is: Get image into host-allocated memory → preprocess this image without copying → (only once) make the host-allocated memory directly accessible from the GPU (cudaHostRegister) → run execute on TRTExecutionContext.

The registration call is cudaHostRegister(buffer, size, cudaHostRegisterMapped).

The predictions are the same, so the GPU is getting the right image, etc.

But the time is (please note that only _trtExecutionContext->execute(1, aInputOutputBuffer.data()) is measured):

Total time of execution was: 63787.7ms Time per image was: 4.25308ms FPS: 235.124

Why is the execution time without copying memory to the GPU higher than in the first case? Is there some overhead from PCIe communication?

Color FullHD image times:

First Workflow Execution:
Total time of execution was: 87492.6ms Time per image was: 5.83362ms FPS: 171.42

Second Workflow Execution:
Total time of execution was: 64990.3ms Time per image was: 4.33327ms FPS: 230.773

Times for 1024*512 color image:

First Workflow Execution:
Total time of execution was: 41676.9ms Time per image was: 2.77883ms FPS: 359.864

Second Workflow Execution:
Total time of execution was: 63770.6ms Time per image was: 4.25194ms FPS: 235.187

So apparently there is some overhead from using DMA on small chunks of memory. Is there a way to calculate this overhead? I have never worked with DMA, so I would appreciate any guidance.

Thank you.

Edit:

I measured a 2560 * 1440 * 3 image as well. I put all results in a graph:

So I think that DMA has some constant overhead and should not be used when working with data smaller than ??? bytes. Or am I missing something? Is there something I do not know about that could improve speed on smaller images as well?

Edit #2:

I calculated that cudaHostRegister is as fast or faster when working with 15361200 or more bytes. Otherwise it is slower. Is there a way to improve it?

Moving to CUDA Programming and Performance forum.