Hello.
I am using TRT for classifying MNIST images. I built a network for a 256x256x3 input. (The network works and gives correct predictions, but this question is not about the network; it is about inference.)
My workflow is: Get image → allocate GPU memory (cudaMallocManaged, only once) → preprocess (copying of the image etc.) → copy to GPU (cudaMemcpy) → run execute on the TRT execution context.
The time for this execute is (please note that only _trtExecutionContext->execute(1, aInputOutputBuffer.data()) is measured):
Total time of execution was: 25170.8ms Time per image was: 1.67828ms FPS: 595.85
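For reference, a minimal sketch of this first workflow, assuming hypothetical names (gInput, kInputBytes, runOneImage) rather than my actual code:

```cpp
#include <cuda_runtime.h>

// Workflow 1 sketch (names are assumptions, not the real code).
// One-time setup: allocate a managed buffer visible to host and device.
float* gInput = nullptr;
const size_t kInputBytes = 256 * 256 * 3 * sizeof(float);

void setupOnce() {
    cudaMallocManaged(&gInput, kInputBytes);
}

// Per image: preprocess on the host, copy into the GPU buffer, run inference.
void runOneImage(const float* hostImage) {
    // preprocess(hostImage);                     // CPU-side preprocessing
    cudaMemcpy(gInput, hostImage, kInputBytes, cudaMemcpyHostToDevice);
    // _trtExecutionContext->execute(1, aInputOutputBuffer.data());  // timed call
}
```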
Now I tried to optimize that workflow so that I would not allocate memory for every image operation and could avoid copying this memory to the GPU. So I did this:
My workflow is: Get image into host-allocated memory → preprocess this image in place, without copying → make the host memory directly accessible from the GPU (cudaHostRegister, only once) → run execute on the TRT execution context.
(Using cudaHostRegister(buffer, size, cudaHostRegisterMapped).)
The predictions are the same; the GPU is getting the right image, etc.
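A minimal sketch of this second workflow, again with assumed names (gHostInput, gDevAlias) rather than my actual code:

```cpp
#include <cuda_runtime.h>
#include <cstdlib>

// Workflow 2 sketch (names are assumptions). The host buffer is page-locked
// and mapped once, so the GPU reads it over PCIe without an explicit copy.
float* gHostInput = nullptr;   // host-allocated image buffer
float* gDevAlias  = nullptr;   // device pointer aliasing the same memory
const size_t kInputBytes = 256 * 256 * 3 * sizeof(float);

void setupOnce() {
    gHostInput = static_cast<float*>(std::malloc(kInputBytes));
    // Pin the existing host allocation and map it into the GPU address space.
    cudaHostRegister(gHostInput, kInputBytes, cudaHostRegisterMapped);
    cudaHostGetDevicePointer(reinterpret_cast<void**>(&gDevAlias),
                             gHostInput, 0);
}

// Per image: preprocess gHostInput in place; no cudaMemcpy before execute().
// Every load the kernel issues from gDevAlias now travels over PCIe.
```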
But the time is (please note that only _trtExecutionContext->execute(1, aInputOutputBuffer.data()) is measured):
Total time of execution was: 63787.7ms Time per image was: 4.25308ms FPS: 235.124
Why is the execution time higher without the copy to the GPU than in the first case? Is there some overhead from PCIe communication?
Times for a color Full HD image:
First Workflow Execution:
Total time of execution was: 87492.6ms Time per image was: 5.83362ms FPS: 171.42
Second Workflow Execution:
Total time of execution was: 64990.3ms Time per image was: 4.33327ms FPS: 230.773
Times for a 1024×512 color image:
First Workflow Execution:
Total time of execution was: 41676.9ms Time per image was: 2.77883ms FPS: 359.864
Second Workflow Execution:
Total time of execution was: 63770.6ms Time per image was: 4.25194ms FPS: 235.187
So apparently there is some overhead when using DMA on small chunks of memory. Is there a way to calculate this overhead? I have never worked with DMA, so I would appreciate any guidance.
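One way to estimate the fixed per-transfer cost is to time host-to-device copies of increasing size and fit t(n) ≈ t0 + n / B, where t0 is the constant overhead and B the sustained bandwidth. A hypothetical micro-benchmark using CUDA events (not part of my code):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical micro-benchmark: time host-to-device copies of growing size.
// Fitting t(n) ~ t0 + n / B gives the constant overhead t0 and bandwidth B.
int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t n = 1 << 12; n <= (size_t{1} << 26); n <<= 2) {
        void* host = std::malloc(n);
        void* dev  = nullptr;
        cudaMalloc(&dev, n);

        cudaEventRecord(start);
        cudaMemcpy(dev, host, n, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        std::printf("%zu bytes: %.3f ms (%.2f GB/s)\n",
                    n, ms, n / (ms * 1e6));

        cudaFree(dev);
        std::free(host);
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

(For a cleaner fit, each size should be timed several times and averaged, since the first transfer often pays extra initialization cost.)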
Thank you.
Edit:
I measured a 2560×1440×3 image as well and put all the results in a graph:
So I think that DMA has some constant overhead and should not be used when working with data smaller than ??? bytes. Or am I missing some point? Is there something I do not know about that could improve speed on smaller images as well?
Edit #2:
I calculated that cudaHostRegister is the same or faster when working with 15361200 bytes or more. Otherwise it is slower. Is there a way to improve it?