Pinned memory implementation Increases Overall Execution time Drastically - Drive Px2

krishnan.purusothaman · February 28, 2018, 9:38am

Hi All,

My code follows these steps:

Allocate memory during initialization (host)
Fill data into the allocated buffer (host)
Transfer the data to the device and run the kernel
Use the host buffer on the CPU side

I ran the same code on two different hardware platforms.
The results are as follows:

Device: Quadro M1000M (compute capability 5.0)

I implemented both the pinned (page-locked) memory and the mapped memory methods. The timing difference between them is negligible.

Allocate memory during initialization (host) - done only once, so I didn't profile this part
Fill data into the allocated buffer (host) - 1 ms (1024 x 640 buffer)
Transfer the data to the device and run the kernel - 1 ms for the transfer
Use the host buffer on the CPU side - 3 ms

Device: Drive PX2 (dGPU, compute capability 6.1)

I implemented both the pinned (page-locked) memory and the mapped memory methods. The timing difference between them is negligible.

Allocate memory during initialization (host) - done only once, so I didn't profile this part
Fill data into the allocated buffer (host) - 3 ms (1024 x 640 buffer)
Transfer the data to the device and run the kernel - 1 ms for the transfer
Use the host buffer on the CPU side - 54 ms

The same pinned-memory code drastically increases the overall CPU run time on the Drive PX2.

Note: The CUDA device flag for mapped memory is set during initialization.

Please let me know what is missing from my implementation.
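
For readers following along, the workflow described in this post might look roughly like the sketch below. It assumes the device flag mentioned above is `cudaDeviceMapHost` and that the buffer comes from `cudaHostAlloc` with `cudaHostAllocMapped`. The kernel `halve`, the helper `run_mapped_demo`, and the 8-bit pixel format are hypothetical stand-ins, not the poster's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of the enclosing function.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return 1; } } while (0)

// Hypothetical kernel: halves every 8-bit pixel in place.
__global__ void halve(unsigned char* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] / 2;
}

int run_mapped_demo() {
    const int n = 1024 * 640;  // buffer size from the post, assumed 1 byte per pixel

    // Assumption: this is the "device flag" from the post. It has to be set
    // before the CUDA context is created.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Step 1: allocate pinned, mapped host memory once during initialization.
    unsigned char* h_buf = nullptr;
    CHECK(cudaHostAlloc((void**)&h_buf, n, cudaHostAllocMapped));

    // Step 2: fill the buffer on the host.
    for (int i = 0; i < n; ++i) h_buf[i] = (unsigned char)(i & 0xFF);

    // Step 3: the kernel works on the same memory through a device alias,
    // so no explicit copy is issued.
    unsigned char* d_alias = nullptr;
    CHECK(cudaHostGetDevicePointer((void**)&d_alias, h_buf, 0));
    halve<<<(n + 255) / 256, 256>>>(d_alias, n);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Step 4: use the host buffer on the CPU. This is the phase that was
    // reported as slow on the Drive PX2.
    for (int i = 0; i < n; ++i) {
        if (h_buf[i] != (unsigned char)((i & 0xFF) / 2)) return 1;
    }

    CHECK(cudaFreeHost(h_buf));
    return 0;
}

int main() {
    int rc = run_mapped_demo();
    printf(rc == 0 ? "mapped demo ok\n" : "mapped demo failed\n");
    return rc;
}
```

Timing steps 2 and 4 separately on each platform would reproduce the comparison given above.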

SivaRamaKrishnaNV · February 28, 2018, 12:14pm

Dear krishnan.purusothaman,
On Tegra devices without I/O coherence, pinned memory is not cached on the CPU, so you will notice a drastic increase in CPU time.

krishnan.purusothaman · February 28, 2018, 12:19pm

Thanks for the reply. This means we can't manipulate pinned (page-locked) buffers on the CPU side. Is there any way to enable the cache, for example with a compiler option?

SivaRamaKrishnaNV · February 28, 2018, 1:03pm

Dear krishnan.purusothaman,
The CPU cache is disabled for pinned memory to give a coherent view of the data on both the CPU and the GPU. It is a design decision. The next version of Tegra (Xavier) has cached pinned memory because it has I/O coherence.
For your current use case, you can use unified memory.
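
A minimal sketch of the unified memory suggestion, assuming the buffer is allocated with `cudaMallocManaged`. The kernel `invert` and the helper `run_managed_demo` are hypothetical names used only for illustration, and whether this brings the CPU-side time back down on a particular Tegra platform is something to profile rather than assume.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of the enclosing function.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return 1; } } while (0)

// Hypothetical kernel: inverts every 8-bit pixel in place.
__global__ void invert(unsigned char* img, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) img[i] = 255 - img[i];
}

int run_managed_demo() {
    const int n = 1024 * 640;  // same buffer size as in the thread

    // One allocation that both the CPU and the GPU can address.
    unsigned char* img = nullptr;
    CHECK(cudaMallocManaged((void**)&img, n));

    // CPU fills the buffer through an ordinary pointer.
    for (int i = 0; i < n; ++i) img[i] = (unsigned char)(i & 0xFF);

    // GPU processes the same pointer; no explicit cudaMemcpy is needed.
    invert<<<(n + 255) / 256, 256>>>(img, n);
    CHECK(cudaGetLastError());

    // Synchronize before the CPU touches the managed buffer again.
    CHECK(cudaDeviceSynchronize());

    // CPU-side use of the result.
    for (int i = 0; i < n; ++i) {
        if (img[i] != (unsigned char)(255 - (i & 0xFF))) return 1;
    }

    CHECK(cudaFree(img));
    return 0;
}

int main() {
    int rc = run_managed_demo();
    printf(rc == 0 ? "managed demo ok\n" : "managed demo failed\n");
    return rc;
}
```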

krishnan.purusothaman · February 28, 2018, 1:18pm

Hi SivaRamaKrishna,

Thanks for the reply. Is there any other way to improve the memory transfer throughput?

I am using cudaMemcpy2DAsync to fill the ROI of the destination.
I also tried using cudaMemcpyAsync instead of cudaMemcpy2DAsync, after increasing the source buffer size to match the destination. When I profiled the code, there was no improvement in run time. Is this behavior expected?

Thanks,
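
For context, filling a region of interest with `cudaMemcpy2DAsync` could look like the sketch below. The ROI size and origin, the pitched destination from `cudaMallocPitch`, and the helper name `run_roi_copy_demo` are all assumptions made for illustration; the thread does not give the actual layout.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of the enclosing function.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return 1; } } while (0)

int run_roi_copy_demo() {
    const size_t imgW = 1024, imgH = 640;  // destination image, 1 byte per pixel
    const size_t roiW = 256, roiH = 128;   // hypothetical ROI size
    const size_t roiX = 100, roiY = 50;    // hypothetical ROI origin

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Tightly packed pinned source holding only the ROI pixels.
    unsigned char* h_roi = nullptr;
    CHECK(cudaMallocHost((void**)&h_roi, roiW * roiH));
    for (size_t i = 0; i < roiW * roiH; ++i) h_roi[i] = 200;

    // Pitched destination image on the device, cleared to zero.
    unsigned char* d_img = nullptr;
    size_t pitch = 0;
    CHECK(cudaMallocPitch((void**)&d_img, &pitch, imgW, imgH));
    CHECK(cudaMemset2DAsync(d_img, pitch, 0, imgW, imgH, stream));

    // Copy roiH rows of roiW bytes into the ROI. The source advances by roiW
    // per row, the destination by the device pitch.
    unsigned char* d_roi = d_img + roiY * pitch + roiX;
    CHECK(cudaMemcpy2DAsync(d_roi, pitch, h_roi, roiW, roiW, roiH,
                            cudaMemcpyHostToDevice, stream));

    // Read the whole image back to confirm where the ROI landed.
    unsigned char* h_img = nullptr;
    CHECK(cudaMallocHost((void**)&h_img, imgW * imgH));
    CHECK(cudaMemcpy2DAsync(h_img, imgW, d_img, pitch, imgW, imgH,
                            cudaMemcpyDeviceToHost, stream));
    CHECK(cudaStreamSynchronize(stream));

    bool inside = h_img[roiY * imgW + roiX] == 200 &&
                  h_img[(roiY + roiH - 1) * imgW + (roiX + roiW - 1)] == 200;
    bool outside = h_img[0] == 0 && h_img[(roiY + roiH) * imgW + roiX] == 0;

    CHECK(cudaFreeHost(h_img));
    CHECK(cudaFreeHost(h_roi));
    CHECK(cudaFree(d_img));
    CHECK(cudaStreamDestroy(stream));
    return (inside && outside) ? 0 : 1;
}

int main() {
    int rc = run_roi_copy_demo();
    printf(rc == 0 ? "roi copy ok\n" : "roi copy failed\n");
    return rc;
}
```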

SivaRamaKrishnaNV · February 28, 2018, 2:16pm

Dear krishnan.purusothaman,
If you use the iGPU, you can try using unified memory. If you are using the dGPU, you can reduce the overall time by using streams to overlap computation with data transfer. Please refer to the Programming Guide in the CUDA Toolkit Documentation.
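
One way to sketch the stream-based overlap suggested here is to split the buffer into chunks and give each chunk its own stream. The chunk count, the kernel `addOne`, and the helper `run_stream_overlap_demo` are hypothetical choices for illustration; how much overlap is actually achieved depends on the device's copy engines and should be confirmed with a profiler.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of the enclosing function.
#define CHECK(call) do { cudaError_t e_ = (call); if (e_ != cudaSuccess) { \
    fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, cudaGetErrorString(e_)); \
    return 1; } } while (0)

// Hypothetical kernel: adds 1 to every byte of its chunk.
__global__ void addOne(unsigned char* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] + 1;
}

int run_stream_overlap_demo() {
    const int n = 1024 * 640;         // same buffer size as in the thread
    const int kStreams = 4;           // hypothetical chunk count
    const int chunk = n / kStreams;   // 163840 bytes, divides evenly

    // Async copies need pinned host memory to overlap with kernels.
    unsigned char* h = nullptr;
    unsigned char* d = nullptr;
    CHECK(cudaMallocHost((void**)&h, n));
    CHECK(cudaMalloc((void**)&d, n));
    for (int i = 0; i < n; ++i) h[i] = (unsigned char)(i % 100);

    cudaStream_t streams[kStreams];
    for (int k = 0; k < kStreams; ++k) CHECK(cudaStreamCreate(&streams[k]));

    // Each stream copies its chunk in, processes it, and copies it back.
    // Work in different streams may run concurrently.
    for (int k = 0; k < kStreams; ++k) {
        int off = k * chunk;
        CHECK(cudaMemcpyAsync(d + off, h + off, chunk,
                              cudaMemcpyHostToDevice, streams[k]));
        addOne<<<(chunk + 255) / 256, 256, 0, streams[k]>>>(d + off, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + off, d + off, chunk,
                              cudaMemcpyDeviceToHost, streams[k]));
    }

    for (int k = 0; k < kStreams; ++k) {
        CHECK(cudaStreamSynchronize(streams[k]));
        CHECK(cudaStreamDestroy(streams[k]));
    }

    // Verify on the CPU that every byte was incremented exactly once.
    int rc = 0;
    for (int i = 0; i < n; ++i) {
        if (h[i] != (unsigned char)(i % 100 + 1)) { rc = 1; break; }
    }

    CHECK(cudaFreeHost(h));
    CHECK(cudaFree(d));
    return rc;
}

int main() {
    int rc = run_stream_overlap_demo();
    printf(rc == 0 ? "stream overlap ok\n" : "stream overlap failed\n");
    return rc;
}
```

Note that the host buffer here is still pinned, so on a platform where pinned memory is uncached the CPU-side fill and verification loops would remain slow; the sketch only addresses transfer and compute overlap.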