Pinned memory implementation increases overall execution time drastically - Drive PX2

Hi All,

My code is implemented as follows:

  1. Allocate memory during initialization (host)
  2. Fill data into the allocated buffer (host)
  3. Transfer the data to the device and run the kernel
  4. Use the host buffer on the CPU side

I ran the same code on two different hardware platforms.
The results are as follows:

Device: Quadro M1000M with compute capability 5.0

I implemented both the pinned (page-locked) memory and the mapped memory methods. The timing difference between them is negligible.

  1. Allocate memory during initialization (host) - done only once, so I didn’t profile this part
  2. Fill data into the allocated buffer (host) - 1 ms (1024 x 640 buffer)
  3. Transfer the data to the device and run the kernel - 1 ms for the transfer
  4. Use the host buffer on the CPU side - 3 ms

Device: Drive PX2 (dGPU) with compute capability 6.1

I implemented both the pinned (page-locked) memory and the mapped memory methods. The timing difference between them is negligible.

  1. Allocate memory during initialization (host) - done only once, so I didn’t profile this part
  2. Fill data into the allocated buffer (host) - 3 ms (1024 x 640 buffer)
  3. Transfer the data to the device and run the kernel - 1 ms for the transfer
  4. Use the host buffer on the CPU side - 54 ms

The same pinned-memory code increases the overall CPU run time drastically on the Drive PX2.

Note: The CUDA device flag for mapped memory is set during initialization.

Please let me know what is missing from my implementation.
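
The original code is not shown in the thread. For readers following along, here is a minimal sketch of what a mapped (zero-copy) pinned allocation usually looks like. It assumes the "device flag" in the note is cudaDeviceMapHost; the helper name allocMapped is hypothetical and not taken from the post.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper (not from the original post).
// Allocates a page-locked host buffer that is also mapped into the device
// address space. Returns the host pointer, or nullptr on failure.
// *devPtr receives the device-side alias of the same memory.
unsigned char* allocMapped(size_t bytes, unsigned char** devPtr) {
    // Assumption: this is the flag the note refers to. It is a hint that
    // must be set before the CUDA context is created; its return value is
    // ignored here because the context may already exist.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    unsigned char* host = nullptr;
    if (cudaHostAlloc((void**)&host, bytes, cudaHostAllocMapped) != cudaSuccess) {
        return nullptr;
    }
    if (cudaHostGetDevicePointer((void**)devPtr, host, 0) != cudaSuccess) {
        cudaFreeHost(host);
        return nullptr;
    }
    return host;
}
```

The host pointer is used for steps 2 and 4, and the device alias is passed to the kernel in step 3. Release the buffer with cudaFreeHost.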

Dear krishnan.purusotama,
Pinned memory is not cached on the CPU on Tegra devices without I/O coherence, so you will notice a drastic increase in CPU time whenever the CPU accesses it.
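
One workaround that is not suggested in this thread, and is offered here only as an assumption to be measured, is to copy the pinned buffer once into ordinary heap memory and run the CPU-side processing on that cached copy. The helper name stageToCached is hypothetical. Whether it helps depends on how often step 4 touches each byte.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper (not from the thread).
// Takes one sequential pass over the uncached pinned buffer and returns a
// copy in regular, CPU-cached heap memory.
std::vector<unsigned char> stageToCached(const unsigned char* pinned, size_t bytes) {
    std::vector<unsigned char> staged(bytes);
    std::memcpy(staged.data(), pinned, bytes);
    return staged;
}

// Stand-in for the CPU-side work in step 4: repeated reads now hit the
// cached copy instead of the pinned buffer.
long long sumBytes(const std::vector<unsigned char>& data) {
    long long sum = 0;
    for (unsigned char v : data) sum += v;
    return sum;
}
```

The copy itself still reads uncached memory, so this only pays off when the CPU work reads the data more than once or in a non-sequential pattern.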

Thanks for the reply. This means we can’t manipulate pinned (page-locked) buffers on the CPU side. Is there any way to enable the cache, such as a compiler option?

Dear krishnan.purusotama,
The CPU cache is disabled for pinned memory to give a coherent view of the data to both the CPU and the GPU. It is a design decision. The next version of Tegra (Xavier) has cached pinned memory, as it has I/O coherence.
In your current use case, you can use unified memory.
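
As a rough illustration of the unified memory suggestion, the sketch below runs the four steps from the question on a single managed allocation. The kernel scaleKernel and the function managedRoundTrip are hypothetical placeholders for the real workload. The sketch assumes the CPU only touches the buffer while no kernel is running, which is why it synchronizes before step 4.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel: multiplies every byte by a factor.
__global__ void scaleKernel(unsigned char* data, size_t n, int factor) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] = (unsigned char)(data[i] * factor);
}

// Hypothetical round trip over one managed buffer.
// Returns the byte sum computed on the CPU, or -1 on a CUDA error.
long long managedRoundTrip(size_t n) {
    unsigned char* buf = nullptr;
    if (cudaMallocManaged((void**)&buf, n) != cudaSuccess) return -1;  // step 1

    for (size_t i = 0; i < n; ++i) buf[i] = 1;                         // step 2

    const int threads = 256;
    const int blocks = (int)((n + threads - 1) / threads);
    scaleKernel<<<blocks, threads>>>(buf, n, 2);                       // step 3

    // The CPU must not read the buffer until the GPU has finished.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFree(buf);
        return -1;
    }

    long long sum = 0;
    for (size_t i = 0; i < n; ++i) sum += buf[i];                      // step 4
    cudaFree(buf);
    return sum;
}
```

No explicit cudaMemcpy is needed because the same pointer is valid on both the host and the device.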

Hi SivaramaKrishna,

Thanks for the reply. Is there any other way to improve the memory transfer throughput?

I am using cudaMemcpy2DAsync to fill the ROI of the destination.
I also tried using cudaMemcpyAsync instead of cudaMemcpy2DAsync, after increasing the source buffer size to match the destination. When I profiled the code, there was no improvement in execution time. Is this behavior expected?

Thanks,
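
For context, a minimal sketch of the kind of ROI copy described above. The helper copyIntoRoi and its parameters are hypothetical and assume 8-bit, single-channel data with pitches given in bytes. The actual call used in the poster's code is not shown in the thread.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: copies a tightly packed or pitched host image of
// roiW x roiH bytes into the rectangle whose top-left corner is
// (roiX, roiY) inside a larger device image.
cudaError_t copyIntoRoi(unsigned char* dDst, size_t dstPitch,
                        const unsigned char* hSrc, size_t srcPitch,
                        size_t roiX, size_t roiY,
                        size_t roiW, size_t roiH,
                        cudaStream_t stream) {
    // Address of the first ROI byte in the destination image.
    unsigned char* roiStart = dDst + roiY * dstPitch + roiX;

    // width is in bytes; each of the roiH rows is copied separately,
    // advancing by srcPitch in the source and dstPitch in the destination.
    return cudaMemcpy2DAsync(roiStart, dstPitch, hSrc, srcPitch,
                             roiW, roiH, cudaMemcpyHostToDevice, stream);
}
```

The copy only overlaps with other work if hSrc is pinned and a non-default stream is used. With a pageable source the call may behave synchronously.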

Dear krishnan.purusotama,
If you use the iGPU, you can try unified memory. If you are using the dGPU, you can reduce the overall time by using streams to overlap computation with data transfer. Please refer to the Programming Guide in the CUDA Toolkit Documentation.
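
A minimal sketch of the stream-based overlap suggested above, assuming the work can be split into independent chunks. The kernel addOne and the function processInChunks are hypothetical stand-ins for the real processing, and the choice of four streams is arbitrary.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Placeholder kernel: increments every byte of its chunk.
__global__ void addOne(unsigned char* d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

// Hypothetical pipeline: each stream copies its chunk to the device,
// processes it, and copies it back, so transfers in one stream can overlap
// with kernels in another. hPinned must be page-locked host memory and
// n must be at least kStreams.
bool processInChunks(unsigned char* hPinned, size_t n) {
    const int kStreams = 4;
    if (n < (size_t)kStreams) return false;
    const size_t chunk = n / kStreams;

    unsigned char* dBuf = nullptr;
    if (cudaMalloc((void**)&dBuf, n) != cudaSuccess) return false;

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < kStreams; ++s) {
        const size_t off = (size_t)s * chunk;
        // The last chunk also takes any remainder.
        const size_t len = (s == kStreams - 1) ? n - off : chunk;
        const unsigned blocks = (unsigned)((len + 255) / 256);

        cudaMemcpyAsync(dBuf + off, hPinned + off, len,
                        cudaMemcpyHostToDevice, streams[s]);
        addOne<<<blocks, 256, 0, streams[s]>>>(dBuf + off, len);
        cudaMemcpyAsync(hPinned + off, dBuf + off, len,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then report whether anything failed.
    const bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int s = 0; s < kStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(dBuf);
    return ok;
}
```

How much overlap actually occurs depends on the device's copy engines and on the kernel's run time relative to the transfer, so the result should be checked in a profiler.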