Hi All,
I have code with the following implementation (a minimal sketch is included after the list):
- Allocate memory during initialization (Host)
- Fill data into allocated buffer (Host)
- Transfer data to the device and run the kernel
- Use the host buffer on the CPU side
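Roughly, the pinned-memory version looks like the sketch below. The buffer size matches my case, but the kernel name, the work it does, and the variable names are placeholders, not my actual code.

#include <cuda_runtime.h>
#include <cstdio>

__global__ void processKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * 2.0f;   // placeholder work, not my real kernel
}

int main()
{
    const int n = 1024 * 640;
    const size_t bytes = n * sizeof(float);

    // Step 1: allocate during initialization (host) - pinned (page-locked) memory
    float *hIn = nullptr, *hOut = nullptr;
    cudaHostAlloc((void**)&hIn,  bytes, cudaHostAllocDefault);
    cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocDefault);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dIn,  bytes);
    cudaMalloc((void**)&dOut, bytes);

    // Step 2: fill data into the allocated buffer (host)
    for (int i = 0; i < n; ++i)
        hIn[i] = (float)i;

    // Step 3: transfer data to the device and run the kernel
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    processKernel<<<(n + 255) / 256, 256>>>(dIn, dOut, n);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaDeviceSynchronize();

    // Step 4: use the host buffer on the CPU side
    // (this is the step that takes 3 ms on the Quadro but 54 ms on the Drive PX2)
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += hOut[i];
    printf("sum = %f\n", sum);

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}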
I ran the same code on two different pieces of hardware. The results are as follows:
Device: Quadro M1000M (compute capability 5.0)
I implemented both the pinned (page-locked) memory and mapped memory methods; the timing difference between the two is negligible.
- Allocate memory during initialization (Host): done only once, so I did not profile this part
- Fill data into allocated buffer (Host): 1 ms (1024 x 640 buffer)
- Transfer data to the device and run the kernel: 1 ms for the transfer
- Use the host buffer on the CPU side: 3 ms
Device: Drive PX2 (dGPU, compute capability 6.1)
I implemented both the pinned (page-locked) memory and mapped memory methods; the timing difference between the two is negligible.
- Allocate memory during initialization (Host): done only once, so I did not profile this part
- Fill data into allocated buffer (Host): 3 ms (1024 x 640 buffer)
- Transfer data to the device and run the kernel: 1 ms for the transfer
- Use the host buffer on the CPU side: 54 ms
The same pinned-memory code increases the overall CPU run time drastically on the Drive PX2.
Note: the CUDA device flag for mapped memory is set during initialization, roughly as in the sketch below.
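For the mapped-memory version, the initialization looks approximately like this (again placeholder names, not my actual code):

#include <cuda_runtime.h>

// Mapped (zero-copy) setup, called once during initialization
void initMapped(float** hBuf, float** dBuf, size_t bytes)
{
    // The device flag must be set before the CUDA context is created
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Host allocation that is visible to the device
    cudaHostAlloc((void**)hBuf, bytes, cudaHostAllocMapped);

    // Device-side alias of the same host buffer
    cudaHostGetDevicePointer((void**)dBuf, (void*)*hBuf, 0);

    // The kernel is then launched with the device pointer (no cudaMemcpy),
    // and the CPU reads the host pointer directly afterwards.
}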
Please let me know what is missing in my implementation.
Thanks,