Hello, I have recently been working on GPU acceleration for an application, and I am puzzled by the low CPU-side performance of memory allocated with cudaMallocHost.
I have many buffers, the As and the Bs, each around 100 MB. For each buffer, I need to copy some data from A to B on the host, and the buffers may also be copied to the GPU when the application needs them, so I allocate them with cudaMallocHost to get high transfer bandwidth and asynchronous transfers.
But the copy rate from A to B is low. When I allocate the buffers with the C++ "new" operator instead, with no other changes, the copy rate increases from 4 GB/s to around 6 GB/s. I am confused about why cudaMallocHost would hurt CPU access performance. Isn't the memory simply page-locked so that the operating system cannot page it out? Has anyone run into the same problem? Thanks!
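In case it helps, here is a minimal sketch of the kind of comparison I am describing. It is not my actual application code: the helper copy_rate_gbps, the single 100 MiB buffer pair, and the use of std::memcpy for the A-to-B copy are all assumptions made for illustration. Each buffer is touched once before timing so that first-touch page faults on the "new" buffers are not counted.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <cstring>

// Hypothetical helper: time one host-to-host memcpy and return the rate in GB/s.
static double copy_rate_gbps(char* dst, const char* src, size_t bytes) {
    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst, src, bytes);
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return (static_cast<double>(bytes) / 1e9) / seconds;
}

int main() {
    const size_t bytes = static_cast<size_t>(100) << 20;  // 100 MiB, assumed size

    // Pinned (page-locked) buffers A and B.
    char* a_pinned = nullptr;
    char* b_pinned = nullptr;
    if (cudaMallocHost((void**)&a_pinned, bytes) != cudaSuccess ||
        cudaMallocHost((void**)&b_pinned, bytes) != cudaSuccess) {
        std::fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }

    // Pageable buffers A and B from the C++ "new" operator.
    char* a_pageable = new char[bytes];
    char* b_pageable = new char[bytes];

    // Touch every buffer once so all pages are resident before timing.
    std::memset(a_pinned, 1, bytes);
    std::memset(b_pinned, 0, bytes);
    std::memset(a_pageable, 1, bytes);
    std::memset(b_pageable, 0, bytes);

    // Time the same A-to-B copy for both allocation methods.
    double pinned_rate = copy_rate_gbps(b_pinned, a_pinned, bytes);
    double pageable_rate = copy_rate_gbps(b_pageable, a_pageable, bytes);

    // Reading the destinations keeps the copies from being optimized away.
    std::printf("pinned:   %.2f GB/s (last byte %d)\n",
                pinned_rate, b_pinned[bytes - 1]);
    std::printf("pageable: %.2f GB/s (last byte %d)\n",
                pageable_rate, b_pageable[bytes - 1]);

    delete[] a_pageable;
    delete[] b_pageable;
    cudaFreeHost(a_pinned);
    cudaFreeHost(b_pinned);
    return 0;
}
```

This should build with something like "nvcc -O2 bench.cu -o bench" (the file name is arbitrary). A single timed copy is noisy, so repeating the measurement several times and averaging would give more trustworthy numbers.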