My CUDA program needs to display video in real time. I wrote the code below to transfer video data from host to device (note: I allocate the device memory with “nppiMalloc_8u_C4()” because I use the NPP library for some image processing):
// m_p422OriginHMem is a page-locked host memory block allocated by “cudaHostAlloc()” with the “cudaHostAllocWriteCombined” flag
ReadVideoData(m_p422OriginHMem);
// m_devPtr is a device memory block allocated by “nppiMalloc_8u_C4()”
cuRet = cudaMemcpy(m_devPtr,m_p422OriginHMem,videoMemSize,cudaMemcpyHostToDevice);
CU_ERRORCHECK(cuRet);
I measured the performance of the code above and found that it runs slower than the code I wrote previously, which uses two PBOs to transfer video into an OpenGL texture. The earlier code looks like this:
// use Display PBO to upload video data stored in PBO
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT,m_pboVideoInput[m_iDisplayPbo]);
glBindTexture(GL_TEXTURE_2D,texFore);
glTexSubImage2D(GL_TEXTURE_2D,0,0,0,vf.Width/2,vf.Height,GL_BGRA_EXT,GL_UNSIGNED_BYTE,NULL);
// use Transfer PBO to copy video data to PBO
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT,m_pboVideoInput[m_iTransferPbo]);
BYTE *pDmaAddr = static_cast<BYTE *>(glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT,GL_WRITE_ONLY));
ReadVideoData(pDmaAddr);
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT,0);
GL_ERRORCHECK;
// swap the display and transfer PBOs so the video data just copied is displayed on the next render
std::swap(m_iTransferPbo,m_iDisplayPbo);
The code above follows an NVIDIA optimization document: it uses two PBOs, one for display and one for transfer, and swaps them on every render so that it gets the best performance from asynchronous I/O. I also modified my CUDA code to use two page-locked host buffers, just like the PBO version, but there was no improvement.
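For comparison, here is a minimal sketch of how the two-buffer scheme might be expressed with the CUDA runtime, assuming the copy is issued with cudaMemcpyAsync() on a per-buffer stream (the post does not say whether the two-buffer CUDA version used cudaMemcpy() or cudaMemcpyAsync()). The names UploadFramesDoubleBuffered and ReadVideoDataStub are hypothetical, and plain cudaMalloc() stands in for nppiMalloc_8u_C4() to keep the example self-contained.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for the post's ReadVideoData(): fills one frame.
static void ReadVideoDataStub(unsigned char *dst, size_t bytes, int frame)
{
    std::memset(dst, frame & 0xFF, bytes);
}

// Runs a CUDA call only if everything so far succeeded, and records failure.
#define CK(call) ok = ok && ((call) == cudaSuccess)

// Uploads frameCount frames of frameBytes each. While the copy of one frame
// is in flight from one pinned buffer, the CPU fills the other pinned buffer.
// Returns true if every CUDA call succeeded.
bool UploadFramesDoubleBuffered(int frameCount, size_t frameBytes)
{
    unsigned char *hostBuf[2] = {nullptr, nullptr};
    unsigned char *devBuf[2]  = {nullptr, nullptr};
    cudaStream_t   stream[2]  = {nullptr, nullptr};
    bool ok = true;

    for (int i = 0; i < 2; ++i) {
        // Page-locked, write-combined host memory, as in the post.
        CK(cudaHostAlloc((void **)&hostBuf[i], frameBytes,
                         cudaHostAllocWriteCombined));
        // Assumption: cudaMalloc replaces the post's nppiMalloc_8u_C4().
        CK(cudaMalloc((void **)&devBuf[i], frameBytes));
        CK(cudaStreamCreate(&stream[i]));
    }

    for (int frame = 0; frame < frameCount && ok; ++frame) {
        const int cur = frame & 1;
        // Make sure the previous copy out of this pinned buffer has
        // finished before the CPU overwrites it.
        CK(cudaStreamSynchronize(stream[cur]));
        if (ok) ReadVideoDataStub(hostBuf[cur], frameBytes, frame);
        // Returns immediately; the DMA transfer proceeds while the next
        // iteration fills the other pinned buffer.
        CK(cudaMemcpyAsync(devBuf[cur], hostBuf[cur], frameBytes,
                           cudaMemcpyHostToDevice, stream[cur]));
        // Kernels or NPP calls that consume devBuf[cur] would be queued
        // on stream[cur] here.
    }
    CK(cudaDeviceSynchronize());

    for (int i = 0; i < 2; ++i) {
        if (stream[i])  cudaStreamDestroy(stream[i]);
        if (devBuf[i])  cudaFree(devBuf[i]);
        if (hostBuf[i]) cudaFreeHost(hostBuf[i]);
    }
    return ok;
}
```

A call such as UploadFramesDoubleBuffered(100, videoMemSize) would exercise the pattern. With a blocking cudaMemcpy() the host waits for each transfer to finish, so a second pinned buffer alone would not be expected to create any overlap.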
How can I get the highest transfer speed with CUDA?
My graphics card is a GTX 470 with driver 306.94.