Seeking the fastest method to transfer data from host to device

My CUDA program needs to display video in real time. I wrote the code below to transfer the video data from host to device (note: I use nppiMalloc_8u_C4() to allocate the device memory because I use the NPP library for some image processing):

// m_p422OriginHMem is a page-locked host memory block allocated by cudaHostAlloc() with the cudaHostAllocWriteCombined flag
ReadVideoData(m_p422OriginHMem);
// m_devPtr is a device memory block allocated by nppiMalloc_8u_C4()
cuRet = cudaMemcpy(m_devPtr, m_p422OriginHMem, videoMemSize, cudaMemcpyHostToDevice);
CU_ERRORCHECK(cuRet);
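
For reference, a minimal sketch of how the two buffers mentioned in the comments might be allocated (the width and height names are illustrative; videoMemSize is the same size used in the copy above):

// pinned (page-locked) host buffer; write-combined memory is intended to speed up host-to-device copies
cudaError_t cuRet = cudaHostAlloc((void**)&m_p422OriginHMem, videoMemSize, cudaHostAllocWriteCombined);
CU_ERRORCHECK(cuRet);
// device buffer allocated through NPP so it can be fed to the NPP image-processing routines
// (rows may be padded; the actual row pitch is returned in nStepBytes)
int nStepBytes = 0;
m_devPtr = nppiMalloc_8u_C4(width, height, &nStepBytes);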

I measured the performance of the code above and found that it runs slower than the code I wrote previously. That earlier code uses two PBOs to transfer the video into an OpenGL texture and looks like this:

// use the Display PBO to upload the video data already stored in it into the texture
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT, m_pboVideoInput[m_iDisplayPbo]);
glBindTexture(GL_TEXTURE_2D, texFore);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, vf.Width/2, vf.Height, GL_BGRA_EXT, GL_UNSIGNED_BYTE, NULL);
// use the Transfer PBO to copy the new video data into a PBO
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT, m_pboVideoInput[m_iTransferPbo]);
BYTE *pDmaAddr = static_cast<BYTE*>(glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT, GL_WRITE_ONLY));
ReadVideoData(pDmaAddr);
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT);
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_EXT, 0);
GL_ERRORCHECK;
// swap the Display & Transfer PBOs so the data copied this frame is displayed next frame
std::swap(m_iTransferPbo, m_iDisplayPbo);

The code above follows an NVIDIA optimization document: it uses two PBOs, one for display and one for transfer, and swaps them every render pass to get the best performance out of asynchronous I/O. I also modified my CUDA code to use two page-locked host memory blocks, just like the PBO code, but there was no improvement.
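
For reference, a rough sketch of the two-buffer pinned-memory variant I tried (the m_pPinnedBuf, m_iDisplayBuf and m_iTransferBuf names are illustrative):

// copy the pinned buffer that was filled last frame to the device
cuRet = cudaMemcpy(m_devPtr, m_pPinnedBuf[m_iDisplayBuf], videoMemSize, cudaMemcpyHostToDevice);
CU_ERRORCHECK(cuRet);
// read the next frame into the other pinned buffer
ReadVideoData(m_pPinnedBuf[m_iTransferBuf]);
// swap the buffer roles for the next render pass
std::swap(m_iTransferBuf, m_iDisplayBuf);

Since cudaMemcpy() blocks until the copy finishes, the two buffers alone do not overlap the copy with ReadVideoData().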

How can I get the highest transfer speed with CUDA?

My display card is a GTX 470 with the 306.94 driver.

You would want to look at CUDA streams and asynchronous memory transfers via cudaMemcpyAsync() for fast copies across PCIe in either direction.

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDART__MEMORY_g732efed5ab5cb184c920a21eb36e8ce4.html

cudaMemcpyAsync() requires pinned host memory, but it seems your app already uses that. cudaMemcpy() in most cases first copies the user data into a driver-internal, pinned host memory buffer, then uses the GPU's DMA engine to transfer the data from there to the device. With cudaMemcpyAsync() and pinned host memory as the source, the data is transferred directly from the user's buffer to the device via DMA.
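
A minimal sketch of what that could look like with one extra stream (the copyStream, m_pPinnedBuf, m_iDisplayBuf and m_iTransferBuf names are illustrative):

// created once at startup
cudaStream_t copyStream;
cudaError_t cuRet = cudaStreamCreate(&copyStream);
CU_ERRORCHECK(cuRet);

// per frame: issue the host-to-device copy asynchronously on the stream
cuRet = cudaMemcpyAsync(m_devPtr, m_pPinnedBuf[m_iDisplayBuf], videoMemSize, cudaMemcpyHostToDevice, copyStream);
CU_ERRORCHECK(cuRet);
// the CPU can read the next frame into the other pinned buffer while the DMA runs
ReadVideoData(m_pPinnedBuf[m_iTransferBuf]);
// wait for the copy to finish before the device data is consumed
cuRet = cudaStreamSynchronize(copyStream);
CU_ERRORCHECK(cuRet);
// swap the buffer roles for the next frame
std::swap(m_iTransferBuf, m_iDisplayBuf);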

Thanks, njuffa