cudaMemcpy vs glReadPixels Time Performance

Hi, I am new to CUDA. I know GLSL programming and find CUDA a little different.

I know that the big bottleneck in GPU programming with GLSL was the time spent on data transfer between the CPU and the GPU. There are a few ways I know of to do those transfers: glReadPixels(), glDrawPixels(), glCopy…, etc.

In CUDA we have cudaMemcpy(), with an option to choose the direction of the transfer.

My question (perhaps asked before): is data transfer in CUDA more expensive than with the OpenGL functions above? If not, why does cudaMemcpy() fare better than glReadPixels() in normal OpenGL programming?

P.S.: I may not have the function names exactly right; I just want to get my question across.

The CUDA memory transfer operations are very well optimized and can get close to the theoretical maximum bandwidth of the PCI Express bus.

The OpenGL pixel transfer functions are much more flexible (they offer many reformatting options), but if you use the pixel buffer object (PBO) extension, which essentially gives you pinned memory, they can get close to the CUDA performance.

If you want to compare CUDA and OpenGL performance, make sure the components do not get swizzled (reordered) in OpenGL when they are transferred to the CPU.

In other words, when using a 4-component GL_UNSIGNED_BYTE format, you’re better off using GL_BGRA instead of GL_RGBA.
For floating-point buffers, use GL_RGBA.
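To make "swizzled" concrete: when the format you request does not match the order the data is stored in, every pixel's components have to be reordered somewhere during the transfer. The plain C++ loop below only illustrates that extra per-pixel work. It assumes packed 8-bit pixels, the function name swizzle_rgba_to_bgra is made up for this post, and it is not a description of how any driver actually does it.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Illustration only: reorders packed 8-bit RGBA pixels to BGRA in place by
// swapping the first and third byte of every 4-byte pixel.
void swizzle_rgba_to_bgra(std::vector<std::uint8_t>& pixels) {
    for (std::size_t i = 0; i + 3 < pixels.size(); i += 4) {
        std::swap(pixels[i], pixels[i + 2]);  // R <-> B; G and A stay put
    }
}
```

A transfer that needs no reordering can be a straight block copy, which is the case you want on the OpenGL side if the comparison with cudaMemcpy() is to be fair.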