I’m just wondering if anybody has had any luck transferring data from a 16-bit float PBuffer to CUDA memory via a PBO at fast speeds. If I use an 8-bit PBuffer and 8-bit PBO data, I get pretty good speeds. I need to use a 16-bit float PBuffer, and 10-bit integer data (10_10_10_2 packing) in CUDA memory.
I’m using the technique as shown in the postProcessGL example program in the SDK but am not getting good speeds if I use anything other than 8-bit packing.
Are there any faster methods to read back the data from a 16-bit float PBuffer?
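For reference, the readback path I’m using looks roughly like this. It’s a minimal sketch of the postProcessGL approach showing the 8-bit case that works fine for me; the buffer size, the kernel call, and the width/height variables are placeholders:

```cpp
// Sketch of the PBO readback path (CUDA 2.x runtime GL interop API, as in
// the postProcessGL sample). Assumes a current GL context and CUDA GL device.
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

GLuint pbo;  // pixel buffer object used for readback

void createReadbackPBO(int width, int height)
{
    glGenBuffers(1, &pbo);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    // 4 bytes per pixel for 8-bit RGBA; fp16 RGBA would need width * height * 8
    glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, 0, GL_STREAM_READ);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
    cudaGLRegisterBufferObject(pbo);  // make the PBO visible to CUDA
}

void readbackToCuda(int width, int height)
{
    // Read the current framebuffer / PBuffer contents into the PBO.
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

    // Map the PBO into CUDA's address space and run a kernel on it.
    void* devPtr = 0;
    cudaGLMapBufferObject(&devPtr, pbo);
    // myKernel<<<grid, block>>>((uchar4*)devPtr, width, height);  // placeholder
    cudaGLUnmapBufferObject(pbo);
}
```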
Can you profile the kernel (using the profiler supplied in the CUDA SDK)? I haven’t really done too much with OpenGL but my off-the-cuff guess is that reading the 10-bit integers is causing extra memory reads somewhere.
Actually, I have completely disabled the kernel now. The issue seems to be just the glReadPixels call that reads back from the 16-bit float PBuffer into the 10-bit OpenGL PBO. If I use an 8-bit PBuffer and an 8-bit PBO I get excellent speeds.
Is there a better method to get screen/off-screen rendered data using OpenGL back to CUDA for processing?
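The slow call is essentially this (a sketch; the 10_10_10_2 type enum is my best guess at what my packing maps to, and pbo/width/height are the same placeholders as above). My suspicion is that the type no longer matching the PBuffer’s fp16 internal format forces a conversion off the fast path:

```cpp
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);

// Fast case: type matches the 8-bit PBuffer, so the driver can transfer directly.
// glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, 0);

// Slow case: fp16 PBuffer read back as packed 10_10_10_2 integers; the
// conversion presumably happens on the CPU (or a slow driver path).
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_INT_10_10_10_2, 0);

glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```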
That’s a weird post in your link. The poster suggests using glTexSubImage instead of glReadPixels, but glTexSubImage is used to transfer data from a PBO to a texture, not for transferring from a framebuffer/FBO to a PBO.
Maybe he meant to say glGetTexImage, but this seems very unlikely, as you would need a PIXEL_PACK_BUFFER instead of a PIXEL_UNPACK_BUFFER.
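To illustrate the distinction (a sketch, assuming an existing GL_TEXTURE_2D named tex, a PBO named pbo, and width/height variables):

```cpp
// Upload direction: PBO -> texture. The PBO is bound as a PIXEL_UNPACK_BUFFER
// and the data pointer argument becomes an offset into it.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// Readback direction: texture -> PBO. This direction needs a PIXEL_PACK_BUFFER.
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```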
You could wait for CUDA 2.3 to be released. It has new support for fp16 <-> fp32 conversion intrinsics, which allow storing data in fp16 format while computing in fp32. Alternatively, you could use the Driver API, which supports fp16 array formats.
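For example, once 2.3 is out you could keep the buffer in fp16 (treating it as raw 16-bit values) and convert inside the kernel, something along these lines (the data layout and the scale operation are just placeholders):

```cpp
// Sketch of the fp16-storage / fp32-compute pattern using the conversion
// intrinsics __half2float and __float2half_rn (device code only).
#include <cuda_runtime.h>

__global__ void scaleHalfData(unsigned short* data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float v = __half2float(data[i]);   // fp16 -> fp32
        v *= scale;                        // compute in fp32
        data[i] = __float2half_rn(v);      // fp32 -> fp16, round to nearest
    }
}
```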