Slow readback from PBuffer to CUDA memory

I’m just wondering if anybody has had any luck transferring data from a 16-bit float PBuffer to CUDA memory via a PBO at fast speeds. If I use an 8-bit PBuffer and 8-bit PBO data, I get pretty good speeds, but I need to use a 16-bit float PBuffer with 10-bit integer data (10_10_10_2 packing) in CUDA memory.
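For reference, the 10_10_10_2 packing mentioned above can be sketched as a small host-side helper. The bit order (which channel sits in the low bits) is an assumption here and depends on the exact GL format in use; the helper name is hypothetical:

```cpp
#include <cstdint>
#include <algorithm>

// Pack four normalized [0,1] floats into one 10_10_10_2 word.
// Assumed layout: bits 0-9 red, 10-19 green, 20-29 blue, 30-31 alpha.
uint32_t pack_10_10_10_2(float r, float g, float b, float a) {
    auto q = [](float v, uint32_t maxv) {
        v = std::min(1.0f, std::max(0.0f, v)); // clamp to [0,1]
        return (uint32_t)(v * maxv + 0.5f);    // quantize, round to nearest
    };
    return q(r, 1023) | (q(g, 1023) << 10) | (q(b, 1023) << 20) | (q(a, 3) << 30);
}
```

The same bit arithmetic would run unchanged inside a CUDA kernel, which is where the packing would normally happen after mapping the PBO.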

I’m using the technique shown in the postProcessGL example program in the SDK, but I’m not getting good speeds with anything other than 8-bit packing.

Are there any faster methods to read back the data from a 16-bit float PBuffer?


Still using PBuffers? I’d recommend using FBOs with attached textures or renderbuffers.


Can you profile the kernel (using the profiler supplied in the CUDA SDK)? I haven’t really done too much with OpenGL, but my off-the-cuff guess is that reading the 10-bit integers is causing extra memory reads somewhere.

Actually, I have completely disabled the kernel now. The problem seems to be the glReadPixels call itself, reading back from the 16-bit float PBuffer into the 10-bit OpenGL PBO. If I use an 8-bit PBuffer and an 8-bit PBO, I get excellent speeds.

Is there a better method to get screen/off-screen rendered data using OpenGL back to CUDA for processing?

Hmmm I can switch to using FBOs, but it seems I still need to use glReadPixels to go back to CUDA via a PBO. Or is there a better way?

I’ll try this, but I suspect I’ll run into the same problem when I try to read non-8-bit data.

Check the last post here :…hl=glReadPixels

glReadPixels is apparently very slow.

That’s a weird post in your link. The poster suggests using glTexSubImage instead of glReadPixels, but glTexSubImage transfers data from a PBO to a texture, not from a framebuffer/FBO to a PBO.

Maybe he meant to say glGetTexImage, but that seems very unlikely, as you would need a GL_PIXEL_PACK_BUFFER instead of a GL_PIXEL_UNPACK_BUFFER.

I used glGetTexImage with a GL_PIXEL_PACK_BUFFER without problems.


Ok I made some progress with this.

This is what I am doing now:

  1. Render to a 16-bit half float PBuffer/FBO
  2. glReadPixels as GL_HALF_FLOAT_NV in an RGBA format back to a PBO (this is quite fast since the formats of the PBO and FBO match)
  3. Map to CUDA device memory using CUDA 2.2 (this is quite fast as well)

Where I am stuck is that I am unable to read the half-float array in CUDA, since CUDA doesn’t seem to support 16-bit floats.

Any ideas anyone?
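In the meantime, since the mapped PBO is just raw 16-bit words, the IEEE 754 binary16 bit pattern can be decoded by hand. This is a host-side C++ sketch (the function name is made up); the same bit logic could be ported into a kernel that reads the data as unsigned short:

```cpp
#include <cstdint>
#include <cmath>

// Decode one IEEE 754 binary16 value (stored as a uint16_t) to float.
// Layout: 1 sign bit, 5 exponent bits (bias 15), 10 mantissa bits.
float halfToFloat(uint16_t h) {
    uint32_t sign = (h >> 15) & 0x1;
    uint32_t exp  = (h >> 10) & 0x1F;
    uint32_t mant = h & 0x3FF;
    float s = sign ? -1.0f : 1.0f;
    if (exp == 0)                       // zero or subnormal
        return s * std::ldexp((float)mant, -24);
    if (exp == 31)                      // Inf or NaN
        return mant ? NAN : s * INFINITY;
    return s * std::ldexp(1.0f + mant / 1024.0f, (int)exp - 15);
}
```

Per-pixel bit twiddling like this does add ALU work per element, so it is a stopgap rather than a substitute for native support.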

You could wait for CUDA 2.3 to be released. It has new support for fp16 <-> fp32 conversion intrinsics, which allows storing data in fp16 format while computing in fp32, or you can use the Driver API, which supports fp16 array formats.
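For illustration, the storage format those intrinsics convert to can be sketched with a minimal host-side encoder. This is a simplification (it truncates the mantissa instead of rounding to nearest even, and flushes subnormals to zero), so it is not bit-exact with a proper fp32-to-fp16 conversion for all inputs:

```cpp
#include <cstdint>
#include <cstring>

// Minimal float -> IEEE 754 binary16 encode (truncating, simplified:
// no round-to-nearest, subnormal results flushed to signed zero).
uint16_t floatToHalf(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);            // reinterpret float bits
    uint16_t sign = (bits >> 16) & 0x8000;
    int32_t  exp  = (int32_t)((bits >> 23) & 0xFF) - 127 + 15; // rebias 8->5 bit
    uint32_t mant = (bits >> 13) & 0x3FF;           // keep top 10 mantissa bits
    if (exp <= 0)  return sign;                     // underflow: flush to zero
    if (exp >= 31) return sign | 0x7C00;            // overflow: clamp to Inf
    return sign | (uint16_t)(exp << 10) | (uint16_t)mant;
}
```

Values that are exactly representable in binary16 (powers of two, small integers) round-trip exactly even with this truncating version.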


Yup I tried the CUDA 2.3 beta, and that does indeed solve my problem. Excellent - thanks everyone for your suggestions and help.