CUDA to VBO transfer problems

Windows XP, quad core, Quadro FX 5600
Using CUDA 0.8, driver 97.73, SDK 10


I am trying to transfer a CUDA array to a VBO.

I create the VBO vboid and do the following to allocate sz bytes:

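   void *p = NULL;
   glBindBufferARB ( GL_ARRAY_BUFFER_ARB, vboid );
   glBufferDataARB ( GL_ARRAY_BUFFER_ARB, sz, p, GL_DYNAMIC_DRAW_ARB );   // allocate sz bytes; the usage hint here is assumed
   glBindBufferARB ( GL_ARRAY_BUFFER_ARB, 0 );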

Here, I want p, the data the VBO will be initialized with, to be NULL. However, when I try to register the VBO using cudaGLRegisterBufferObject, it crashes.

If I allocate a dummy array of size sz for p and pass it in, the register and the subsequent cudaMemcpy work perfectly. However, I don’t want to have to initialize it, since I am just going to trample it anyway!

Or is there a better way of doing this?


It is a known bug (see

The workaround, as you found out, is to allocate a dummy array to pass to glBufferData.
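A minimal sketch of that workaround, assuming GLEW is used to load the ARB buffer entry points and a GL context is already current. The helper name CreateRegisteredVBO and the GL_DYNAMIC_DRAW_ARB usage hint are illustrative choices, not something from this thread:

```cuda
#include <stdlib.h>
#include <GL/glew.h>            // assumption: GLEW supplies the gl*ARB buffer functions
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hypothetical helper: creates a VBO of sz bytes, initialized from a throwaway
// zeroed array, and registers it with CUDA. Returns 0 on failure.
GLuint CreateRegisteredVBO(size_t sz)
{
    GLuint vboid = 0;
    void *dummy = calloc(1, sz);         // throwaway initializer, contents never used
    if (dummy == NULL) return 0;

    glGenBuffersARB(1, &vboid);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vboid);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, (GLsizeiptrARB)sz, dummy, GL_DYNAMIC_DRAW_ARB);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
    free(dummy);                         // GL has copied the data by the time glBufferDataARB returns

    if (cudaGLRegisterBufferObject(vboid) != cudaSuccess) {
        glDeleteBuffersARB(1, &vboid);
        return 0;
    }
    return vboid;
}
```

The dummy array only has to live for the duration of the glBufferDataARB call, so the cost is a one-time host allocation and upload per VBO.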

Is there something special that needs to be done to sync for the end of the transfer?

As a test, I have a VBO that contains the positions of an object I am drawing. I transfer it to CUDA memory, transfer it back to a VBO, and then draw using the VBO.

I am registering, mapping, copying, unmapping and then unregistering.

The effect I get is on the first render I get some garbage – it seems to be a stale version of the data.

If I then redraw at a later time, using the same VBO, no extra copies – it gives me the correct result.

Is this a known issue? Am I supposed to call some sort of sync function?
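For reference, the register, map, copy, unmap, unregister sequence can be sketched roughly as below. The function name CopyDeviceToVBO is made up for illustration, and the cudaThreadSynchronize() call before the unmap is only an assumption about what might be worth trying; nothing in this thread confirms that it is required or that it cures the stale first frame. Newer toolkits spell that call cudaDeviceSynchronize().

```cuda
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hypothetical helper: copies sz bytes from a device buffer into a VBO that was
// created with non-NULL initial data. Returns the first error encountered.
cudaError_t CopyDeviceToVBO(GLuint vboid, const void *devSrc, size_t sz)
{
    void *mapped = NULL;

    cudaError_t err = cudaGLRegisterBufferObject(vboid);
    if (err != cudaSuccess) return err;

    err = cudaGLMapBufferObject(&mapped, vboid);
    if (err == cudaSuccess) {
        err = cudaMemcpy(mapped, devSrc, sz, cudaMemcpyDeviceToDevice);

        // Assumption: wait for the device to finish before handing the buffer back to GL.
        if (err == cudaSuccess) err = cudaThreadSynchronize();

        // Always unmap once the map succeeded, but report the earliest failure.
        cudaError_t unmapErr = cudaGLUnmapBufferObject(vboid);
        if (err == cudaSuccess) err = unmapErr;
    }

    cudaError_t unregErr = cudaGLUnregisterBufferObject(vboid);
    return (err != cudaSuccess) ? err : unregErr;
}
```

Checking the return value of every step at least rules out a silently failed map or unmap as the source of the garbage.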


I rejigged my code to reduce the impact of this workaround and as a result smacked into another problem, which I am now working around again.

What I wanted to do was:

void CalculateSomethingSpectacular( unsigned int outputvboid )
{
     void * p = MapVBOToCUDA( outputvboid );

     CalculateKernel( p );

     UnmapVBOFromCUDA( outputvboid );
}


However, if I did this, future cudaMallocs etc would return cudaError 10201 …
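To narrow down which call first goes wrong, the direct-write path can be written with an error check after every step. This is only a sketch: FillKernel is a hypothetical stand-in for the real kernel, FillVBODirect is an invented name, and the VBO is assumed to have been registered already with cudaGLRegisterBufferObject.

```cuda
#include <GL/glew.h>
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hypothetical stand-in for CalculateKernel: writes each element's index into it.
__global__ void FillKernel(float *out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (float)i;
}

// Writes n floats straight into an already registered VBO, with no temporary buffer.
cudaError_t FillVBODirect(GLuint outputvboid, unsigned int n)
{
    void *p = NULL;
    cudaError_t err = cudaGLMapBufferObject(&p, outputvboid);
    if (err != cudaSuccess) return err;

    const unsigned int threads = 256;
    const unsigned int blocks = (n + threads - 1) / threads;
    if (blocks > 0) {                                   // a zero-block launch is invalid
        FillKernel<<<blocks, threads>>>((float *)p, n);
        err = cudaGetLastError();                       // catches launch failures
        if (err == cudaSuccess)
            err = cudaThreadSynchronize();              // catches failures during execution
    }

    // Unmap even if the kernel failed, but report the earliest error.
    cudaError_t unmapErr = cudaGLUnmapBufferObject(outputvboid);
    return (err != cudaSuccess) ? err : unmapErr;
}
```

If the kernel writes past the end of the mapped buffer, the error may only surface on a later, unrelated runtime call, which is one possible reading of the failing cudaMalloc; checking right after the launch and the synchronize would help tell the two cases apart.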

So instead, I need to compute into a temporary buffer and then copy the result in:

void CalculateSomethingSpectacularButSlower( unsigned int outputvboid, void *pTempDeviceBuffer )
{
     CalculateKernel( pTempDeviceBuffer );

     void * p = MapVBOToCUDA( outputvboid );

     cudaMemcpy( p, pTempDeviceBuffer, sz, cudaMemcpyDeviceToDevice );

     UnmapVBOFromCUDA( outputvboid );
}


OK, this works, but it uses extra memory, requires temporary buffer management, and adds an extra memcpy. Yes, the copy is blindingly fast since it stays on the card, but it is still unnecessary.

(!) As a side note, I imagine that cudaGLMapBufferObject actually copies the VBO contents to CUDA memory as part of the mapping. It would be nice if the function took an optional parameter “in_bPreserveOriginalData” that could be set to “false” when we don’t really care about the original contents. A post-blur effect would be able to set it to true, so the copy is done, and a kernel that purely generates data, overwriting everything, would just set it to false.

Similarly, the unmap should have such a parameter, so that two things can be done: 1) the caller can signal that although the data was mapped, it never changed and hence does not need to be copied back, and 2) if something failed, there is no point copying bogus data back.

Or is this latter optimization done automatically using some lower level mechanism we don’t see?

Thanks for any comments in advance,