VBOs don't improve performance? What am I doing wrong?

Hi:

When I enable VBOs, render time goes down 65x (from 2.0ms to 0.03ms), but kernel execution time increases (from 16ms to 17ms), so no overall gain is noticed. This happens on an 8800GT. On some low-end cards (8400GS and 8600M GT) the performance penalty is even worse.

Is this to be expected? It seems that cudaGLMapBufferObject and cudaGLUnmapBufferObject together perform worse than a single cudaMemcpy.

Here is the general algorithm:

No VBO:

  1. Initialize vertex array on the host.
  2. Transfer array to device with cudaMemcpy.
  3. Run the kernel on it, modifying array values.
  4. Read back modified array from device to host with cudaMemcpy.
  5. Use glVertexPointer & glDrawArrays to draw the vertex array on screen.
  6. Go to 3.

Notice that the array is not transferred back to the device (we don't go to 2), since it is not modified (only displayed) by the host between kernel calls.
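
A minimal sketch of the non-VBO steps above, for reference. The kernel `updateVertices`, the helper names `uploadVertices` and `stepNoVbo`, the `float4` vertex layout, and the block size are placeholders assumed for illustration; the real kernel does different work.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel: nudges each vertex along y.
__global__ void updateVertices(float4* v, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i].y += dt;
}

// Step 2: one-time upload of the host array to the device.
float4* uploadVertices(const float4* hostVerts, int n)
{
    float4* devVerts = 0;
    cudaMalloc((void**)&devVerts, n * sizeof(float4));
    cudaMemcpy(devVerts, hostVerts, n * sizeof(float4), cudaMemcpyHostToDevice);
    return devVerts;
}

// Steps 3-4: run the kernel, then read the result back to the host.
void stepNoVbo(float4* hostVerts, float4* devVerts, int n, float dt)
{
    int threads = 256;                        // assumed block size
    int blocks  = (n + threads - 1) / threads;
    updateVertices<<<blocks, threads>>>(devVerts, n, dt);

    // This cudaMemcpy waits for the kernel above to finish before copying.
    cudaMemcpy(hostVerts, devVerts, n * sizeof(float4), cudaMemcpyDeviceToHost);

    // Step 5 happens on the host side afterwards, e.g.:
    //   glVertexPointer(4, GL_FLOAT, 0, hostVerts);
    //   glDrawArrays(GL_POINTS, 0, n);
}
```

Each frame calls `stepNoVbo` once; `uploadVertices` is called only during setup.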

With VBO:

  1. Initialize vertex array on the host.
  2. Tell OpenGL this is a VBO with glBufferData; register this VBO in CUDA with cudaGLRegisterBufferObject.
  3. Map the VBO with cudaGLMapBufferObject.
  4. Run the kernel on it, modifying the values.
  5. Unmap the VBO with cudaGLUnmapBufferObject.
  6. Use glVertexPointer & glDrawArrays to draw the VBO on screen.
  7. Go to 3.
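
A minimal sketch of the VBO steps above, using the CUDA 1.1-era interop calls named in the list. It assumes an OpenGL context is already current and that GLEW (an assumption, any extension loader would do) has been initialized; the kernel `updateVertices`, the helper names, the `float4` layout, and the `GL_DYNAMIC_DRAW` usage hint are placeholders for illustration.

```cuda
#include <GL/glew.h>          // assumption: GLEW supplies the buffer-object entry points
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// Hypothetical placeholder kernel: nudges each vertex along y.
__global__ void updateVertices(float4* v, int n, float dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i].y += dt;
}

// Steps 1-2: create the VBO from host data and register it with CUDA.
GLuint createRegisteredVbo(const float4* hostVerts, int n)
{
    GLuint vbo = 0;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, n * sizeof(float4), hostVerts, GL_DYNAMIC_DRAW);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    cudaGLRegisterBufferObject(vbo);
    return vbo;
}

// Steps 3-6: map, run the kernel, unmap, draw.
void stepWithVbo(GLuint vbo, int n, float dt)
{
    float4* devVerts = 0;
    cudaGLMapBufferObject((void**)&devVerts, vbo);      // step 3

    int threads = 256;                                  // assumed block size
    int blocks  = (n + threads - 1) / threads;
    updateVertices<<<blocks, threads>>>(devVerts, n, dt);  // step 4

    cudaGLUnmapBufferObject(vbo);                       // step 5

    // Step 6: draw straight from the buffer object (offset 0 into the VBO).
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(4, GL_FLOAT, 0, 0);
    glDrawArrays(GL_POINTS, 0, n);
    glDisableClientState(GL_VERTEX_ARRAY);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}
```

Each frame calls `stepWithVbo` once; at shutdown the buffer would be released with `cudaGLUnregisterBufferObject` followed by `glDeleteBuffers`.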

So without VBO, at each cycle I do one device-to-host cudaMemcpy, and then OpenGL has to do a host-to-device transfer to display the result.

With VBO there should be none of these transfers between host and device at each cycle. However, kernel times get slightly worse, just enough that there is no gain (and in some cases it is even worse).

This happens both on Linux and on MacOSX, with CUDA 1.1.

Is this an expected behaviour or am I doing something wrong?

Thanks,

Paulo