CUDA access to VBO slow?

Hello,

I wrote a test case to compare updating a vertex buffer object (VBO) from the application layer versus updating it directly from CUDA. In both cases the mesh is just a flat plane, and I add 0.05 to the Y value of each vertex before every draw call. Both versions run at about the same FPS, which surprised me, since the CUDA implementation doesn't have to upload the values or run a for loop (it uses one thread per vertex).
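
For reference, the application-layer update described above amounts to something like the following. This is only a minimal sketch, not the actual test code: the function name `raise_plane` is made up for illustration, the vertices are assumed to be tightly packed x, y, z floats, and the upload of the modified array back into the VBO is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: adds `dy` to the Y component of every vertex.
// Assumption: vertices are stored as tightly packed x, y, z floats.
void raise_plane(std::vector<float>& vertices, float dy) {
    assert(vertices.size() % 3 == 0);  // whole vertices only
    // Index 1 is the Y of the first vertex; step 3 floats to reach the next Y.
    for (std::size_t i = 1; i < vertices.size(); i += 3) {
        vertices[i] += dy;
    }
}
```

Calling `raise_plane(vertices, 0.05f)` once per frame corresponds to the "+= 0.05" step; the CUDA version does the same work with one thread per vertex instead of the loop.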

Is this expected, or am I missing something? I map and unmap the buffer on every call, register it during initialisation, and unregister it at shutdown. Do I have to do anything to the pointer after calling cudaGLMapBufferObject()?

Without touching the VBO: ~200fps
Changing the values from the CPU: ~100fps
Accessing VBO from CUDA via cudaGLMapBufferObject(): ~100fps
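
Converted to frame times, the figures above read as roughly 5 ms per frame with the VBO untouched and roughly 10 ms with either update path, so both paths add about 5 ms per frame. A small sketch of that arithmetic, using the approximate rates above as inputs (the helper names are made up for illustration):

```cpp
#include <cassert>

// Convert a frame rate in frames per second to milliseconds per frame.
double frame_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Extra milliseconds per frame that an update step adds over a baseline.
double update_cost_ms(double baseline_fps, double with_update_fps) {
    return frame_ms(with_update_fps) - frame_ms(baseline_fps);
}
```

With these helpers, `update_cost_ms(200.0, 100.0)` gives the per-frame cost of the update for both the CPU path and the CUDA path, since both measured about 100 fps.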

Also FYI:
Available Cuda Devices (2):
CUDA DEVICE: 00
DEVICE NAME: GeForce GTX 460
DEVICE CLOCK: 1401000kHz
MAX BLOCK DIM: 65535x65535x65535
MAX THREAD DIM: 1024x1024x64

CUDA DEVICE: 01
DEVICE NAME: GeForce GTX 460
DEVICE CLOCK: 1401000kHz
MAX BLOCK DIM: 65535x65535x65535
MAX THREAD DIM: 1024x1024x64