OpenGL interoperability latency: why is VBO slower for small meshes?

I’m comparing the simpleGL project using a VBO (Vertex Buffer Object) against a Vertex Array (VA) version, which copies the mesh from host to device, calculates the new mesh, and copies it back from device to host. With a big mesh I see great timings for the VBO. The problem is that the VA is faster when the mesh is small: with 256 x 256 vertices, the VA takes less than 1 ms and the VBO takes about 16 ms. The extra time seems to come from cudaGLMapBufferObject() and cudaGLUnmapBufferObject(). Does anybody have an idea why? Thanks.
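
For reference, here is a minimal sketch of how each stage (map, kernel, unmap) could be timed separately on the host, to confirm where the 16 ms goes. The helper name `time_stage_ms` is hypothetical and not part of simpleGL; it only measures wall-clock time around whatever callable is passed in.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from simpleGL): runs `stage` `iterations` times
// and returns the mean wall-clock time per call in milliseconds.
template <typename Stage>
double time_stage_ms(Stage stage, int iterations)
{
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;

    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        // In the real test this would wrap one stage only, e.g. the
        // cudaGLMapBufferObject() call, the kernel launch, or the
        // cudaGLUnmapBufferObject() call.
        stage();
    }
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;

    return elapsed.count() / iterations;
}
```

Each stage would be passed as a lambda, for example one that calls cudaGLMapBufferObject() on the registered buffer. Since kernel launches are asynchronous, the lambda for the kernel stage should end with a device synchronization call, otherwise the host clock stops before the GPU work has finished and the cost shows up in the next stage instead.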