I wonder why the way a VBO is initialized makes such a big difference in FPS when interoperating with CUDA. When I create the VBO there are two possibilities:
The VBO only reserves memory of the given data size (in this case the particle positions are first written to the VBO inside the kernel and modified by later kernel calls):
The VBO reserves memory of the given data size and is filled with some initial data (the particle positions; of course these values are later modified in the kernel):
I believe it is because a GL buffer is allocated lazily: its memory is not actually allocated until the buffer is first accessed.
If I understand your code correctly, in the first case, where you get 408 fps, it might be CUDA that allocates the memory as soon as you map the buffer into CUDA.
In the second case, it is OpenGL that allocates the memory.
It apparently makes a difference whether OpenGL or CUDA allocates the memory. Maybe the driver has to copy the data between the two contexts? Try varying the size (e.g. the number of particles) and see whether that affects the mapping cost.
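The size experiment could be set up roughly like this. It is only a sketch of a timing harness, not your actual code: `sweep_mapping_cost`, `average_ms` and the two callbacks `setup` and `map_unmap` are hypothetical names. In a real program `setup` would presumably create the VBO with `glBufferData` (with or without initial data) and register it with `cudaGraphicsGLRegisterBuffer`, and `map_unmap` would call `cudaGraphicsMapResources` and `cudaGraphicsUnmapResources` followed by a synchronization so that the wall-clock time is meaningful.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Average wall-clock time in milliseconds of one call to `op`,
// measured over `repeats` consecutive calls.
double average_ms(const std::function<void()>& op, int repeats) {
    if (repeats <= 0) return 0.0;
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < repeats; ++i) op();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / repeats;
}

// For every particle count, (re)create the buffer via `setup(bytes)` and then
// time `map_unmap()`. Both callbacks are assumptions supplied by the caller:
//   setup(bytes): e.g. glBufferData(...) + cudaGraphicsGLRegisterBuffer(...)
//   map_unmap():  e.g. cudaGraphicsMapResources(...) +
//                      cudaGraphicsUnmapResources(...) + a synchronize call
// Returns (particle count, average milliseconds per map/unmap) pairs.
std::vector<std::pair<std::size_t, double>> sweep_mapping_cost(
    const std::vector<std::size_t>& particle_counts,
    std::size_t bytes_per_particle,
    const std::function<void(std::size_t)>& setup,
    const std::function<void()>& map_unmap,
    int repeats = 100) {
    std::vector<std::pair<std::size_t, double>> results;
    for (std::size_t n : particle_counts) {
        setup(n * bytes_per_particle);  // not included in the timing
        results.emplace_back(n, average_ms(map_unmap, repeats));
    }
    return results;
}
```

Run the sweep once for each of your two initialization variants and compare the results. If the mapping time grows with the particle count in only one of them, that would point towards a copy being made in that case. It may also be worth timing the very first map after `setup` separately, since a lazy allocation would show up there.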