OpenGL & CUDA interop with surfaces slow...

Hey guys, I have a question regarding the performance of my code.
Initially I created something like this:

1. create GL texture

in a loop:
{
2. cudaMemcpy cached texture data from CPU to GPU
3. process the buffer in the kernel on the GPU
4. cudaMemcpy the pixels back to the host
5. glTexSubImage the buffer to update the texture
}
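
For reference, steps 2-4 of one iteration of that loop might look roughly like the sketch below. The kernel body (a colour invert), the uchar4 pixel format and the names processPixels / processFrameViaHost are assumptions made for illustration; the real kernel is not shown in this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-pixel work: invert RGB, keep alpha.
__global__ void processPixels(uchar4* pixels, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    uchar4 p = pixels[i];
    pixels[i] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
}

// One loop iteration (steps 2-4). devBuf must hold width * height uchar4 values.
cudaError_t processFrameViaHost(const uchar4* hostIn, uchar4* hostOut,
                                uchar4* devBuf, int width, int height)
{
    const int count = width * height;
    const size_t bytes = count * sizeof(uchar4);

    // Step 2: upload the cached pixels to the GPU.
    cudaError_t err = cudaMemcpy(devBuf, hostIn, bytes, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) return err;

    // Step 3: process the buffer in place.
    const int threads = 256;
    const int blocks = (count + threads - 1) / threads;
    processPixels<<<blocks, threads>>>(devBuf, count);
    err = cudaGetLastError();
    if (err != cudaSuccess) return err;

    // Step 4: download the result; this cudaMemcpy waits for the kernel to finish.
    err = cudaMemcpy(hostOut, devBuf, bytes, cudaMemcpyDeviceToHost);

    // Step 5 would follow on the GL side: glTexSubImage2D(...) with hostOut.
    return err;
}
```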

Then I tried to use surfaces to avoid copying the buffers back and forth:

1. create GL texture
2. use CUDA GL interop to map the texture data and create surfaces

in a loop
{
3. map the buffer to cuda
4. run the kernel
5. unmap
}

I was surprised to see that the second solution is 7 times slower.
Can it really be slower, or am I doing something wrong?
The kernel is the same, except that instead of writing to the buffers directly I use surf2Dread & surf2Dwrite.
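
A minimal sketch of what the surface-based variant could look like, for anyone comparing against their own code. It assumes an RGBA8 GL_TEXTURE_2D, uses the surface object API (the original code may use surface references instead), and the names processSurface, registerTexture, launchOnArray and processFrame are made up for this example. The kernel body is again a placeholder invert.

```cuda
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>  // also pulls in the platform's OpenGL header

// Placeholder kernel: read a pixel through the surface, invert it, write it back.
__global__ void processSurface(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    uchar4 p;
    surf2Dread(&p, surf, x * (int)sizeof(uchar4), y);   // x is a byte offset
    p = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
    surf2Dwrite(p, surf, x * (int)sizeof(uchar4), y);
}

// Done once after the GL texture exists (step 2 of the second list).
cudaError_t registerTexture(GLuint tex, cudaGraphicsResource_t* res)
{
    return cudaGraphicsGLRegisterImage(res, tex, GL_TEXTURE_2D,
                                       cudaGraphicsRegisterFlagsSurfaceLoadStore);
}

// Wrap a CUDA array in a surface object and run the kernel on it.
cudaError_t launchOnArray(cudaArray_t array, int width, int height)
{
    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    cudaError_t err = cudaCreateSurfaceObject(&surf, &desc);
    if (err != cudaSuccess) return err;

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    processSurface<<<grid, block>>>(surf, width, height);
    err = cudaGetLastError();

    // Conservative: wait for the kernel before destroying the surface object.
    cudaError_t syncErr = cudaStreamSynchronize(0);
    cudaDestroySurfaceObject(surf);
    return err != cudaSuccess ? err : syncErr;
}

// Per frame: map, fetch the array (it may change between maps), run, unmap.
cudaError_t processFrame(cudaGraphicsResource_t res, int width, int height)
{
    cudaError_t err = cudaGraphicsMapResources(1, &res, 0);
    if (err != cudaSuccess) return err;
    cudaArray_t array = nullptr;
    err = cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);
    if (err == cudaSuccess) err = launchOnArray(array, width, height);
    cudaError_t unmapErr = cudaGraphicsUnmapResources(1, &res, 0);
    return err != cudaSuccess ? err : unmapErr;
}
```

Registering the texture is the part that should stay outside the loop; only map, launch and unmap run per frame here.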

Is it really necessary to have the map/unmap steps within the loop? How many loop iterations are being run?

Have you timed the individual CUDA API calls with the visual profiler or similar tools?
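
If the profiler is not at hand, a rough alternative is to bracket each call with CUDA events. The helper below is only a sketch; elapsedMs is a made-up name, and event timing may not capture all driver-side OpenGL synchronization inside map/unmap, so treat the numbers as indicative.

```cuda
#include <cuda_runtime.h>

// Time whatever `work` enqueues or executes on `stream`, in milliseconds.
template <typename F>
float elapsedMs(F&& work, cudaStream_t stream = 0)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    work();
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);   // wait until everything up to `stop` has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Example use inside the frame loop (res is the registered graphics resource):
//   float mapMs   = elapsedMs([&] { cudaGraphicsMapResources(1, &res, 0); });
//   float unmapMs = elapsedMs([&] { cudaGraphicsUnmapResources(1, &res, 0); });
```

Timing map, kernel and unmap separately should show whether the slowdown comes from the interop calls or from the surface reads and writes in the kernel.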

When I remove map/unmap, the textures are not updated.
The kernel is called every frame in an infinite animation loop (until the app is closed).
I will use the profiler to try to understand more, thanks for the suggestion.