Hey guys, I have a question regarding the performance of my code.
Initially I created something like this:
1. create GL texture
in a loop:
{
2. cudaMemcpy the cached texture data from the CPU to the GPU
3. process the buffer in a kernel on the GPU
4. cudaMemcpy the pixels back to the host
5. glTexSubImage the buffer to update the texture
}
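For reference, one iteration of this copy-based loop could look roughly like the sketch below. The kernel `processPixels`, the function `updateTextureByCopy`, the RGBA8 pixel format, and the use of `glTexSubImage2D` are assumptions made for illustration, not the actual code from my project.

```cuda
#include <cuda_runtime.h>
#include <GL/gl.h>   // on Windows, <windows.h> has to be included first

// Hypothetical per-pixel kernel; it stands in for the real processing.
__global__ void processPixels(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p = pixels[y * width + x];
    pixels[y * width + x] = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
}

// One iteration of the loop (steps 2-5).
// Assumptions: a GL context is current, `tex` is an RGBA8 GL_TEXTURE_2D of
// width x height, `hostPixels` holds width*height pixels, and `devPixels`
// is a cudaMalloc'd buffer of the same size.
void updateTextureByCopy(GLuint tex, uchar4* hostPixels, uchar4* devPixels,
                         int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(uchar4);

    // Step 2: upload the cached pixels.
    cudaMemcpy(devPixels, hostPixels, bytes, cudaMemcpyHostToDevice);

    // Step 3: process them on the GPU.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    processPixels<<<grid, block>>>(devPixels, width, height);

    // Step 4: download the result (this copy waits for the kernel to finish).
    cudaMemcpy(hostPixels, devPixels, bytes, cudaMemcpyDeviceToHost);

    // Step 5: push the pixels into the GL texture.
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, hostPixels);
}
```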
Then I tried using surfaces to avoid copying the buffers back and forth:
1. create GL texture
2. use CUDA GL interop to map the texture data and create surfaces
in a loop:
{
3. map the buffer to CUDA
4. run the kernel
5. unmap
}
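A minimal sketch of the interop variant, under some assumptions: the texture is an RGBA8 `GL_TEXTURE_2D`, a GL context is current, and `LaunchFn`, `registerTexture`, and `updateTextureInPlace` are hypothetical names. Note that this sketch creates the surface object after every map instead of once up front, because as far as I understand the CUDA docs the mapped array is allowed to change between maps; that may differ from the setup described above.

```cuda
#include <cuda_runtime.h>
#include <GL/gl.h>            // on Windows, <windows.h> has to be included first
#include <cuda_gl_interop.h>

// Hypothetical callback type: launches the processing kernel on a surface.
typedef void (*LaunchFn)(cudaSurfaceObject_t surf, int width, int height);

// One-time setup (step 2): register the GL texture with CUDA so that it
// can be bound as a surface for reads and writes.
cudaGraphicsResource_t registerTexture(GLuint tex)
{
    cudaGraphicsResource_t res = nullptr;
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsSurfaceLoadStore);
    return res;
}

// Per-iteration work (steps 3-5).
void updateTextureInPlace(cudaGraphicsResource_t res, LaunchFn launch,
                          int width, int height)
{
    // Step 3: map the resource and fetch the CUDA array behind mip level 0.
    cudaGraphicsMapResources(1, &res, 0);
    cudaArray_t array = nullptr;
    cudaGraphicsSubResourceGetMappedArray(&array, res, 0, 0);

    // Wrap the array in a surface object for surf2Dread / surf2Dwrite.
    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &desc);

    // Step 4: run the kernel through the caller-supplied launcher.
    launch(surf, width, height);

    // Step 5: release the surface object and hand the texture back to GL.
    cudaDestroySurfaceObject(surf);
    cudaGraphicsUnmapResources(1, &res, 0);
}
```

Splitting it like this makes it easy to time the map, kernel, and unmap phases separately, which is probably the first thing to check when hunting for the slowdown.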
I was surprised to see that the second solution is 7 times slower…
Can it really be slower, or am I doing something wrong?
The kernel is the same, except that instead of writing to the buffers directly I use surf2Dread & surf2Dwrite.
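
To show what the surface version of the kernel looks like, here is a small standalone sketch that runs without OpenGL: it uses a plain CUDA array allocated with `cudaArraySurfaceLoadStore` in place of the mapped texture. The kernel `invertSurface`, the helper `runInvertDemo`, and the `uchar4` pixel format are assumptions for illustration. One detail that is easy to miss: the x coordinate passed to `surf2Dread` and `surf2Dwrite` is a byte offset, not a pixel index.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: inverts RGB in place through a surface object.
__global__ void invertSurface(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p;
    surf2Dread(&p, surf, x * (int)sizeof(uchar4), y);   // x is in bytes
    p = make_uchar4(255 - p.x, 255 - p.y, 255 - p.z, p.w);
    surf2Dwrite(p, surf, x * (int)sizeof(uchar4), y);
}

// Runs the kernel on a plain CUDA array (no GL involved) and returns true
// if every pixel came back inverted.
bool runInvertDemo(int width, int height)
{
    std::vector<uchar4> host((size_t)width * height, make_uchar4(10, 20, 30, 255));
    size_t rowBytes = (size_t)width * sizeof(uchar4);

    // Allocate an array that may be bound as a surface.
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<uchar4>();
    cudaArray_t array = nullptr;
    if (cudaMallocArray(&array, &fmt, width, height,
                        cudaArraySurfaceLoadStore) != cudaSuccess) return false;
    cudaMemcpy2DToArray(array, 0, 0, host.data(), rowBytes, rowBytes, height,
                        cudaMemcpyHostToDevice);

    cudaResourceDesc desc = {};
    desc.resType = cudaResourceTypeArray;
    desc.res.array.array = array;
    cudaSurfaceObject_t surf = 0;
    cudaCreateSurfaceObject(&surf, &desc);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    invertSurface<<<grid, block>>>(surf, width, height);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    // Copy the result back and verify it on the host.
    cudaMemcpy2DFromArray(host.data(), rowBytes, array, 0, 0, rowBytes, height,
                          cudaMemcpyDeviceToHost);
    for (const uchar4& p : host)
        ok = ok && p.x == 245 && p.y == 235 && p.z == 225 && p.w == 255;

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(array);
    return ok;
}
```

Calling `runInvertDemo(64, 48)` from `main` and timing the kernel alone would show whether the surface reads and writes themselves are slow, or whether the cost sits in the map/unmap calls around them.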