OK so I’ve got an algorithm that I’m working on, and I’ve previously implemented it using OpenGL/GLSL. The basic idea is this.
I have two off-screen buffers: one read-only and one write-only. I take in 5 floats as input from the user and, along with a (constant) input texture, do some calculations involving a distance computation, some sin and cos lookups, etc. I then sum that result with the old value from my input buffer (a texture) and return it from the fragment shader, which is attached to the write buffer, so that’s where it gets stored. Then I swap the input/output buffers and calculate a new frame.
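For reference, the CUDA version of this loop looks roughly like the sketch below. The kernel name, parameter names, and the math inside are stand-ins for my real code, not the exact thing:

```cuda
// Hypothetical sketch of the ping-pong accumulation step. The distance /
// sin / cos math here is a placeholder for the real per-pixel calculation.
__global__ void accumulate(const float* in, float* out,
                           int width, int height,
                           float p0, float p1, float p2, float p3, float p4)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;

    // distance calculation plus sin/cos terms (stand-in math)
    float dx = x - p0, dy = y - p1;
    float d  = sqrtf(dx * dx + dy * dy);
    float v  = sinf(d * p2) + cosf(d * p3) * p4;

    // sum with the previous frame's value and store in the write buffer
    out[idx] = in[idx] + v;
}

// Host side, each frame: launch, then swap the two device pointers.
//   accumulate<<<grid, block>>>(bufA, bufB, w, h, p0, p1, p2, p3, p4);
//   std::swap(bufA, bufB);
```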
Doing it this way in OpenGL is very, very quick, and I was pleasantly surprised at the results. However, now I’m doing the same algorithm in CUDA, and I’m much chagrined to find that it runs at less than 1/10th the speed of the OpenGL implementation, and I’d like to figure out why. I’ve been through the performance guidelines, and the only thing I can think of that would be responsible for the slowdown is that my (now single) buffer lives in global memory in CUDA, whereas it was in texture memory in OpenGL. I did this because, as far as I could tell, there is no way to write back to texture memory in CUDA, which I need to be able to do. Does anyone have any suggestions on how I might speed this up more?
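One thing I was wondering about, though I’m not sure it actually helps: since I only ever *read* from one buffer and *write* to the other, maybe I could bind the read buffer to a texture reference each frame and keep writing to plain global memory, something like the sketch below (1D indexing and names are just for illustration):

```cuda
// Hypothetical sketch: read the previous frame through a texture reference
// bound to linear device memory, write the new frame to a global buffer,
// and rebind after swapping the pointers.
texture<float, 1, cudaReadModeElementType> texIn;  // file-scope texture ref

__global__ void step(float* out, int n /*, user params */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float prev = tex1Dfetch(texIn, i);  // cached read of last frame's value
    out[i] = prev /* + newly computed term */;
}

// Host side, each frame:
//   cudaBindTexture(0, texIn, bufRead, n * sizeof(float));
//   step<<<grid, block>>>(bufWrite, n);
//   std::swap(bufRead, bufWrite);  // ping-pong; rebind on the next frame
```

Would rebinding the texture every frame like this be cheap enough, or does the bind itself cost too much to do per frame?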