I guess this is one of the most popular topics on this forum: “I have code running on my GPU, how do I make it run faster?”.
So here I am, with my piece of code, and I want to make it run faster. I have tweaked the code for two weeks now, tested at least 10 different configurations, and my current code runs ~2.5 times faster than the initial version.
I use templates, pointer increments, shared memory, fully coalesced memory access, streams, and some unrolling. I have tried every trick I could find out there, but this is my first large CUDA code (more than 1k lines in ~10 files), so I’m still on the learning curve.
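To give an idea of the access patterns involved, here is a simplified sketch of the kind of kernel structure I mean (this is not my actual code; `tileTranspose` and `TILE` are made-up names, and the classic tiled transpose just illustrates the shared-memory + coalescing pattern):

```cuda
#define TILE 16

// Tiled transpose: stage a TILE x TILE block in shared memory so that
// both the global read and the global write are coalesced.
__global__ void tileTranspose(float *out, const float *in,
                              int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];  // +1 padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced read

    __syncthreads();

    // Swap the block indices so the write is also coalesced.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;

    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}
```

My real kernels are templated versions of this style of thing, with pointer increments instead of recomputed indices where possible.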
To make this post less boring, my question comes with a twist: my code runs on a 9600GT GPU, doing the whole processing in 0.15 seconds. When I run it on a Tesla T10, it takes 0.13 seconds. How come?
The Tesla GPU has ~2.5 times the computing power of the 9600GT, yet I see almost no speedup.
Looking into the profiler output, things become more interesting. 55% of the total computation time of my code is spent in one single function, called 8 times; the rest is spread across multiple other functions.
I have attached to this post screenshots of the profiler output of this particular function when running on each GPU model.
As you can see, on the 9600GT the kernel runtime is bound by compute: 80% of the kernel time is spent computing, which seems reasonable to me. When moving to the Tesla, I would then expect a considerable speedup, since it provides more computing power.
However, when running on the Tesla T10 the relation changes and the kernel becomes memory bound. What took less than 5% of the time on the 9600GT takes 60% on the Tesla T10.
Even worse: as written, the code performs as many reads as writes in the mentioned kernel. Yet on the Tesla T10 the profiler reports 5% of the time reading, but 60% of the time writing.
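One cross-check I can run, if it helps, is timing the kernel directly with CUDA events instead of relying on the profiler's counters (a sketch; `myKernel`, `grid`, `block`, and the arguments are placeholders for my actual launch):

```cuda
#include <cstdio>

// Time one kernel launch with CUDA events, independently of the profiler.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(d_out, d_in, n);   // placeholder launch
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                  // wait until the kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```

Comparing this wall-clock number against the profiler's breakdown on each GPU would at least tell me whether the profiler totals are trustworthy.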
So my question is: how can this be? Am I using the profiler incorrectly? What could explain these numbers?
Thank you very much for your answers and hints. If you need any more details or specific benchmarks, I will provide them.