Hi, is there some way to check whether my CUDA app could still be made faster?
Is there a tool for this, such as a profiler?
I have read the CUDA programming guide, and it says a lot about memory conflicts,
memory latency and so on.
For example, I could write a kernel and not notice that I am paying around 600 cycles
of memory latency because I used a bad memory access pattern (global, shared, etc.).
There is a profiler available; just look in the announcements section.
You can also do some rough calculations: work out how many GB your kernel reads from and writes to device global memory, divide by the kernel's run time, and see whether the result comes close to the card's maximum bandwidth.
If it does not, either your kernel spends a long time on computation (check the performance section of the programming guide for ways to optimize this) or some accesses are not coalesced (the profiler will tell you the latter).
Check whether all your transfers within the device are needed, or whether you could save some of them by using shared, texture, or constant memory.
Check whether you have reduced device-host memory transfers to a minimum.
In short: find the bottleneck, remove it, then start searching again. ;-)
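The rough bandwidth check described above can be sketched in a few lines of host code. This is only an illustration, not part of any CUDA API: the function names `effective_bandwidth_gbs` and `fraction_of_peak` are made up for this example, the elapsed time is assumed to come from the profiler or from your own timing around the kernel launch, and the peak bandwidth is whatever the spec sheet of your card states.

```cpp
#include <cassert>

// Effective global-memory bandwidth in GB/s (1 GB = 1e9 bytes).
// bytes_read / bytes_written: how much the kernel moves from / to global memory.
// elapsed_ms: measured kernel run time in milliseconds (assumed to be measured elsewhere).
double effective_bandwidth_gbs(double bytes_read, double bytes_written, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1.0e-3;
    return (bytes_read + bytes_written) / seconds / 1.0e9;
}

// Ratio of achieved bandwidth to the card's theoretical peak (both in GB/s).
// A value far below 1.0 hints at long computation or uncoalesced accesses.
double fraction_of_peak(double achieved_gbs, double peak_gbs) {
    assert(peak_gbs > 0.0);
    return achieved_gbs / peak_gbs;
}

// Hypothetical usage: a kernel that copies 16M floats (reads and writes each once)
// and was measured at 2.0 ms on a card whose spec sheet lists 86.4 GB/s:
//   double bw   = effective_bandwidth_gbs(16.0 * 1024 * 1024 * sizeof(float),
//                                         16.0 * 1024 * 1024 * sizeof(float), 2.0);
//   double frac = fraction_of_peak(bw, 86.4);
```

If the fraction is low and the kernel does little arithmetic per element, the memory access pattern is the first thing to look at.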
The CUDA occupancy calculator and the visual profiler are both available for free, either here in the forums or at the CUDA Zone at nvidia.com/cuda.
Also take a look at the tutorial section of the CUDA Zone; there you’ll find some nice examples of how to optimize a kernel.
On XP, the profiler will tell you the number of events such as bank conflicts or divergent branches. On pre-G200 hardware it will also tell you the number of uncoalesced accesses, which are probably the most important cause of slowdown.
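To get a feeling for why uncoalesced accesses hurt so much, here is a deliberately simplified host-side model. It only counts how many distinct aligned memory segments a group of threads touches when thread t reads element t * stride. It does not reproduce the exact coalescing rules of any particular GPU generation (those are in the programming guide). The function name `segments_touched`, the group size of 16 threads and the 64-byte segment size used in the comments are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <set>

// Count the distinct aligned segments touched when thread t (0 <= t < threads)
// accesses the element at index t * stride, each element being elem_bytes wide.
// Simplified model: fewer segments means fewer memory transactions.
std::size_t segments_touched(std::size_t threads, std::size_t stride,
                             std::size_t elem_bytes, std::size_t segment_bytes) {
    std::set<std::size_t> segments;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t byte_offset = t * stride * elem_bytes;
        segments.insert(byte_offset / segment_bytes);  // index of the aligned segment
    }
    return segments.size();
}

// Hypothetical usage with 16 threads reading 4-byte floats and 64-byte segments:
//   segments_touched(16, 1, 4, 64)   // neighbouring threads read neighbouring floats
//   segments_touched(16, 16, 4, 64)  // every thread lands in its own segment
```

In this model a stride of 1 keeps all 16 reads inside a single segment, while a stride of 16 spreads them over 16 segments, which is the kind of pattern the profiler's uncoalesced counters are meant to expose.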