there is a profiler available, just look in the announcements section.
you can also do some rough calculations, i.e. calculate how many GB your kernel transfers to and from device global memory, divide that by the kernel's run time, and see whether the result comes close to the max. bandwidth of the card.
if not, you either have long computation times (check the performance guide of the dev. manual for ways to optimize this) or something is not coalesced. (the profiler will tell you the latter)
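a rough sketch of that calculation in python, in case it helps. the buffer size, the 2.0 ms timing and the 86.4 GB/s peak are made-up example numbers, not measurements; plug in your own kernel's byte counts, its measured run time and the peak bandwidth from your card's spec sheet. the function names are just for illustration.

```python
def effective_bandwidth_gbs(bytes_read, bytes_written, elapsed_ms):
    """Effective global-memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    if elapsed_ms <= 0:
        raise ValueError("elapsed_ms must be positive")
    total_gb = (bytes_read + bytes_written) / 1e9
    return total_gb / (elapsed_ms / 1e3)


def fraction_of_peak(effective_gbs, peak_gbs):
    """How close the kernel gets to the card's theoretical peak (0..1)."""
    return effective_gbs / peak_gbs


# Hypothetical example: a kernel that reads 2**24 floats and writes 2**24 floats.
n = 1 << 24
bytes_each_way = n * 4      # 4 bytes per float
elapsed_ms = 2.0            # assumed kernel time, e.g. taken from the profiler
peak_gbs = 86.4             # assumed peak; look up the real value for your card

bw = effective_bandwidth_gbs(bytes_each_way, bytes_each_way, elapsed_ms)
frac = fraction_of_peak(bw, peak_gbs)
print(f"effective bandwidth: {bw:.1f} GB/s ({frac:.0%} of assumed peak)")
```

if the fraction comes out far below 1, the kernel is probably not limited by memory bandwidth alone, so look at computation time or uncoalesced accesses as described above.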
check whether all your transfers within the device are needed, or whether you could save some of them by using shared/texture/constant memory.
check whether you have reduced the device-host memory transfers to a minimum.
in short: find the bottleneck, remove it, then start searching again. ;-)
the cuda occupancy calculator and the visual profiler are both available for free, either here in the forums or at the cudaZone at nvidia.com/cuda.
also take a look at the tutorial section of the cudaZone, there you’ll find some nice examples of how to optimize a kernel.
On XP, the profiler will tell you the number of conflicts, such as bank conflicts or divergence. On pre-G200 hardware it will also tell you the number of uncoalesced accesses, which is probably the most important cause of slowdowns.