I have just moved up from an 8800 GTX to a GTX 280, and besides adjusting the execution configuration, I was wondering what other parameters I should tune to take advantage of the new hardware. My application is fairly register-intensive (unavoidably) and also requires a host->device transfer of a certain amount of data between executions. I was therefore hoping that the new hardware would do wonders because of its higher register count and memory bandwidth, but in fact I see only about a 20% improvement, which is hardly stunning.
Well, there are the new coalescing rules. Depending on your application’s memory access pattern, you could make use of those. The new warp voting feature could also save you some cycles if you make any decisions per warp, and if you have a divergent section in your code, warp voting could be used to break the divergence.
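To make the warp voting idea concrete, here is a minimal sketch, assuming a made-up kernel (`refine`) in which a whole warp can skip a section when none of its threads needs it. The kernel, the `warps_skipped` counter and the `vote_demo` helper are all hypothetical names for illustration, not anything from the poster's application. It uses the current `__any_sync(mask, pred)` spelling; on the compute capability 1.2/1.3 toolkits of the GTX 280 era the intrinsic was written `__any(pred)` (with `__all(pred)` as its counterpart).

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a warp skips the "expensive" section entirely when no
// lane in it has a value above the tolerance, so the branch is warp-uniform.
__global__ void refine(float *data, int n, float tol, int *warps_skipped)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    bool in_range = i < n;
    float v = in_range ? data[i] : 0.0f;
    int needs_work = in_range && (fabsf(v) > tol);

    // Warp-wide vote: non-zero if the predicate holds for at least one lane.
    // All 32 lanes reach this call, so the full mask is valid
    // (assumes blockDim.x is a multiple of 32).
    if (!__any_sync(0xffffffffu, needs_work)) {
        if ((threadIdx.x & 31) == 0) atomicAdd(warps_skipped, 1);
        return;                       // every lane leaves together: no divergence
    }

    if (needs_work) data[i] = v * 0.5f;   // stand-in for the expensive section
}

// Host-side driver (hypothetical). Returns how many warps took the early exit,
// or -1 if the result could not be read back.
int vote_demo()
{
    const int n = 256;
    float h[n];
    for (int i = 0; i < n; ++i) h[i] = (i < n / 2) ? 1.0f : 0.0f;

    float *d = 0;
    int *d_skipped = 0;
    int skipped = -1;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d_skipped, sizeof(int)) != cudaSuccess) {
        cudaFree(d);
        return -1;
    }
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_skipped, 0, sizeof(int));

    refine<<<n / 128, 128>>>(d, n, 0.5f, d_skipped);

    // Blocking copy: also waits for the kernel to finish.
    if (cudaMemcpy(&skipped, d_skipped, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess)
        skipped = -1;

    cudaFree(d);
    cudaFree(d_skipped);
    return skipped;
}
```

With this input, the warps covering the zero-filled second half of the array take the early exit as a unit, while the others run the section. Whether this saves anything in a real kernel depends on how often whole warps can actually skip.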
Since you went from an 8800 GTX to a GTX 280, you can also interleave host->device copies with kernel executions using the Stream API. Given that you mention copying and then executing, this could potentially help you out a lot (assuming that the copy for one step does not have to wait for the previous kernel to finish, so that the two can actually overlap).
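A rough sketch of what interleaving copies and kernels through the Stream API could look like, assuming the work can be split into independent chunks. The `scale` kernel, the chunk sizes and the `overlap_demo` helper are placeholders invented for illustration. Two details matter: the host buffer has to be page-locked (`cudaMallocHost`) for `cudaMemcpyAsync` to overlap with kernel execution, and the final `cudaDeviceSynchronize` is the current name for what toolkits of that era called `cudaThreadSynchronize`.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel (not from the thread): scales one chunk in place.
__global__ void scale(float *x, int n, float a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical driver. Returns the number of wrong results (0 on success),
// or -1 on a CUDA error.
int overlap_demo()
{
    const int kChunks = 4;
    const int kN = 1 << 18;                    // elements per chunk
    const size_t bytes = kN * sizeof(float);

    float *h = 0, *d = 0;
    // Page-locked host memory: required for copies to overlap with kernels.
    if (cudaMallocHost((void **)&h, kChunks * bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&d, kChunks * bytes) != cudaSuccess) {
        cudaFreeHost(h);
        return -1;
    }
    for (int i = 0; i < kChunks * kN; ++i) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // Work queued in one stream runs in order; work in different streams may
    // overlap, so chunk c+1 can be uploading while chunk c is being computed.
    for (int c = 0; c < kChunks; ++c) {
        cudaStream_t st = s[c % 2];
        float *hc = h + (size_t)c * kN;
        float *dc = d + (size_t)c * kN;
        cudaMemcpyAsync(dc, hc, bytes, cudaMemcpyHostToDevice, st);
        scale<<<(kN + 255) / 256, 256, 0, st>>>(dc, kN, 2.0f);
        cudaMemcpyAsync(hc, dc, bytes, cudaMemcpyDeviceToHost, st);
    }
    cudaError_t err = cudaDeviceSynchronize();   // wait for both streams

    int bad = 0;
    for (int i = 0; i < kChunks * kN; ++i) bad += (h[i] != 2.0f);

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d);
    cudaFreeHost(h);
    return (err == cudaSuccess) ? bad : -1;
}
```

Each chunk gets its own region of the device buffer, so nothing in one stream touches data that the other stream is using. How much time this hides depends on the ratio of transfer time to kernel time, so it is worth profiling before restructuring the application around it.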
You will find the information about the new coalescing rules on page 54 of the programming guide.
They also say “compute capability 1.2 and higher”. I wonder if that means cc 1.2 is only that change in the coalescing rules. That would mean we could indeed see a version of the GTX 200 without the double-precision FPUs (i.e. without what cc 1.3 adds).