I have a compute-bound kernel running on a GTX 560 Ti with its 8 SMs. Would an upgrade to a GTX 570 with its 15 SMs give a proportionate speed-up in kernel execution time, even though the GFLOPS (FMA) ratings of the two cards are roughly similar?
The kernel is unavoidably heavy on float arithmetic (__expf(), for example).
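For illustration, a hypothetical kernel with the shape described above might look like the following. The kernel body, constants, and iteration count are invented; the point is just that a single load/store pair surrounded by a long chain of __expf()/FMA work makes arithmetic throughput, not memory bandwidth, the bottleneck:

```
#include <cuda_runtime.h>

// Hypothetical stand-in for the kind of kernel described above:
// little memory traffic, many float operations per element,
// dominated by __expf() and FMA work.
__global__ void expHeavyKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        // A long chain of arithmetic keeps the FPUs busy relative
        // to the single load/store pair, so the kernel is compute bound.
        #pragma unroll
        for (int k = 0; k < 32; ++k)
            x = __expf(x * 0.5f) - x * 0.99f;
        out[i] = x;
    }
}
```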
This is difficult to say in general because of the different architectures. The GTX 560 Ti (compute capability 2.1) depends on instruction-level parallelism to feed the 48 “cores” (FPUs) in each SM from just 32 threads per warp, while the GTX 570 (compute capability 2.0) has only 32 cores/FPUs per SM and thus does better when only thread-level parallelism is available. So, all other things being equal, a compute capability 2.0 device runs CUDA code anywhere between 0% and 50% faster than a compute capability 2.1 device with the same total number of cores. It really depends on the specific code; see the sketch below.
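To make the instruction-level-parallelism point concrete, here is a sketch of the same per-thread work written two ways. The function names, loop counts, and constants are invented for illustration; the idea is only that independent dependency chains give a CC 2.1 scheduler instructions it can issue in parallel:

```
// Illustrative only: two versions of the same per-thread work.
// On a CC 2.1 part, parallelChains() gives the scheduler independent
// instructions to pair, helping it keep more of the 48 FPUs per SM
// busy; on CC 2.0 the two versions should behave much more alike.

__device__ float serialChain(float x)
{
    // Each __expf depends on the previous result: no ILP.
    for (int k = 0; k < 16; ++k)
        x = __expf(x) * 0.5f;
    return x;
}

__device__ float parallelChains(float x)
{
    // Two independent dependency chains: the compiler can interleave
    // them, exposing instruction-level parallelism within one thread.
    float a = x, b = x + 1.0f;
    for (int k = 0; k < 8; ++k) {
        a = __expf(a) * 0.5f;
        b = __expf(b) * 0.5f;
    }
    return a + b;
}

__global__ void demo(const float *in, float *out, int n, bool ilp)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = ilp ? parallelChains(in[i]) : serialChain(in[i]);
}
```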
If you want to try out your code on a compute capability 2.0 device, you might use a GPU instance on Amazon’s EC2.
I have a compute capability 2.0 device (a GTX 460) with the same number of SMs as my 560. Its speed is slower by approximately the ratio of the GFLOPS (FMA) values for the 460 and the 560.
I therefore suspect the different compute capabilities won't make any difference here, as the kernel is tied up in the floating-point units.
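For what it's worth, the theoretical GFLOPS (FMA) figure being compared here can be estimated from the device properties. Below is a minimal sketch assuming Fermi-era parts only (32 CUDA cores per SM on CC 2.0, 48 on CC 2.1) and counting one FMA as two floating-point operations; the SM count and shader clock come from cudaGetDeviceProperties:

```
#include <cstdio>
#include <cuda_runtime.h>

// Rough sanity check of the theoretical FMA GFLOPS of device 0.
// Assumes a Fermi GPU: 32 cores per SM on CC 2.0, 48 on CC 2.1.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int coresPerSM = (prop.major == 2 && prop.minor == 1) ? 48 : 32;
    // clockRate is reported in kHz; convert to GHz for GFLOPS.
    double ghz = prop.clockRate / 1.0e6;
    // One FMA per core per clock = 2 floating-point operations.
    double gflops = prop.multiProcessorCount * coresPerSM * 2.0 * ghz;

    printf("%s: CC %d.%d, %d SMs, ~%.0f GFLOPS (FMA)\n",
           prop.name, prop.major, prop.minor,
           prop.multiProcessorCount, gflops);
    return 0;
}
```

Run on both cards, the two printed numbers give the ratio referred to above.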