I have a compute-bound kernel running on a GTX 560 Ti with its 8 SMs. Would an
upgrade to a GTX 570 with its 15 SMs see a proportional speed-up in kernel
execution, even though the GFLOPS (FMA) ratings of the two cards are roughly similar?
The kernel is unavoidably heavy on float arithmetic (__expf() for example).
This is difficult to say in general because the two cards have different SM architectures. The GTX 560 Ti (compute capability 2.1) depends on instruction-level parallelism to feed its 48 "cores" (FPUs) per SM from just 32 threads per scheduled warp, while the GTX 570 (compute capability 2.0) has only 32 cores/FPUs per SM and thus fares better when only thread-level parallelism is available. So, all other things being equal, a compute capability 2.0 device runs CUDA code anywhere between 0% and 50% faster than a compute capability 2.1 device with the same number of cores. It really depends on the specific code.
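To make the ILP point concrete, here is a hedged sketch (hypothetical kernel and names, not from the question) of how a thread can expose instruction-level parallelism by carrying several independent FMA chains. On a CC 2.1 SM this extra per-thread parallelism helps keep the 48 FPUs busy; on a CC 2.0 SM it is simply harmless:

```cuda
// Hypothetical illustration of instruction-level parallelism (ILP).
// Each thread maintains four INDEPENDENT accumulator chains, so the
// warp scheduler can issue FMAs from the same thread back-to-back
// instead of stalling on a single dependent chain.
__global__ void ilp_demo(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        // No chain depends on another chain's result, so the four
        // fmaf() calls per iteration can be issued independently.
        float a0 = 1.0f, a1 = 1.0f, a2 = 1.0f, a3 = 1.0f;
        for (int k = 0; k < 16; ++k) {
            a0 = fmaf(a0, x, 0.5f);
            a1 = fmaf(a1, x, 0.25f);
            a2 = fmaf(a2, x, 0.125f);
            a3 = fmaf(a3, x, 0.0625f);
        }
        out[i] = a0 + a1 + a2 + a3;
    }
}
```

A kernel written as one long dependent chain per thread would tend to favor the CC 2.0 part; one with independent chains like the above narrows the gap on CC 2.1.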
If you want to try out your code on a compute capability 2.0 device, you might use a GPU instance on Amazon’s EC2.