I have a kernel that reaches 79% of peak floating-point throughput on GF100/GF110 (91% of peak instruction throughput once loads/stores and synchronization are counted). I have just started testing it on a GTX 560 (GF114), and I was disappointed to find that the fraction of peak flops has dropped to 57%. Naively, I expected it to go up, since GF114 is a superscalar architecture and can overlap execution in the load/store pipeline and the floating-point pipeline.
My kernel uses all 63 available registers per thread, which gives 33% occupancy with 8 thread blocks per SM. This works well on GF100/GF110 but not nearly as well on GF114. Does GF114 have stricter occupancy requirements than GF100/GF110 for hiding latency?
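
For anyone who wants to check the occupancy numbers, here is a rough host-side sketch of the arithmetic in C++. It assumes the per-SM limits for compute capability 2.x (32768 registers, 48 resident warps, 8 resident blocks), which as far as I know are the same on GF100/GF110 and GF114. The block size of 64 threads is a hypothetical value that is not stated above; it is simply the size consistent with 8 blocks per SM at 33% occupancy. The names `SmLimits`, `kFermi` and `estimate_occupancy` are made up for illustration, and the sketch ignores register allocation granularity and shared memory.

```cpp
#include <algorithm>

// Per-SM resource limits. The values in kFermi are the compute capability 2.x
// limits as I understand them (assumption: identical on GF100/GF110 and GF114).
struct SmLimits {
    int registers;   // 32-bit registers per SM
    int max_warps;   // maximum resident warps per SM
    int max_blocks;  // maximum resident thread blocks per SM
    int warp_size;   // threads per warp
};

constexpr SmLimits kFermi{32768, 48, 8, 32};

struct Occupancy {
    int blocks;       // resident blocks per SM
    int warps;        // resident warps per SM
    double fraction;  // resident warps / maximum resident warps
};

// Simplified estimate: ignores register allocation granularity and
// shared-memory limits, so treat the result as approximate.
constexpr Occupancy estimate_occupancy(const SmLimits& sm,
                                       int regs_per_thread,
                                       int threads_per_block) {
    // Round the block size up to a whole number of warps.
    int warps_per_block = (threads_per_block + sm.warp_size - 1) / sm.warp_size;
    int regs_per_block = regs_per_thread * sm.warp_size * warps_per_block;

    // Each resource imposes its own cap on resident blocks; the smallest wins.
    int by_registers = sm.registers / regs_per_block;
    int by_warps = sm.max_warps / warps_per_block;
    int blocks = std::min({by_registers, by_warps, sm.max_blocks});

    int warps = blocks * warps_per_block;
    return Occupancy{blocks, warps, static_cast<double>(warps) / sm.max_warps};
}

// Hypothetical configuration: 63 registers per thread, 64 threads per block.
constexpr Occupancy kMyKernel = estimate_occupancy(kFermi, 63, 64);
static_assert(kMyKernel.blocks == 8, "8 resident blocks per SM");
static_assert(kMyKernel.warps == 16, "16 of 48 warps, i.e. 33% occupancy");
```

Under these assumptions both the register file and the 8-block limit cap the SM at 8 resident blocks, so only 16 warps are available to hide latency. That count would be the same on both chips, which is why I am asking whether GF114 needs more of them.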