Greetings,
I have a kernel that reaches 79% of floating point peak on GF100/GF110 (91% of instruction peak once load/store and synchronization instructions are included). I have just started testing the same kernel on a GTX 560 and was disappointed to find that the fraction of peak flops drops to 57%. Naively, I expected it to go up, since GF114 is a superscalar architecture and can overlap execution in the load/store pipeline with the floating point pipeline.
My kernel uses all 63 available registers per thread, which gives 33% occupancy with 8 thread blocks per SM. This works well on GF100/GF110 but not on GF114. Does GF114 need higher occupancy than GF100/GF110 to hide latency, or does it have other restrictions I should be aware of?
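The 33% figure can be reproduced with a quick back-of-the-envelope calculation. The sketch below is only illustrative: the per-SM limits are the published ones for compute capability 2.x, the 64-register per-warp allocation granularity is my understanding of how Fermi allocates registers, and the block size of 64 threads is an assumption inferred from the numbers above (it is not stated in the post). Shared memory limits are ignored.

```python
# Per-SM limits for Fermi (compute capability 2.0 / 2.1).
REGISTERS_PER_SM = 32768
MAX_WARPS_PER_SM = 48
MAX_BLOCKS_PER_SM = 8
WARP_SIZE = 32
# Assumption: registers are allocated per warp, rounded up to a multiple of 64.
REG_ALLOC_UNIT = 64


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def warps_limited_by_registers(regs_per_thread):
    """Maximum resident warps per SM if registers were the only limit."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    regs_per_warp = ceil_div(regs_per_warp, REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    return REGISTERS_PER_SM // regs_per_warp


def occupancy(regs_per_thread, threads_per_block):
    """Return (resident blocks, resident warps, occupancy fraction) per SM.

    Only register, warp-count, and block-count limits are considered.
    """
    warps_per_block = ceil_div(threads_per_block, WARP_SIZE)
    blocks_by_regs = warps_limited_by_registers(regs_per_thread) // warps_per_block
    blocks_by_warps = MAX_WARPS_PER_SM // warps_per_block
    blocks = min(MAX_BLOCKS_PER_SM, blocks_by_regs, blocks_by_warps)
    active_warps = blocks * warps_per_block
    return blocks, active_warps, active_warps / MAX_WARPS_PER_SM


if __name__ == "__main__":
    # 63 registers per thread, hypothetical block size of 64 threads.
    blocks, warps, occ = occupancy(63, 64)
    print(f"blocks/SM={blocks} warps/SM={warps} occupancy={occ:.1%}")
```

Under these assumptions, 63 registers per thread leaves room for 16 resident warps out of a possible 48, which matches the 8 blocks per SM and 33% occupancy quoted above.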
For those who are interested, the paper describing this work is here and the source code is here.
Thanks.