Question regarding GF104/GF114: obtaining peak performance

Greetings,

I have a kernel application that reaches 79% of floating point peak on GF100/GF110 (91% of instruction peak including load/store and synchronization). I have just started to test the kernel on a GTX 560, and I have disappointingly found that the fraction of peak flops has decreased to 57%. Naively I was expecting it to go up since GF114 is a superscalar architecture and can overlap the execution of the load/store pipeline and the floating point pipeline.

My kernel uses all available 63 registers per thread, for a 33% occupancy, and 8 thread blocks per SM. This appears to work well on GF100/GF110 but doesn’t do so well on GF114. Does GF114 have additional restrictions with respect to occupancy requirements to hide latency?
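For readers checking the occupancy figure, here is a rough sketch of the arithmetic. The per-SM limits below (32K registers, 48 resident warps, 64-register allocation unit per warp) are my assumptions for compute capability 2.x parts and are not stated in the post; the block size is only inferred from the 8-blocks-per-SM figure.

```python
# Assumed per-SM limits for Fermi-class (compute capability 2.x) GPUs.
# These numbers are not given in the post itself.
REGISTERS_PER_SM = 32 * 1024
MAX_WARPS_PER_SM = 48
WARP_SIZE = 32
REG_ALLOC_UNIT = 64  # assumed per-warp register allocation granularity


def warps_limited_by_registers(regs_per_thread):
    """Resident warps per SM when register usage is the only limiter."""
    regs_per_warp = regs_per_thread * WARP_SIZE
    # Round up to the next multiple of the allocation unit.
    regs_per_warp = -(-regs_per_warp // REG_ALLOC_UNIT) * REG_ALLOC_UNIT
    return min(MAX_WARPS_PER_SM, REGISTERS_PER_SM // regs_per_warp)


resident_warps = warps_limited_by_registers(63)
occupancy = resident_warps / MAX_WARPS_PER_SM

# With 8 blocks resident per SM (from the post), the implied block size is:
threads_per_block = resident_warps * WARP_SIZE // 8

print(resident_warps, round(occupancy, 2), threads_per_block)
```

Under these assumptions the register file allows 16 resident warps, which is where the 33% comes from. The same limits apply to GF100/GF110 and GF114 as far as I know, so the occupancy itself should be identical on both chips.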

For those that are interested, the paper describing this work is here and the source code is here.

Thanks.

Remember that this architecture requires some ILP to exploit the extra set of SPs; otherwise you will theoretically reach only 66% of peak.
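To spell out where the 66% comes from, here is a deliberately coarse model. It assumes a GF104/GF114 SM has 48 cores arranged as three groups of 16, fed by two warp schedulers that can each issue up to two independent instructions per cycle from the same warp. The function name and the model itself are only illustrative; real issue behaviour has more constraints than this.

```python
# Coarse issue model for one GF104/GF114 SM (an illustration, not a simulator).
CORES_PER_SM = 48
CORES_PER_GROUP = 16
WARP_SCHEDULERS = 2


def peak_fraction(issue_per_scheduler):
    """Fraction of the core groups kept busy for a given average number of
    instructions issued per scheduler per cycle (1 = no ILP, 2 = dual issue)."""
    groups = CORES_PER_SM // CORES_PER_GROUP  # three groups of 16 cores
    busy = min(groups, WARP_SCHEDULERS * issue_per_scheduler)
    return busy / groups


print(peak_fraction(1))    # no dual issue: only two of the three groups are fed
print(peak_fraction(1.5))  # one extra independent instruction every other cycle
```

In this model, single issue feeds two of three core groups (2/3 of peak), and the schedulers need to find on average 1.5 independent arithmetic instructions per cycle to fill all three.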

My kernel has bags of ILP: it consists of an unrolled loop, each iteration of which has 128 fma, 1 global load, 2 shared stores, 16 shared loads and one thread sync. The shared loads are completely independent and the fma instructions are divided into summations to 32 independent accumulators.
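As a sanity check on the instruction mix just described, the sketch below counts only the instructions listed per iteration. It ignores address arithmetic, loop overhead and anything else the compiler emits, so treat it as an estimate rather than a measured figure.

```python
# Per-iteration instruction counts as listed in the post.
mix = {
    "fma": 128,
    "global_load": 1,
    "shared_store": 2,
    "shared_load": 16,
    "thread_sync": 1,
}

total = sum(mix.values())
fma_fraction = mix["fma"] / total

# Ratio of the two GF100/GF110 figures quoted earlier in the thread:
# 79% of floating-point peak versus 91% of instruction peak.
reported_ratio = 0.79 / 0.91

print(total, round(fma_fraction, 3), round(reported_ratio, 3))
```

The FMA share of the listed instructions (128 of 148) is close to the ratio of the two reported GF100/GF110 percentages, which suggests the mix above accounts for nearly all issued instructions.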

I’m guessing that ILP can’t fully be realized because of register pressure.