Looking at the CUDA manual, it appears that the Kepler architecture has 4 warp schedulers per SMX, each of which can issue two instructions from a warp each cycle. Assuming a mix of instruction types, is it possible under ideal conditions to issue 2 instructions from each of 4 warps, for a total of 256 operations per cycle (4 schedulers x 2 instructions x 32 threads)? How independent are the various execution pipelines? It looks like there’s basically a crossbar with the schedulers on one side and a number of different types of execution units on the other (6x32 float, 1x8 double, 6x32 int add, 5x32 int compare, 1x32 int shift, 5x32 logic, 1x32 transcendental, 4x32 int conversion, 1x8 64-bit conversion, 1x32 other conversion). Is this a more or less correct way of looking at it, and in what ways can these execution units be used together? Can they be arbitrarily mixed and matched between the various schedulers, or do some or all of them share resources that prevent them from working in parallel?
In the programming guide, it says that before Kepler, double-precision operations couldn’t be mixed with other instructions, but now they can.
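To make the arithmetic behind the question concrete, here is a minimal sketch in Python that uses only the figures quoted above (4 schedulers, 2 issues each, 32 threads per warp, and the per-unit counts in the list). The names `unit_lanes`, `peak_issue`, and `shortfall` are made up for illustration, and the numbers are taken from the post as-is rather than verified against hardware. It says nothing about whether the unit types are physically separate or share lanes, which is the open question.

```python
# Figures as quoted in the post (per SMX, per cycle); not independently verified.
WARP_SIZE = 32
SCHEDULERS = 4
ISSUES_PER_SCHEDULER = 2

# Thread-level operations per cycle for each unit type, as listed in the question.
unit_lanes = {
    "float": 6 * 32,
    "double": 1 * 8,
    "int add": 6 * 32,
    "int compare": 5 * 32,
    "int shift": 1 * 32,
    "logic": 5 * 32,
    "transcendental": 1 * 32,
    "int conversion": 4 * 32,
    "64-bit conversion": 1 * 8,
    "other conversion": 1 * 32,
}

# Upper bound on issued thread-level operations per cycle if every
# scheduler dual-issues on every cycle.
peak_issue = SCHEDULERS * ISSUES_PER_SCHEDULER * WARP_SIZE


def shortfall(unit: str) -> int:
    """Issue slots left unfilled if only this one unit type were used."""
    return peak_issue - unit_lanes[unit]


if __name__ == "__main__":
    print(f"peak issue: {peak_issue} operations/cycle")
    for name, lanes in unit_lanes.items():
        print(f"{name:18s} {lanes:4d} ops/cycle, short by {shortfall(name)}")
```

By these numbers no single unit type can absorb all 256 issue slots (the widest, float and int add, cover 192), so reaching the peak would require instructions of at least two types to execute in the same cycle without contending for shared resources.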