How does Fermi join two cores for a double-precision (DP) instruction?

Papers and tech reports indicate that Fermi joins two CUDA cores to execute a double-precision instruction. An SM features three sets of computing units: two sets of 16 cores each, and one set of 4 SFUs.
When a double-precision floating-point instruction is issued, which of the following happens?
(a) The 16 cores inside one set join together into 8 DPUs, and a warp is executed by this single set in 4 passes, i.e. 8×4 = 32.
(b) Each core from set 1 pairs with a core from set 2, making 16 DPUs, and a warp is executed by these two sets in 2 passes, i.e. 16×2 = 32.
(c) None of the above; some other explanation applies.

According to “Inside Fermi: Nvidia’s HPC Push”, 64-bit integer instructions are performed in way (a) and 64-bit floating-point instructions in way (b), but that is only a guess, not an official technical report. Can anyone make a solid statement on this?
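Whatever the hardware scheme, the reason a 64-bit integer operation maps naturally onto 32-bit units over multiple passes is that it decomposes into 32-bit halves plus a carry. A minimal sketch in plain C (illustrative only, not a description of NVIDIA's actual datapath):

```c
#include <stdint.h>

/* Illustrative: a 64-bit add built from two 32-bit adds plus a carry,
 * the kind of decomposition a 32-bit ALU could execute in two passes. */
static uint64_t add64_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint32_t lo    = a_lo + b_lo;          /* first 32-bit add        */
    uint32_t carry = lo < a_lo;            /* carry out of low half   */
    uint32_t hi    = a_hi + b_hi + carry;  /* second 32-bit add       */

    return ((uint64_t)hi << 32) | lo;
}
```

The same split is why a 64-bit op costing roughly two 32-bit ALU passes (as in way (a)) is at least arithmetically consistent, even without confirmation from NVIDIA.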

I cannot understand why NVIDIA keeps so many technical details under wraps. Many of them are not trade secrets but important issues for research, especially in HPC. If they intend to lead in HPC, more details should be revealed.



I need a serious explanation of how Fermi and Kepler handle 64-bit integers, as I am working on a project that needs them. Moreover, I hope the penalty for 64-bit vs. 32-bit integers is smaller than that for double precision vs. float.
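One reason to expect *some* penalty regardless of the hardware scheme: a 64-bit integer multiply on 32-bit multipliers requires several 32×32→64 partial products instead of a single operation. A hedged sketch in C of that decomposition (an assumption about the general technique, not NVIDIA's documented implementation):

```c
#include <stdint.h>

/* Illustrative: the low 64 bits of a 64x64 product, built from 32-bit
 * halves. Three 32x32->64 partial products replace one native multiply;
 * the a_hi*b_hi term only affects bits above 64 and is dropped. */
static uint64_t mul64_via_32(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);

    uint64_t lo_lo = (uint64_t)a_lo * b_lo;                       /* low partial    */
    uint64_t cross = (uint64_t)a_lo * b_hi + (uint64_t)a_hi * b_lo; /* cross terms  */

    return lo_lo + (cross << 32);          /* modulo 2^64 */
}
```

So even a generous hardware design would spend roughly 3x the multiplier work of a 32-bit op, before counting the adds that combine the partials.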

Good question, the docs seem to entirely omit details on how 64-bit integers are handled.

One thing is sure: integer throughput on Kepler is not stellar. As far as I know, the relative throughput (with respect to 32-bit floating point) of many native arithmetic instructions has dropped considerably compared to CC 2.x.

Maybe that’s because NVIDIA is still (much) more of a consumer-hardware company than an HPC company. Maybe I should even omit the “maybe”… ;)

I believe that a more open NVIDIA GPU computing ecosystem would be hugely beneficial for the computing community, research, and even industry. It would boost adoption and would also improve the confidence of the numerous computing geeks still wary about closed source/technology/mindset.

Fully agreed. Having the opaque ptxas “instruction set concealer” in the workflow, which just isn’t up to par with current compiler technology, is highly annoying. Maybe NVIDIA could find a way to also base it on LLVM, just like the new compiler.