How does Fermi join two cores for a double-precision (DP) instruction?

Papers and tech reports indicate that Fermi joins two CUDA cores to execute a double-precision instruction. An SM features three sets of execution units: two sets of 16 cores each, plus one set of 4 SFUs.
When a double-precision floating-point instruction is issued, which of the following happens?
(a) The 16 cores inside one set join into 8 DP units, and a warp is executed by that single set in 4 passes, i.e. 8×4 = 32.
(b) Each core from set 1 pairs with a core from set 2, making 16 DP units, and a warp is executed by the two sets together in 2 passes, i.e. 16×2 = 32.
(c) Neither of the above is right; there is some other explanation.
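
One way to probe this empirically, short of an official statement: time a dependent FMA chain in float and in double with enough warps resident to saturate the SM. If per-SM FP64 throughput comes out around 1/2 of FP32, that is consistent with (b) (16 DP units, 2 passes per warp); around 1/4 would point to (a) (8 DP units, 4 passes). Caveat: GeForce Fermi boards cap FP64 below the Tesla-class rate, so the architectural ratio is only visible on Tesla parts. This is just a sketch; the kernel and variable names are mine.

```
#include <cstdio>
#include <cuda_runtime.h>

// Dependent FMA chain: each iteration needs the previous result, so with
// many resident warps the measured time reflects FMA issue throughput.
template <typename T>
__global__ void fma_loop(T *out, T a, T b, int iters) {
    T x = a;
    for (int i = 0; i < iters; ++i)
        x = x * b + a;                 // should compile to FFMA / DFMA
    out[threadIdx.x + blockIdx.x * blockDim.x] = x;  // keep result live
}

template <typename T>
float time_kernel(T *out, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_loop<T><<<64, 256>>>(out, (T)1.0000001, (T)0.9999999, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

int main() {
    const int iters = 1 << 20;
    float  *outf; cudaMalloc(&outf, 64 * 256 * sizeof(float));
    double *outd; cudaMalloc(&outd, 64 * 256 * sizeof(double));
    float msf = time_kernel(outf, iters);
    float msd = time_kernel(outd, iters);
    printf("fp32: %.2f ms  fp64: %.2f ms  fp64/fp32 ratio: %.2f\n",
           msf, msd, msd / msf);
    return 0;
}
```

Compile with nvcc -arch=sm_20; the absolute times don't matter, only the double/float ratio.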

According to “Inside Fermi: Nvidia’s HPC Push”, 64-bit integer instructions are executed in way (a) and 64-bit floating-point instructions in way (b), but that article is an educated guess, not an official technical report. Can anyone make a solid statement on this?

I cannot understand why NVIDIA keeps so many technical details under wraps. Many of them are not trade secrets but matters of real importance for research, especially in HPC. If they intend to lead in HPC, more details should be disclosed.

Thanks.

+1

I need a serious explanation of how Fermi and Kepler handle 64-bit integers, as I am working on a project that needs them. Moreover, I hope the penalty for 64-bit versus 32-bit integers is smaller than the penalty for double precision versus float.
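
One thing you can check yourself without any NVIDIA statement: how the toolchain lowers 64-bit integer arithmetic. As far as I can tell, Fermi/Kepler have no native 64-bit integer ALUs, so a 64-bit add is emulated as a pair of 32-bit adds with carry (add.cc.u32 / addc.u32 at PTX level), and a 64-bit multiply expands into several 32-bit multiply/multiply-add steps; so expect roughly a 2x (add) to several-x (mul) penalty rather than the DP/float one. A minimal probe kernel (names are mine):

```
// add64.cu -- trivial kernel whose generated code shows the 64-bit lowering
__global__ void add64(long long *c, const long long *a, const long long *b) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    c[i] = a[i] + b[i];   // one 64-bit integer add per thread
}

// To inspect:
//   nvcc -arch=sm_20 -ptx add64.cu     # PTX: look for add.cc.u32 / addc.u32
//   nvcc -arch=sm_20 -cubin add64.cu
//   cuobjdump -sass add64.cubin        # SASS: the actual machine instructions
```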

Good question, the docs seem to entirely omit details on how 64-bit integers are handled.

One thing is sure: integer throughput on Kepler is not stellar. As far as I know, the throughput of many native arithmetic instructions relative to 32-bit floating point has dropped considerably compared to CC 2.x.
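
To put one (from-memory, please verify) number on that: the arithmetic-throughput table in the CUDA C Programming Guide lists 32-bit integer multiply at 16 ops/clock per SM on CC 2.0 against 32 FP32 ops/clock, a 1:2 ratio, while on CC 3.x it is 32 ops/clock per SMX against 192 FP32 ops/clock, only 1:6.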

Maybe that’s because NVIDIA is still (much) more of a consumer-hardware company than an HPC company. Maybe I should even omit the “maybe”… ;)

I believe that a more open NVIDIA GPU computing ecosystem would be hugely beneficial for the computing community, research, and even industry. It would boost adoption and would also improve the confidence of the numerous computing geeks still wary about closed source/technology/mindset.

Fully agreed. Having the opaque ptxas “instruction set concealer” in the workflow, one that just isn’t up to par with current compiler technology, is highly annoying. Maybe NVIDIA could find a way to base it on LLVM as well, just like the new compiler.