What do you think about these numbers (at the end of the page) that estimate the new NVIDIA GT300 to be below ATI’s Radeon 5870? Of course, this is for a specific application, which I think is the case for most GPU developers here.
(As an author of the linked paper) The numbers are OK :).
The only unclear part right now (for me) is whether Fermi can dual-issue two integer instructions per clock or not. It looks like it can’t, so this dual-issue thing is more like HT on Intel’s CPUs: when we’re already near peak performance it’s simply useless. Also, the SP count and frequency published for Tesla are way lower than was expected in September. So, unfortunately, for cryptography tasks the upcoming GT300 (should we call it GF100?) GPUs won’t be that good. “Unfortunately” because there is still no properly working SDK for ATI GPUs right now.
On the other hand, for DPFP calculations Fermi will be just an outstanding chip. And in many more applications too.
Thanks. There is a video on YouTube with a “Fermi vs ATI” comparison. It mentions that Fermi is targeted at HPC folks. If this is really true, Fermi-based cards will cost a lot more than current ATIs in the 400-600 dollar range. So I have to think about migrating to ATI if cost is going to be the issue.
You should be able to implement bit rotations using the bit-align instruction introduced with Direct3D 11 and supported on both Fermi and Cypress (computes ((a:b) >> c) & 0xffffffff, where a:b is the concatenation of two 32-bit operands).
This adds nothing to the “NVIDIA vs. AMD” debate, but should provide a nice further improvement compared to the previous generation.
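To make the bit-align trick concrete, here is a rough Python model of it. The function names (`bitalign`, `rotr32`, `rotl32`, `rotl32_reference`) are made up for illustration, and the model only mirrors the formula quoted above, ((a:b) >> c) & 0xffffffff; it says nothing about how either GPU actually encodes the instruction.

```python
MASK32 = 0xFFFFFFFF


def bitalign(a, b, c):
    """Model of ((a:b) >> c) & 0xffffffff for 32-bit a and b, 0 <= c <= 32."""
    wide = ((a & MASK32) << 32) | (b & MASK32)  # a:b as one 64-bit value
    return (wide >> c) & MASK32


def rotr32(x, n):
    """Rotate right: feed the same word in as both halves."""
    return bitalign(x, x, n % 32)


def rotl32(x, n):
    """Rotate left by n, expressed as rotate right by 32 - n."""
    return bitalign(x, x, (32 - n) % 32)


def rotl32_reference(x, n):
    """Classic three-operation version (shl + shr + or) for comparison."""
    x &= MASK32
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK32
```

The point of the sketch is only that one bit-align with both operands equal does the work of the three operations in `rotl32_reference`.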
Maybe some other tricks are possible…
For instance both G80 and Fermi support free binary negation of operands to logic instructions (allowing NOR, NAND, NXOR, ANDN…), and Fermi supports a left shift followed by an addition as a single instruction.
Edit: also, there is always the MAD24 instruction for computations such as 5*i+1 (much faster than adds).
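For the MAD24 remark, a quick model of what a 24-bit multiply-add computes and where it stops being safe. `umad24` and `index_5i_plus_1` are hypothetical helpers, and the sketch assumes the unsigned flavour (both multiplicands truncated to their low 24 bits, result wrapped to 32 bits); the real instruction semantics should be checked against the vendor docs.

```python
MASK24 = 0xFFFFFF
MASK32 = 0xFFFFFFFF


def umad24(a, b, c):
    """Assumed model: (low 24 bits of a) * (low 24 bits of b) + c, mod 2**32."""
    return ((a & MASK24) * (b & MASK24) + c) & MASK32


def index_5i_plus_1(i):
    """The 5*i+1 computation from the post, done as one multiply-add."""
    return umad24(5, i, 1)
```

Under this model the result matches plain 5*i+1 as long as i fits in 24 bits; once i reaches 2**24 its upper bits are silently dropped.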
Does dual-issue double peak performance, or does it only help to reach peak performance? My doubts come from some unclear (for me at least) points in Fermi’s whitepaper. First, I don’t believe that the marketing guys would have missed the opportunity to claim “integer performance is now 4x better” as was done with DPFP: 2x from the increased number of SPs and another 2x (or 4x for DPFP) from the better architecture. My next doubt is here
As an SM contains 32 CUDA cores, why is the warp scheduler using only half of them? My last doubt comes from the peak performance for SPFP: GT200 with 240 CUDA cores can do 240 MADs per clock, Fermi with 512 CUDA cores can do 512 FMAs per clock. Why hasn’t this number doubled? Is FMA so complex that it takes 2x more time? In that case, wouldn’t it be better to have 2x512 MADs instead of 512 FMAs?
Anyway, these are only my doubts, and as there are NVIDIA guys here they can easily say: “Yes, Fermi can perform 2 instructions per clock, so peak integer performance in fact doubles”. That would settle everything :).
The problem now is that NVIDIA doesn’t have a working new-generation GPU while having a mature CUDA SDK, and ATI has a new-generation GPU while lacking a working SDK for it. So if you have to choose right now, it’s better to purchase some cheap ATI GPU (like a 5770 if you don’t need DPFP) for tests. Otherwise you may end up stuck with a pack of 5970s which are theoretically very fast but practically impossible to code for because of SDK problems. Check out ATI’s OpenCL forum: people are facing problems everywhere. You can’t just take NVIDIA’s OpenCL program and compile it for ATI GPUs; it will (or at least should) work, but performance can be just terrible.
If you can wait now, just wait. Either for Fermi’s release or for ATI fixing the bugs in their SDK, dunno which will happen first.
The main question here: do they really exist in hardware, or are they just emulated? For example, both ATI’s IL and nVidia’s PTX contain an IMAD (integer multiply and add) instruction, however in reality there are no hardware 32-bit IMADs. For ATI, IMAD translates into MUL+ADD and MUL can only be performed on the ‘t’ unit, so peak performance drops by 5x. For nVidia there is a 24-bit IMAD, so a 32-bit IMAD is translated into a series of 5 instructions.
While a hardware IMAD could accelerate cyclic rotate (by replacing “shr+shl+or” with “shr+imad”), in reality it doesn’t work at all.
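For anyone wondering what “shr+imad” means here, this is my reading of it as a small Python sketch: the left shift becomes a multiplication by 2**n, and since the two halves never overlap, the OR can be an ADD. The names are hypothetical, and the sketch assumes a true 32-bit multiply-add, which is exactly what the post says the hardware lacks.

```python
MASK32 = 0xFFFFFFFF


def rotl32_shift_or(x, n):
    """Reference rotate: shl + shr + or, for 1 <= n <= 31."""
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl32_imad(x, n):
    """Rotate as shr + imad, for 1 <= n <= 31.

    hi holds the bits that wrap around; x * 2**n is the left shift.
    The low n bits of the product are zero, so adding hi equals OR-ing it.
    """
    x &= MASK32
    hi = x >> (32 - n)                     # shr
    return (x * (1 << n) + hi) & MASK32    # imad with constant multiplier 2**n
```

Arithmetically the two versions agree, so any slowdown comes from how the multiply-add is implemented, not from the identity itself.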
Looking at ATI IL v2 “7.13 Multi-Media Instructions” there are:
bitalign dst, src0, src1, src2 – Aligns bit data for video. This is a special instruction for multi-media video.
dst = (src0 << src2.x) || (src1 >> (32-src2.x)).
src2.x must be 0, 8, 16, 24, or 32.
bytealign dst, src0, src1, src2 – Aligns byte data for video. This is a special instruction for multi-media video.
As you can see, the limitations on src2.x make it useless for cyclic rotation. However, bitalign is in fact implemented in RV870 hardware.
That wiki page (which, btw, I clearly referred to in my paper) is constantly changing. Right now it has peak FLOPS for Fermi computed as #SP * Frequency * 3, which is of course wrong because there is no “mythical +MUL” in Fermi. Anyway, the SP count and frequency are still at the same estimates as 3+ months ago, though for Tesla these values have already dropped significantly in the official papers from nVidia.
Yep. Almost everything about Fermi is speculation right now. And we can only blame nVidia for this ;).
~“We’ve tested Fermi and it’s fast” doesn’t help, seriously.
The 32 cores are divided in 2 groups of 16. Each group processes a warp in 2 clocks, so 2 warps are processed in 2 clocks (in current generation a warp is processed in 4 clocks on 8 SPs). This is both true for 32bit floating point and 32bit integer.
For double precision floating point, the logic of the 32 cores is combined to make 16 double precision cores (don’t know of a better way of expressing myself), which are used to process a single warp in 2 clocks.
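The scheduling arithmetic above can be written down in a few lines. This is only a back-of-the-envelope model with made-up function names; it assumes a warp of 32 threads and that each group of cores works through one warp at a time.

```python
WARP_SIZE = 32  # threads per warp (assumed, as on current NVIDIA hardware)


def clocks_per_warp(lanes_per_group):
    """Clocks one group of cores needs to push a whole warp through."""
    if lanes_per_group <= 0 or WARP_SIZE % lanes_per_group != 0:
        raise ValueError("lanes_per_group must evenly divide the warp size")
    return WARP_SIZE // lanes_per_group


def threads_per_clock(groups, lanes_per_group):
    """Thread-instructions retired per clock by one multiprocessor."""
    return groups * WARP_SIZE // clocks_per_warp(lanes_per_group)
```

With 2 groups of 16 this gives 2 clocks per warp and 32 thread-instructions per clock; with a single group of 8 SPs it gives 4 clocks and 8. In both cases the per-clock figure equals the core count, so in this model issuing from two warps keeps the cores busy without raising the peak.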
Yeah, I understand Fermi’s dual-issuing in the same way. So it’s like the HT thing in CPUs: in some cases it helps to reach peak performance, but one SP (or CUDA core) can still handle only one thread per clock, so peak performance is simply #SP * Frequency. And for Fermi (being optimistic) it’s 512 * 1.6 (maybe 1.8 a bit later after launch) = 819.2B, while for the 5870 it’s already 1360B.
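The same estimate as a tiny calculator, in case someone wants to plug in other clocks. The function name is made up; the 512 SPs and the 1.6/1.8 GHz clocks are the optimistic guesses from this post, and the 5870 inputs (1600 SPs at 850 MHz) are my assumption from the public spec sheet, chosen because they reproduce the 1360B figure.

```python
def peak_gops(sp_count, clock_mhz, ops_per_sp_per_clock=1):
    """Peak throughput in billions of operations per second.

    Uses the simple model #SP * Frequency, with one operation per SP per clock
    unless told otherwise.
    """
    return sp_count * clock_mhz * ops_per_sp_per_clock / 1000.0


fermi_optimistic = peak_gops(512, 1600)   # speculative launch clock
fermi_later = peak_gops(512, 1800)        # speculative later clock
radeon_5870 = peak_gops(1600, 850)        # assumed public spec
```

Even with the 1.8 GHz guess, this model leaves Fermi well below the 5870 for integer-bound work, which is the whole point of the comparison.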
I was too curious, so I’ve tested it. Actually there is no limitation on src2.x, so bitalign works with any shift count and it’s implemented in hardware, thus replacing the 3 instructions for cyclic rotation with only one. Thanks for the idea, Sylvain, it brings a 20-25% speed-up for MD5 :). With other optimizations my 5770 hits 1800M for a single MD5 hash. I like it.