Performance of the new NVIDIA chip

Hi folks,
what do you think about these numbers (at the end of the page) that estimate the new NVIDIA GT300 to be below ATI's Radeon 5870? Of course, this is for a specific application, but I think that is the common case for the GPU developers here

golubev.com

Thanks for any comment

(As the author of the linked paper) The numbers are OK :).

The only part that is unclear right now (for me): can Fermi dual-issue two integer instructions per clock or not? It looks like it can't, so this dual-issue thing is more like HT on Intel's CPUs: when we're already near peak performance it's simply useless. Also, the SP count and frequency published for Tesla are way lower than was expected back in September. So, unfortunately, for cryptography tasks the upcoming GT300 (or should we call it GF100?) GPUs won't be that good. "Unfortunately" because there is still no normally working SDK for ATI GPUs right now.

On the other hand, for DPFP calculations Fermi will be an outstanding chip. And in many more applications too.

I have not seen anything to suggest that, as far as I remember. I thought only double precision is not dual-issued (obviously).

Page 10 of the Fermi architecture whitepaper:

N.

Thanks. There is a video on YouTube with a "Fermi vs ATI" comparison. It mentions that Fermi is targeted at HPC folks. If this is really true, Fermi-based cards will cost a lot more than current ATIs in the 400-600 dollar range. So I have to think about migrating to ATI if cost is going to be an issue.

You should be able to implement bit rotations using the bit-align instruction introduced with Direct3D 11 and supported on both Fermi and Cypress (it computes ((a:b) >> c) & 0xffffffff, where a:b is the concatenation of the two 32-bit operands).
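For what it's worth, the semantics quoted above are easy to model in plain C. The function names below are just illustrative, not actual GPU mnemonics; the point is that passing the same word as both operands turns the funnel shift into a rotate:

```c
#include <stdint.h>

/* Model of the D3D11 bit-align semantics: take the 64-bit
 * concatenation a:b and shift it right by c (0..31), keeping
 * the low 32 bits of the result. */
static uint32_t bitalign(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t ab = ((uint64_t)a << 32) | b;
    return (uint32_t)((ab >> c) & 0xffffffffu);
}

/* With src0 == src1, the funnel shift becomes a rotate right,
 * so the usual shl+shr+or sequence collapses into one op. */
static uint32_t rotr(uint32_t x, uint32_t c)
{
    return bitalign(x, x, c);
}
```

On hardware that really has the instruction, this is what lets a single bit-align replace the three-instruction rotate.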

This adds nothing to the “NVIDIA vs. AMD” debate, but should provide a nice further improvement compared to the previous generation.

Maybe some other tricks are possible…

For instance both G80 and Fermi support free binary negation of operands to logic instructions (allowing NOR, NAND, NXOR, ANDN…), and Fermi supports a left shift followed by an addition as a single instruction.

Edit: also, there is always the MAD24 instruction for computations such as 5*i+1 (much faster than a chain of adds).
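A plain-C sketch of what MAD24 buys you (CUDA exposes the multiply part as __umul24; the helper names here are made up). Both operands are truncated to their low 24 bits before the multiply, so this is only safe while the index stays below 2^24:

```c
#include <stdint.h>

/* 24-bit multiply-add model: operands are truncated to their
 * low 24 bits first, then multiplied and accumulated. */
static uint32_t mad24(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & 0xffffffu) * (b & 0xffffffu) + c;
}

/* 5*i + 1 in a single "instruction" instead of an add/shift
 * chain; valid while i < 2^24. */
static uint32_t idx_5i_plus_1(uint32_t i)
{
    return mad24(i, 5u, 1u);
}
```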

One thing I spotted.

It appears he has taken the GeForce 310 information from a Wikipedia comparison of Nvidia GPUs and assumed it applied to the Fermi range.
(List of Nvidia graphics processing units - Wikipedia)

I believe the GeForce 310 is just a rebadged 210 and not a Fermi part.

A quite detailed, technical, moderated, and usually intelligent forum thread of Fermi analysis, hypothesis, and rumors is on the Beyond 3D site.

The summary is: We don’t know how Fermi will compare either for graphics or GPGPU yet. There’s just not enough information.

Does dual-issue double peak performance, or does it just help to reach peak performance?.. My doubts come from some moments in Fermi's whitepaper that are unclear (for me at least). First, I don't believe the marketing guys would have missed the opportunity to claim "integer performance is now 4x better", as was done for DPFP: 2x from the increased number of SPs and another 2x (or 4x for DPFP) from the better architecture. My next doubt is here

As an SM contains 32 CUDA cores, why does the warp scheduler use only half of them? My last doubt comes from peak SPFP performance: GT200 with 240 CUDA cores can do 240 MADs per clock, and Fermi with 512 CUDA cores can do 512 FMAs per clock. Why hasn't this per-core number doubled? Is FMA so complex that it takes twice as long? In that case, wouldn't it be better to have 2x512 MADs instead of 512 FMAs?

Anyway, these are only my doubts, and since there are NVIDIA guys here they can easily say: "Yes, Fermi can perform 2 instructions per clock, so peak integer performance does in fact double." That would settle everything :).

The problem now is that NVIDIA doesn't have a working new-generation GPU while having a mature CUDA SDK, and ATI has a new-generation GPU while lacking a working SDK for it. So if you have to choose right now, it's better to purchase some cheap ATI GPU (like a 5770, if you don't need DPFP) for tests. Otherwise you may end up stuck with a pack of 5970s which are theoretically very fast but practically impossible to code for because of SDK problems. Check out ATI's OpenCL forum: people are facing problems everywhere. You can't just take an NVIDIA OpenCL program and compile it for ATI GPUs; it will (or at least should) work, but performance can be just terrible.

If you can wait now, just wait. Either for Fermi's release or for ATI to fix the bugs in their SDK; I don't know which will happen first.

The main question here: do they really exist in hardware, or are they just emulated? For example, both ATI's IL and NVIDIA's PTX contain an IMAD (integer multiply-add) instruction, but in reality there is no hardware 32-bit IMAD. On ATI, IMAD translates into MUL+ADD, and the MUL can only be performed on the 't' unit, so peak performance drops by 5x. On NVIDIA there is a 24-bit IMAD, so a 32-bit IMAD is translated into a series of 5 instructions.
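To illustrate why a 32-bit multiply expands into several instructions when only a 24-bit multiplier exists, here is one possible decomposition in plain C (the actual compiler-generated sequence may differ; this only shows where the extra operations come from):

```c
#include <stdint.h>

/* 32-bit multiply (mod 2^32) built from 16x16-bit partial
 * products, each small enough for a 24-bit multiplier. */
static uint32_t mul32_from_small_muls(uint32_t a, uint32_t b)
{
    uint32_t lo = (a & 0xffffu) * (b & 0xffffu);  /* aL*bL */
    uint32_t m1 = (a >> 16)     * (b & 0xffffu);  /* aH*bL */
    uint32_t m2 = (a & 0xffffu) * (b >> 16);      /* aL*bH */
    /* aH*bH only contributes above bit 32, so it vanishes mod 2^32 */
    return lo + ((m1 + m2) << 16);
}
```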

While a hardware IMAD could accelerate cyclic rotation (by replacing "shr+shl+or" with "shr+imad"), in reality this doesn't work at all.
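The identity behind the "shr+imad" rotate is easy to check in plain C: x * (1 << s) wraps mod 2^32 exactly like x << s, and the two shifted halves never overlap, so the add cannot lose a carry. (Helper names are mine.)

```c
#include <stdint.h>

/* Ordinary 3-op rotate left: shl + shr + or (s must be 1..31). */
static uint32_t rotl3(uint32_t x, unsigned s)
{
    return (x << s) | (x >> (32 - s));
}

/* The shr+imad form: with a hardware 32-bit IMAD this would be
 * 2 instructions (a shift, then multiply-add). s must be 1..31. */
static uint32_t rotl_imad(uint32_t x, unsigned s)
{
    return x * (1u << s) + (x >> (32 - s));
}
```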

Looking at ATI IL v2 “7.13 Multi-Media Instructions” there are:

bitalign dst, src0, src1, src2 – Aligns bit data for video. This is a special instruction for multi-media video.

dst = (src0 << src2.x) || (src1 >> (32 - src2.x)).

src2.x must be 0, 8, 16, 24, or 32.

bytealign dst, src0, src1, src2 – Aligns byte data for video. This is a special instruction for multi-media video.

dst = (src0 << 8*src2.x) || (src1 >> (32 - 8*src2.x)).

src2.x must be 0, 1, 2, or 3.

As you can see, the limitations on src2.x make it useless for cyclic rotation. However, bitalign is in fact implemented in RV870 hardware.

That wiki page (which, by the way, I was clearly referring to in my paper) is constantly changing. Right now it has peak FLOPS for Fermi computed as #SP * Frequency * 3, which is of course wrong because there is no "mythical +MUL" in Fermi. Anyway, the SP count and frequency are still at the same estimates as 3+ months ago, though for Tesla these values have already dropped significantly in the official papers from NVIDIA.

Yep. Almost everything about Fermi is speculation right now. And we can only blame NVIDIA for this ;).

~"We've tested Fermi and it's fast" doesn't help, seriously.

As far as I understand it’s like this for a SM:

The 32 cores are divided into 2 groups of 16. Each group processes a warp in 2 clocks, so 2 warps are processed every 2 clocks (in the current generation a warp is processed in 4 clocks on 8 SPs). This is true for both 32-bit floating point and 32-bit integer.
For double-precision floating point, the logic of the 32 cores is combined into 16 double-precision cores (I don't know a better way of expressing it), which process a single warp in 2 clocks.

Right, but AMD now also supports MAD24 (MULADD_UINT24) in all 4 xyzw units, starting with Cypress…

I find this limitation rather puzzling…

It’s not mentioned in the ISA document, and I don’t believe they would have introduced another instruction just to save a 3-bit shift…

We have yet to find the bit align instruction on Fermi, but it definitely supports bit insert as a single instruction (Cypress needs 2).

Yeah, I also understand Fermi's dual-issuing the same way. So it's like the HT thing in CPUs: in some cases it helps to reach peak performance, but each SP (or CUDA core) can still handle only one instruction per clock, so peak performance is simply #SP * Frequency. And for Fermi (being optimistic) that's 512 * 1.6 (maybe 1.8 a bit later after launch) = 819.2B, while for the 5870 it's already 1360B.
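The back-of-the-envelope figures above, redone in integer MHz so nothing is lost to rounding (the clocks are the rumored figures, not confirmed specs):

```c
/* Peak 32-bit integer throughput under the one-op-per-core-per-clock
 * reading above (i.e. dual-issue does NOT double integer throughput).
 * Result is in Mops/s since the clock is given in MHz. */
static long peak_mops(long cores, long clock_mhz)
{
    return cores * clock_mhz;
}
```

peak_mops(512, 1600) gives 819200 Mops/s (the 819.2B figure for Fermi), while peak_mops(1600, 850) gives 1360000 Mops/s (the 1360B figure for the 5870).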

I was too curious, so I tested it. Actually there is no limitation on src2.x, so bitalign works with any shift count, and it is implemented in hardware, thus replacing the 3 instructions for cyclic rotation with only one. Thanks for the idea, Sylvain, it brings a 20-25% speed-up for MD5 :). With other optimizations my 5770 hits 1800M/s for a single MD5 hash. I like it.

You should probably compare the two in terms of flops:

ATI 5870 : 850 MHz x 1600 SP x 2 Ops/cycle (FMA) = 2.72 TFlop

NV Fermi: 1600-1800 MHz x 512 SP x 3 Ops/cycle (FMA+MUL) = 2.45 - 2.76 TFlop

N.

I was talking about integer performance, not FP, so no x2/x3.

But for SPFP (as far as I can understand, because Fermi's documentation is quite poor) there is no +MUL on Fermi, so it'll be 512 * 1.6 * 2 = 1.638 TFlops.

That should be 2 ops per cycle. The whole FMA+MUL dual issue thing is not mentioned anymore as far as I have seen. It also seems unlikely now that a multiprocessor has 32 instead of 8 SPs.

Where did you learn this? Perhaps from using the CUDA 3.0 toolkit beta and looking at the generated PTX for new opcodes?