Tesla C2050-based Supercomputer ranked #2 in the world!

[url=“http://www.top500.org/list/2010/06/100”]http://www.top500.org/list/2010/06/100[/url]

The Chinese are surely rich and wise to use C2050s in the petaflops game.

Peak 2.98 PFLOPS, i.e. ~6,000 Tesla C2050s???

No, there are some Intel CPUs in there too.

I know but still most of the PFLOPS should come from the Teslas

Looks like the Chinese like the idea of GPGPU very much. Their 2nd- and 3rd-ranked supercomputers are based on the AMD 4870 and Tesla respectively. Do no other countries in the top 100 use GPGPU???

BTW, is LINPACK a fair assessment tool for GPGPUs??? I suppose you'd need a LINPACK tailored for GPUs to be fair???

Trust me - this ranking is as fair as it gets.

And that’s one Hell of an accomplishment for the first time out!..

Roadrunner is the next closest thing (slid from #1 to #3), using a high-end variant of the Cell processor for 90% of the FLOPS.

[url=“http://www.hpcwire.com/home/specialfeaturetopitem/TOP500-Sluggish-But-Chinese-Supers-May-Portend-Big-Changes-Ahead-95271619.html”]http://www.hpcwire.com/home/specialfeature...d-95271619.html[/url]

Teslas contributed 2.32 PFLOPS of the 2.98 PFLOPS peak. So that's more than 4,600 C2050s.
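A back-of-envelope check on that card count, assuming NVIDIA's advertised 515 GFLOPS double-precision peak per C2050 (the exact per-card figure the TOP500 submission used is an assumption here):

```python
# Rough estimate of how many C2050s account for the GPU share of
# Nebulae's peak. 515 GFLOPS is the advertised double-precision peak
# per card -- a sketch, not the vendor's exact accounting.
gpu_peak_share = 2.32e15   # FLOPS attributed to the Teslas (from this thread)
c2050_dp_peak = 515e9      # assumed double-precision peak per card

cards = gpu_peak_share / c2050_dp_peak
print(f"~{cards:.0f} C2050s")  # lands in the mid-4,000s
```

The thread's "more than 4,600" figure is consistent with this order of magnitude if the actual machine reserves some headroom per card.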

But how do you explain that CPU-based systems usually run LINPACK at 70-80% of peak PFLOPS, while GPU-based systems usually achieve less than 50% of peak?

I imagine a lot of it has to do with PCIe bandwidth. But I don’t know how you explain the Mole-8.5 system at only 18.2%:

[url=“http://top500.org/system/10561”]http://top500.org/system/10561[/url]

I’ve never seen a system with such a low Rmax compared to Rpeak.
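For reference, the "efficiency" numbers being thrown around in this thread are just Rmax divided by Rpeak from the TOP500 entries. Using the figures already quoted here (Nebulae's 1.271 PFLOPS Rmax against its 2.98 PFLOPS Rpeak):

```python
def hpl_efficiency(rmax, rpeak):
    """Fraction of theoretical peak achieved on the HPL run."""
    return rmax / rpeak

# Nebulae, using the PFLOPS figures quoted earlier in this thread
print(f"Nebulae: {hpl_efficiency(1.271, 2.98):.1%}")

# Typical CPU-only systems run HPL at 70-80% of peak, which is why the
# GPU-accelerated machines stand out so starkly on this metric.
```

The ~43% result for Nebulae is still more than twice the 18.2% quoted for Mole-8.5.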

hooray :)

Hey Mr. Murray, can you explain to us why we can only reach such a low percentage of our peak??? Thanks a lot!

I just noticed that the #2 is located in Shenzhen.

Shenzhen is also the headquarters of BGI, who bought 128 Illumina HiSeq 2000 machines to do DNA sequencing. Supposedly the machines allow them to sequence 1,000+ genomes per year at 30x coverage. I am wondering if that's what these Teslas will be used for.

Here’s a follow-up article on CNet.

[url=“http://news.cnet.com/8301-13924_3-20006450-64.html?part=rss&subj=news&tag=2547-1_3-0-20”]http://news.cnet.com/8301-13924_3-20006450...g=2547-1_3-0-20[/url]

edit -

If I’m reading this correctly, it would seem the PCIe link is indeed the reason for the disparity in their eyes as well…

Does anyone in the know care to take this opportunity to offer their perspective on the state of the art for using HPL to measure the performance of GPU clusters?

The general notion at ISC’10 (where this has been announced) was:

  • The Nebulae system is one hell of a power-efficient machine, and only because of the accelerators (aka Fermis) in it. People complained a lot that the 2-4 MW number is somewhat unofficial compared to the quoted numbers for, e.g., Jaguar at Oak Ridge.

  • The PFLOP/s-counting folks complained a lot about the extreme (by TOP500 standards) deviation between Rmax and Rpeak on the four GPU-accelerated machines. But this is just another indication of why Linpack and HPL are not necessarily good metrics for measuring performance. The Nebulae architects most probably didn't design this baby to max out in HPL :)

Apart from that, it was really fun to note how ISC’10 was centered around accelerators. Intel rebranded their Larrabee as MIC in a keynote (aka they quit the graphics market and joined the HPC world with the-design-formerly-known-as-LRB). Mellanox announced RDMA from CUDA page-locked memory into their IB fabric, as I reported in another thread. Half of the exhibition booths had GPUs in them. Folks at NERSC, at vendors, from the big-buck labs and from all over Europe and the Middle and Far East kept on asking: what kind of application-level performance can I get from using GPUs, now that their power efficiency is clearly established?

I tried my best not to amplify common GPGPU criticism (comparing with a single-core CPU reference, comparing with unoptimised CPU code) in my talk about a recent collaboration on doing seismic wave propagation on 192 GPUs (the largest cluster we could get our hands on). In case someone is interested, the paper DOI is [url=“http://dx.doi.org/10.1007/s00450-010-0109-1”]http://dx.doi.org/10.1007/s00450-010-0109-1[/url] and the slides are available on my homepage: [url=“http://www.mathematik.tu-dortmund.de/~goeddeke/pubs/talks/talk_isc2010.pdf”]http://www.mathematik.tu-dortmund.de/~goed...alk_isc2010.pdf[/url]

So LINPACK also takes into account the link between nodes? If you use 12X InfiniBand QDR at 96 Gbps, will you get a higher LINPACK score than with 10 Gbps Ethernet?

Actually, the interviewee lives about thirty minutes from me and I was trying to contact him through a friend at UT, but no luck thus far. Bad timing this weekend.

What little I’ve gathered was through the friend, obviously, and not the good Doctor. Point taken - I’m not that qualified and I’ll bow out…

LINPACK certainly uses the network, but it is well known that good performance doesn't depend on having a great network. The systems these days have so much memory that the ratio of communication to computation is very low, so it's just a FLOPS contest. That's why you see so many clusters make it into the Top 500 using only gigabit Ethernet. The running joke is that you could probably do the communication using floppies, or USB flash drives these days, and still get a decent score. A better benchmark of system performance is HPCC:
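The "floppies would almost do" joke has a kernel of truth: HPL's arithmetic grows as O(n³) while the data exchanged grows roughly as O(n²), so bigger node memories (bigger n) shrink the communication share. A toy scaling sketch, with assumed constants that are not a model of any real machine:

```python
def comm_to_comp_ratio(n):
    """Rough scaling argument for HPL on an n x n problem:
    total FLOPs ~ (2/3) n^3, total data moved ~ c * n^2, where c
    absorbs process-grid and panel-width details we ignore here."""
    flops = (2.0 / 3.0) * n**3
    bytes_moved = 8.0 * n**2   # one n x n sweep of 8-byte doubles, schematically
    return bytes_moved / flops

# Doubling n halves the communication-to-computation ratio, so
# large-memory nodes barely stress the interconnect during HPL.
for n in (10_000, 100_000, 1_000_000):
    print(n, comm_to_comp_ratio(n))
```

This is why a fatter interconnect (the 12X QDR vs. 10 GbE question above) helps far less on HPL than on communication-bound applications.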

http://icl.cs.utk.edu/hpcc/

But the benchmarks aren't well defined for heterogeneous computing, since it isn't specified whether they refer to performance only within GPU memory or whether you need to include the PCIe data-transfer costs.
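To make that ambiguity concrete, here's a sketch of how much the PCIe term can matter for a single large double-precision GEMM. Both constants are assumptions for illustration (515 GFLOPS device peak, ~6 GB/s effective PCIe bandwidth), not measurements:

```python
def dgemm_times(n, gpu_flops=515e9, pcie_bw=6e9):
    """Estimated seconds for an n x n double-precision GEMM on the GPU,
    and for shipping A, B, and C (three n x n matrices of 8-byte
    doubles) over PCIe. Assumed rates, for illustration only."""
    compute = 2.0 * n**3 / gpu_flops
    transfer = 3 * 8.0 * n**2 / pcie_bw
    return compute, transfer

# Whether "performance" includes the transfer term changes the reported
# number substantially at small n, much less so at large n.
for n in (1_000, 4_000, 16_000):
    c, t = dgemm_times(n)
    print(n, f"compute {c:.4f}s, transfer {t:.4f}s, overhead {t / (c + t):.0%}")
```

At n = 1,000 the transfer roughly matches the compute time; at n = 16,000 it's a small fraction. A benchmark that doesn't say which accounting to use can report very different numbers for the same hardware.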

As I expected, this computer will be used for gene sequencing (actually genome assembly, I think).

[url=“http://www.mnn.com/green-tech/computers/stories/china-boasts-worlds-second-fastest-supercomputer”]http://www.mnn.com/green-tech/computers/st...t-supercomputer[/url]

China boasts world’s second-fastest supercomputer

Mother Nature Network

2010-06-08

China’s ambitions to become a major global power in the world of supercomputing were given a boost when one of its machines was ranked second-fastest in a survey.

The Nebulae machine at the National Supercomputing Centre in the southern city of Shenzhen can perform at 1.271 petaflops, according to the Top 500 survey, which ranks supercomputers.

A petaflop is equivalent to 1,000 trillion calculations per second.

The United States still dominates the list, holding top spot with its Jaguar supercomputer at a government facility in Tennessee, and more than half of the systems on the list, released at a supercomputing conference in Germany.

But China has a total of 24 systems on the list, and two in the top ten, with the Tianhe-1 supercomputer in Tianjin ranking number seven.

And the Nebulae, built by Dawning Information Industry Co., Ltd., has a theoretical speed of 2.98 petaflops, which would make it the fastest in the world.

The machine’s uses include scientific computing and gene sequencing, according to Chinese state media.

Calls to the company for further comment went unanswered.

The supercomputers on the Top 500 list are rated based on speed of performance in a benchmark test. Submissions are voluntary, so it does not include all machines.

The survey is produced twice yearly.