Can you find any information about the genome processing they’re doing? Google didn’t find much. I’m really curious whether they’re actually doing assembly on GPUs, or using them for BLAST-like database lookups, or Bowtie searches, or what.
I know BGI employed thousands of university grads at 2,000RMB/month (US$330/month plus free food and board). I am sure many of these kids will have time to write many things for the GPUs. I think they do have the need to develop their own new assembly programs because they are now the genome center with the highest throughput by a mile.
I published some figures on the #19 More-8.5 machine in this forum about a month ago. The reason that machine achieves only 18% of peak performance is that they plugged six C2050s into each node. Unfortunately, the motherboard cannot provide that much PCI-E bandwidth, and the bi-directional async transfer is even slower than the uni-directional transfer.
The #2 Nebula machine, however, has only one C2050 per node and thus does not have this problem. Even so, its HPL performance is only about 50% of peak.
When testing HPL, they basically use the C2050 to perform the DGEMM operation; no other change is made to the HPL program. This means that in each iteration they copy the matrix over PCI-E into the C2050, do the matrix multiplication, and then copy the result back. The DGEMM operation, using the code that NVIDIA provides, achieves less than 350 GFLOPS per card. So it is not surprising that HPL gets less than 50% of peak.
I think if we can rewrite HPL, so that the whole matrix stays in GPU memory, the result can be greatly improved.
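To put rough numbers on the copy-in/copy-out point above, here is a minimal Python sketch of a cost model. It assumes the transfer and the multiply are fully serialized (no overlap), takes the 350 GFLOPS DGEMM rate mentioned above, and uses a PCI-E bandwidth of 5 GB/s that is purely my assumption, as are the function names and the matrix sizes. It is not how the actual HPL port is written, just a way to see how much the copies can cost.

```python
def dgemm_flops(m, n, k):
    """Floating-point operations for C = A*B + C with A (m x k), B (k x n)."""
    return 2.0 * m * n * k


def transfer_bytes(m, n, k, word=8):
    """Bytes moved over PCI-E per call: A and B in, C in and C back out."""
    return word * (m * k + k * n + 2 * m * n)


def effective_gflops(m, n, k, gpu_gflops=350.0, pcie_gbs=5.0, resident=False):
    """Sustained rate when the copy and the multiply do not overlap.

    gpu_gflops: DGEMM rate on the card (350 is the figure quoted above).
    pcie_gbs:   assumed PCI-E bandwidth in GB/s (hypothetical value).
    resident:   True models the matrix staying in GPU memory (no copies).
    """
    flops = dgemm_flops(m, n, k)
    t_compute = flops / (gpu_gflops * 1e9)
    t_copy = 0.0 if resident else transfer_bytes(m, n, k) / (pcie_gbs * 1e9)
    return flops / (t_compute + t_copy) / 1e9


if __name__ == "__main__":
    m = n = 30000  # hypothetical trailing-matrix size
    for k in (256, 768, 2048):  # k plays the role of the HPL block size NB
        copied = effective_gflops(m, n, k)
        kept = effective_gflops(m, n, k, resident=True)
        print(f"k={k:5d}  copy every call: {copied:6.1f} GFLOPS"
              f"  resident: {kept:6.1f} GFLOPS")
```

In this model the copy cost is dominated by moving C, which does not shrink with k, so small block sizes are hurt the most. Real code can hide part of the transfer with async copies, so treat the output as a pessimistic bound rather than a prediction.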
P.S. The More-8.5 machine is designed to run programs that are extremely compute-intensive, and data will stay in GPU memory as long as possible. That’s why they decided that the PCI-E bandwidth won’t matter.
Thanks for your expert input. I think the top500 site should think about updating their benchmark given that GPU-based supercomputers are on the rise.
DGEMM on Fermi can only reach 3/4 of peak DP.
These are some results for Linpack on a single node:
Dual Intel Xeon X5550 (2.67GHz) CPU, 48GB of memory plus a Tesla C2050 (1.150 GHz) card.
Peak DP CPUs performance: 85 GFlops
Peak DP GPU performance: 515 GFlops (448 cores*clock)
Peak DGEMM GPU performance = 3/4 Peak DP= 386 GFlops
T/V                N    NB     P     Q               Time             Gflops
WR11L2L2       64000   768     1     1             461.95          3.783e+02
If you use the peak DP of the system (600 GFlops): 63% efficiency
Considering the DGEMM peak of the system (472 GFlops): 80% efficiency
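For anyone who wants to check the arithmetic above, here is a small Python sketch that rebuilds the peak and efficiency figures. The clock speeds, core count, N and run time come from the post. The quad-core layout and 4 DP flops per cycle per core for the X5550, and one DP flop per CUDA core per clock for Fermi, are my assumptions to match the quoted peaks. The function names are made up for illustration. The Gflops value is recomputed from N and the run time using the standard HPL operation count of 2/3 N^3 + 3/2 N^2.

```python
def cpu_peak_gflops(sockets=2, cores=4, ghz=2.67, flops_per_cycle=4):
    """Peak DP rate of the dual Xeon X5550 host (assumed 4 flops/cycle/core)."""
    return sockets * cores * ghz * flops_per_cycle


def gpu_peak_gflops(cuda_cores=448, ghz=1.15):
    """Peak DP rate of one C2050 (assumed 1 DP flop per core per clock)."""
    return cuda_cores * ghz


def hpl_gflops(n, seconds):
    """Rate HPL reports for problem size n solved in the given time."""
    return (2.0 / 3.0 * n**3 + 1.5 * n**2) / seconds / 1e9


cpu_peak = cpu_peak_gflops()           # about 85 GFlops
gpu_peak = gpu_peak_gflops()           # about 515 GFlops
gpu_dgemm_peak = 0.75 * gpu_peak       # DGEMM tops out at 3/4 of peak DP

system_peak = cpu_peak + gpu_peak              # about 600 GFlops
system_dgemm_peak = cpu_peak + gpu_dgemm_peak  # about 472 GFlops

measured = hpl_gflops(n=64000, seconds=461.95)

eff_vs_peak = measured / system_peak
eff_vs_dgemm = measured / system_dgemm_peak

if __name__ == "__main__":
    print(f"measured HPL rate:        {measured:.1f} GFlops")
    print(f"efficiency vs peak DP:    {eff_vs_peak:.0%}")
    print(f"efficiency vs DGEMM peak: {eff_vs_dgemm:.0%}")
```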
The Fermi DGEMM used in HPL can sustain around 360 Gflops for big matrices.
Since the code is combining CPU and GPU, most of the DGEMMs are running above 400 Gflops.
Increasing the problem size will increase the efficiency.
Are these DGEMM performance numbers (300+ GFLOPS) using the CUBLAS library? I’m assuming the performance numbers are solely for the GPUs and not using CPUs as co-processors, correct?
I don’t think so. My understanding of NVIDIA’s HPL port (as described here) is that the host and device DGEMM calls overlap for the big DGEMM operations in the stage that follows panel factorization.
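A rough way to picture the overlapping host and device DGEMM is a static split of the columns of the update between the GPU and the CPU, sized so both finish at about the same time. The Python sketch below is only a model of that idea, not NVIDIA's code. The 360 GFlops GPU rate is the sustained figure quoted earlier in the thread; the 75 GFlops sustained CPU rate, the matrix sizes and the function names are my assumptions, and PCI-E transfer time is ignored.

```python
def split_columns(n_cols, gpu_gflops, cpu_gflops):
    """Divide n_cols between GPU and CPU in proportion to their DGEMM rates."""
    gpu_share = gpu_gflops / (gpu_gflops + cpu_gflops)
    n_gpu = round(n_cols * gpu_share)
    return n_gpu, n_cols - n_gpu


def overlapped_gflops(m, n, k, gpu_gflops=360.0, cpu_gflops=75.0):
    """Combined rate when both parts run concurrently.

    The update finishes when the slower of the two parts finishes, so the
    combined rate can approach, but not exceed, gpu_gflops + cpu_gflops.
    """
    n_gpu, n_cpu = split_columns(n, gpu_gflops, cpu_gflops)
    t_gpu = 2.0 * m * n_gpu * k / (gpu_gflops * 1e9)
    t_cpu = 2.0 * m * n_cpu * k / (cpu_gflops * 1e9)
    return 2.0 * m * n * k / max(t_gpu, t_cpu) / 1e9


if __name__ == "__main__":
    m, n, k = 30000, 30000, 768  # hypothetical update size, k as block size
    n_gpu, n_cpu = split_columns(n, 360.0, 75.0)
    print(f"GPU columns: {n_gpu}, CPU columns: {n_cpu}")
    print(f"combined rate: {overlapped_gflops(m, n, k):.1f} GFlops")
```

With these assumed rates the combined figure lands above 400 GFlops, which is consistent with the observation that most of the DGEMMs run above 400 Gflops once the CPU cores are contributing.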
There is another Tesla supercomputer coming at Tokyo Tech in Japan: