Tesla C2050 performance comparison with C1060

tcbabu · July 22, 2010, 2:04pm

We have a fairly large cluster of Tesla C1060s, which we plan to replace with C2070s. We use the cluster for CFD applications.
It is a major CFD application with an adaptive grid, using octree structures. We managed to port the code to the CPU-GPU cluster and the performance is okay, but not spectacular. We expected the performance to improve once we upgraded to the C2070.
Recently we had the opportunity to test the code on a C2050 GPU. To our shock and surprise, the performance was lower. A piece of computation which took 44 secs on the C1060 took 49 secs on the C2050.

eyalhir74 · July 22, 2010, 2:36pm

There was a report here a few days ago that turning off ECC helped improve performance.

Furthermore, from my experience, you should also experiment with the thread block size and look again at occupancy issues.

In my code, moving to Fermi by just compiling with sm_20 didn't have any effect. Changing the block size from 64 to 128 gave a ~30-40% performance boost (mostly because the algorithm is memory bound, and this is the difference in bandwidth between the 1060 and the 2050).
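As a rough illustration of why block size can matter more on Fermi, here is a minimal sketch that estimates occupancy from the per-SM thread and block limits alone. The limits are assumptions taken from the published compute capability tables (1.3 for the C1060, 2.0 for the C2050); register and shared-memory pressure are ignored, so real occupancy may be lower. The names `SmLimits`, `resident_threads` and `occupancy_percent` are made up for this example.

```cpp
// Back-of-the-envelope occupancy from thread and block limits only.
// Assumed per-SM limits (not from the thread):
//   compute capability 1.3 (C1060): 1024 resident threads, 8 resident blocks
//   compute capability 2.0 (C2050): 1536 resident threads, 8 resident blocks
struct SmLimits {
    int max_threads;  // resident threads per SM
    int max_blocks;   // resident blocks per SM
};

constexpr SmLimits kC1060{1024, 8};
constexpr SmLimits kC2050{1536, 8};

// Threads resident on one SM for a given block size: the block count is
// capped both by the thread limit and by the block limit.
constexpr int resident_threads(SmLimits sm, int block_size) {
    return (sm.max_threads / block_size < sm.max_blocks
                ? sm.max_threads / block_size
                : sm.max_blocks) * block_size;
}

// Occupancy as a whole percentage of the SM's thread capacity (truncated).
constexpr int occupancy_percent(SmLimits sm, int block_size) {
    return 100 * resident_threads(sm, block_size) / sm.max_threads;
}
```

Under these assumptions, 128-thread blocks fill a C1060 SM completely but only about two thirds of a C2050 SM, because the 8-block limit is reached first; 192 or 256 threads per block would be needed to reach the thread limit on Fermi.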

eyal

tcbabu · July 22, 2010, 2:58pm

We are already using a block size of 128 on the C1060. I just got the C2050 today. I expected very high performance, since our code is heavily double precision and since the C2050 has 512 cores. When we use 2 C1060s the computation time drops from 44 secs to 25 secs, and to 17 secs with 4 C1060s. We have just a single C2050, and it took 49 secs (for 10 iterations of the CFD code). I will test the performance with a larger block size. On the C1060 the code failed to go beyond 128: many GPU threads failed to run their computation. The code takes 40 secs on 2 CPUs (8 cores/16 threads). We can use CPU cores and GPUs together. My worry is whether Fermi is worth the money. Thanks.
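For reference, the multi-GPU timings quoted above can be turned into speedup and parallel efficiency figures. This is only a small arithmetic sketch; the function names are invented for the example and the inputs are the wall-clock times from the post.

```cpp
// Speedup of an n-GPU run relative to the single-GPU run.
constexpr double speedup(double t_one_gpu, double t_n_gpus) {
    return t_one_gpu / t_n_gpus;
}

// Parallel efficiency: speedup divided by the number of GPUs used.
constexpr double efficiency(double t_one_gpu, double t_n_gpus, int n_gpus) {
    return speedup(t_one_gpu, t_n_gpus) / n_gpus;
}
```

With 44 secs on one C1060, `efficiency(44.0, 25.0, 2)` is 0.88 and `efficiency(44.0, 17.0, 4)` is about 0.65, so the code scales reasonably across two cards and less well across four.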

cho · July 22, 2010, 3:27pm

http://www.nvidia.com/object/product_tesla_C2050_C2070_us.html

The C2050 has only 448 cores.

tcbabu · July 22, 2010, 3:29pm

It has 512 cores but uses only 448; it is similar for the C1060.

seibert · July 22, 2010, 3:59pm

I won’t try to ponder the philosophical difference between having 448 cores vs. 512 cores with 64 permanently disabled ( :) ), but the C1060 has 240 cores with none disabled. The difficulty that NVIDIA’s fabricator is having with their 40 nm process is driving them to use Fermi chips with up to two bad SMs (turned off, of course) in order to get acceptable yield. This was not required with the GT200-series GPUs used in the C1060.

tcbabu · July 22, 2010, 4:01pm

It has 512 cores but uses only 448; it is similar for the C1060.

tcbabu · July 22, 2010, 4:21pm

THANKS

E.D_Riedijk · July 22, 2010, 5:36pm

Turning off ECC will probably already help a lot.

Jimmy_Pettersson · July 23, 2010, 12:09am

For those 128 threads per block, what exactly are the dimensions?

EDIT: Oh, and could you run the visual profiler?

tcbabu · July 23, 2010, 1:47am

I will try with ECC off. But has anybody got excellent performance from a C2050 for an HPC application?

Thanks.

tcbabu · July 23, 2010, 2:14am

Ours is a Linux cluster. We launch many millions of threads, mapped to a linear array whose size depends on the user input.

All memory allocations are dynamic.

Jimmy_Pettersson · July 23, 2010, 2:20am

I’ve read several success stories on these boards for similar cards.

For example, for the GTX480 (480 SPs at around 144 GB/s; your card has 448 SPs at 133 GB/s):

http://forums.nvidia.com/index.php?showtopic=174812&mode=threaded&pid=1091217 (950 GFLOPS)

But I think that in general you shouldn’t expect massive performance improvements. For example, the C1060 has 102 GB/s while the C2050 goes up to 144 GB/s, which means you probably shouldn’t expect a performance increase of more than 1.41x for bandwidth-bound applications.

Meanwhile the raw performance is up from 624 GFLOPS to 1030 GFLOPS (single precision, FMAD only), or 1.65x.
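The two ratios above can be applied to the 44 secs measured on the C1060 to see what a C2050 run would take if the runtime scaled purely with one of those hardware figures. This is a rough sketch, not a prediction; `projected_seconds` is a name chosen for this example and the rates are the ones quoted in this post.

```cpp
// Projected runtime on the new card if runtime scaled inversely with a single
// hardware rate (memory bandwidth or peak FLOPS). A rough bound only.
constexpr double projected_seconds(double old_seconds,
                                   double old_rate,
                                   double new_rate) {
    return old_seconds * old_rate / new_rate;
}
```

`projected_seconds(44.0, 102.0, 144.0)` gives about 31.2 secs for a purely bandwidth-bound code and `projected_seconds(44.0, 624.0, 1030.0)` about 26.7 secs for a purely single-precision compute-bound one. The measured 49 secs is slower than either figure, and slower than the C1060 itself.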

That your current performance is lower indicates that you are somehow underutilizing the card. Perhaps the Fermi tuning guide can reveal the crux?

tcbabu · July 23, 2010, 2:49am

Thanks. Probably I am hitting the memory bandwidth limit. Some tuning can be done anyway, but my expectation was much higher, since Fermi has very high double precision capability and also has many more cores for the computation. Probably the Fermi upgrade is not worth the money. A lot of effort has been put into porting this code to GPUs. In the end, a few more CPUs can do the job. The code uses MPI for machine-to-machine communication and performs extremely well.

Thanks again.

Jimmy_Pettersson · July 23, 2010, 2:54am

I’m not sure how your statements correlate to my 2 questions but I can try replying anyways.

I’ve had no issues using the visual profiler on Linux. If you don’t have a graphical interface on the cluster, you can use something like Windows with Xming (an X server), right?

Normally one has fixed block dimensions and handles an array length that is not a multiple of the block size with a simple if statement ( if( t_id < array_length) ).
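A minimal host-side sketch of that pattern: round the block count up so every element is covered, and let the in-kernel guard switch off the surplus threads in the last block. The helper names are illustrative and not from any library.

```cpp
// Number of fixed-size blocks needed to cover n elements (rounded up).
constexpr int blocks_for(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

// Threads launched past the end of the array. These are the threads that the
// in-kernel guard `if (t_id < array_length)` turns into no-ops.
constexpr int idle_threads(int n, int block_size) {
    return blocks_for(n, block_size) * block_size - n;
}
```

For example, 1000 elements with 128 threads per block need `blocks_for(1000, 128)` = 8 blocks, and the guard masks off the 24 surplus threads in the last block.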

Lev · July 23, 2010, 9:55am

How did you recompile your code?
If your code is double precision, you should see a speedup.

tcbabu · July 23, 2010, 12:47pm

Yes, there seems to be some problem in understanding what you mean. Here is the piece of code for your information:

    /* Kernel: one thread per cell. MaxThds must equal the block size (blockDim.x). */
    __global__ void Gpucalcginfo(CELLLINKS *clink, int MaxThds, int size) {
        int i, l, n;
        CELL *cinfo;

        i = threadIdx.x;
        l = blockIdx.x;
        n = i + MaxThds * l;              /* global linear cell index */
        if (n < size) {                   /* guard the partially filled last block */
            cinfo = clink->gcinfo[n];
            d_calc_gcell(n, cinfo, clink);
        }
    }

    int cudamain(CELLLINKS *clink) {
        int Ng, Np;
        int Ngridg, Ngridp;
        int MaxThds = 128;
        dim3 gthrds(MaxThds, 1);
        dim3 pthrds(MaxThds, 1);

        Ng = clink->gcount;
        Ngridg = Ng / MaxThds;            /* round up so all Ng cells are covered */
        if (Ng % MaxThds != 0) Ngridg++;
        dim3 grids(Ngridg, 1);
        Gpucalcginfo<<<grids, gthrds>>>(clink->GPUCelllink, MaxThds, Ng);
        cutilSafeCall(cudaThreadSynchronize());

        Np = clink->pcount;
        Ngridp = Np / MaxThds;
        if (Np % MaxThds != 0) Ngridp++;
        dim3 pgrids(Ngridp, 1);
        /* Gpucalcpinfo is not shown in this post */
        Gpucalcpinfo<<<pgrids, pthrds>>>(clink->GPUCelllink, MaxThds, Np);
        cutilSafeCall(cudaThreadSynchronize());

        return 1;
    }

My worry is that I am getting lower performance on the C2050 than on the C1060, even though the code is heavily double precision.

The code accesses some memory, but it is mostly computation intensive.

We also tried using 3D thread blocks (i.e. threadIdx.y, threadIdx.z etc. greater than 1), but it had no effect on performance.

Thanks again

David_Whittaker · July 23, 2010, 2:07pm

There’s your problem. 128*128 is a total of 16k threads.

Fermi supports 1536 threads per SM, with 14 SMs = 21,504 threads. gthreads should be at least 168.

Edit: scratch that, I misread it… make sure that Ngridg and Ngridp are at least 168 to fully utilize the card.

Also, if you used any integer mul24’s in d_calc_gcell switch them to normal multiplies. It might be helpful to see that code as well if you can post it.
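The thread-count arithmetic in this post can be sketched as follows. It uses only the per-SM thread limit, as the post does, and ignores the separate limit on resident blocks per SM; the function names are chosen for this example.

```cpp
// Threads needed to fill every SM to its resident-thread limit.
constexpr int threads_to_fill(int threads_per_sm, int num_sms) {
    return threads_per_sm * num_sms;
}

// Smallest number of blocks of the given size that supplies that many threads.
constexpr int min_blocks_to_fill(int threads_per_sm, int num_sms, int block_size) {
    return (threads_to_fill(threads_per_sm, num_sms) + block_size - 1) / block_size;
}
```

For the C2050 figures quoted above, `threads_to_fill(1536, 14)` is 21,504 and `min_blocks_to_fill(1536, 14, 128)` is 168.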

tcbabu · July 23, 2010, 3:23pm

Ngridg is always a very large number. Since it is a CFD code (in a sense a commercial one), we allow users to use anywhere from a few million cells to 100 million cells. Since we compile the same code for the CPU as well, we do not use very GPU-specific instructions.

For the maintainability of the code we just want to keep a single version. The calc functions are quite involved.
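One point that may be worth checking at these cell counts: on compute capability 1.x and 2.x devices each grid dimension is limited to 65,535 blocks, so a one-dimensional grid of 128-thread blocks covers at most about 8.4 million cells. The sketch below shows one possible way to spread the blocks over a 2D grid and recover the linear cell index; it is an assumption about how the launch could be arranged, not the code used in this application, and all names are illustrative.

```cpp
struct Grid2D {
    int x;
    int y;
};

// Per-dimension grid limit on compute capability 1.x and 2.x devices.
constexpr int kMaxGridDim = 65535;

// Spread num_blocks over x and y so that neither dimension exceeds the limit.
// The grid may contain a few surplus blocks; the kernel's bounds check
// (n < size) has to mask them off.
constexpr Grid2D split_grid(long long num_blocks) {
    return Grid2D{
        static_cast<int>(num_blocks < kMaxGridDim ? num_blocks : kMaxGridDim),
        static_cast<int>((num_blocks + kMaxGridDim - 1) / kMaxGridDim)};
}

// Linear element index a thread would compute inside the kernel from
// blockIdx.x, blockIdx.y, gridDim.x, blockDim.x and threadIdx.x.
constexpr long long linear_index(int block_x, int block_y, int grid_x,
                                 int block_size, int thread_x) {
    return (static_cast<long long>(block_y) * grid_x + block_x) * block_size
           + thread_x;
}
```

For 100 million cells at 128 threads per block, 781,250 blocks are needed, and `split_grid(781250)` yields a 65535 x 12 grid. Smaller runs that fit in one row keep `y` equal to 1, so the index reduces to the one-dimensional formula.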

Thanks again

Lev · July 23, 2010, 4:12pm

Interesting, does cinfo = clink->gcinfo[n]; refer to host memory?
