Tesla C2050 performance comparison with C1060

We have a fairly large cluster using Tesla C1060s, which we plan to replace with C2070s. We are using the cluster for CFD applications.
It is a major CFD application with an adaptive grid, using octree structures. We managed to port the code to the CPU-GPU cluster and the performance is okay, though not spectacular. We expected the performance to improve once we upgrade to the C2070.
Recently we had the opportunity to test the code on a C2050 GPU. To our shock and surprise, the performance was lower. A piece of computation which took 44 secs on the C1060 took 49 secs on the C2050.

There was a report here a few days ago that turning off ECC helped improve performance.
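Before comparing timings, it may be worth confirming whether ECC is actually on. A minimal sketch using the CUDA runtime, assuming the C2050 is device 0; the helper name `ecc_state` is made up for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if ECC is enabled on `device`,
// 0 if it is disabled (or not supported), -1 if the query failed.
int ecc_state(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    std::printf("%s: ECC %s\n", prop.name, prop.ECCEnabled ? "on" : "off");
    return prop.ECCEnabled ? 1 : 0;
}
```

Calling `ecc_state(0)` only reports the current setting. ECC itself is switched with the driver tools rather than from application code, and the change typically takes effect only after a reboot or GPU reset.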

Furthermore, from my experience, you should also experiment with the number of threads per block and look again at occupancy issues.

In my code, moving to Fermi by just compiling with sm_20 didn't have any effect. Changing the block size from 64 to 128 gave a ~30-40% performance boost (mostly because the algorithm is memory bound, and this is the difference in bandwidth between the 1060 and the 2050).
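A quick way to try this kind of block-size experiment is to time the same kernel at several block sizes with CUDA events. The sketch below uses a stand-in memory-bound kernel (`daxpy`), not the CFD code; the helper names `time_block_size` and `sweep_block_sizes` are made up for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in memory-bound kernel (not the CFD code): y[i] = a * x[i] + y[i].
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper: time one launch of daxpy with the given block size.
// Returns elapsed milliseconds, or -1 on failure.
float time_block_size(int n, int block_size) {
    double *x = NULL, *y = NULL;
    size_t bytes = n * sizeof(double);
    if (cudaMalloc(&x, bytes) != cudaSuccess) return -1.0f;
    if (cudaMalloc(&y, bytes) != cudaSuccess) { cudaFree(x); return -1.0f; }
    cudaMemset(x, 0, bytes);
    cudaMemset(y, 0, bytes);

    int blocks = (n + block_size - 1) / block_size;  // round up
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    daxpy<<<blocks, block_size>>>(n, 2.0, x, y);     // warm-up launch
    cudaEventRecord(start);
    daxpy<<<blocks, block_size>>>(n, 2.0, x, y);     // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess) {
        if (cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return ms;
}

// Print timings for a few candidate block sizes.
void sweep_block_sizes(int n) {
    const int sizes[] = {64, 128, 256, 512};
    for (int k = 0; k < 4; ++k)
        std::printf("block size %3d: %.3f ms\n", sizes[k], time_block_size(n, sizes[k]));
}
```

For example, `sweep_block_sizes(1 << 20)` prints one timing per block size. A single timed launch is noisy, so averaging several runs would give more trustworthy numbers.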

eyal

We are already using a block size of 128 on the C1060. I just got the C2050 today. I expected very high performance, since our code is heavily double precision and since the C2050 has 512 cores. When we use 2 C1060s the computation time drops from 44 secs to 25 secs, and to 17 secs with 4 C1060s. We have just a single C2050, and it took 49 secs (for 10 iterations of the CFD code). I will try increasing the block size. On the C1060 the code failed beyond 128: many GPU threads stopped computing. The code takes 40 secs on 2 CPUs (8 cores/16 threads). We can use CPU cores and GPUs together. My worry is whether Fermi is worth the money. THANKS.

http://www.nvidia.com/object/product_tesla…0_C2070_us.html

The C2050 has only 448 cores.

Got 512 but uses only 448, similar for C1060

I won’t try to ponder the philosophical difference between having 448 cores vs. 512 cores with 64 permanently disabled ( :) ), but the C1060 has 240 cores with none disabled. The difficulty that NVIDIA’s fabricator is having with their 40 nm process is driving them to use Fermi chips with up to two bad SMs (turned off, of course) in order to get acceptable yield. This was not required with the GT200-series GPUs used in the C1060.

THANKS

Turning off ECC will probably already help a lot.

As for those 128 threads per block: what exactly are the dimensions?

EDIT: Oh, and could you run the visual profiler?

I will try with ECC off. But has anybody got excellent performance from a C2050 for an HPC application?

Thanks.

Ours is a Linux cluster. We launch many millions of threads, mapped to a linear array whose size depends on the user input.

All memory allocations are dynamic.

I’ve read several success stories on these boards for similar cards.

Example for the GTX 480 (480 SPs @ around 144 GB/s; your card: 448 SPs @ 133 GB/s):

http://forums.nvidia.com/index.php?showtop…amp;pid=1091217 — 950 GFLOPS

But I think that in general you shouldn’t expect massive performance improvements. For example, the C1060 has 102 GB/s while the C2050 is up to 144 GB/s, which means you probably shouldn’t expect a performance increase of more than 1.41x for bandwidth-bound applications.

The raw performance, meanwhile, is up from 624 GFLOPS to 1030 GFLOPS (single precision FMAD only) => 1.65x.
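Spelled out with the figures quoted above, and applying the bandwidth ratio to the 44 s C1060 timing reported earlier in the thread. The last line is only a best-case estimate, assuming the code is purely bandwidth bound and scales ideally:

```latex
\begin{align*}
\frac{144\ \text{GB/s}}{102\ \text{GB/s}} &\approx 1.41 \\
\frac{1030\ \text{GFLOPS}}{624\ \text{GFLOPS}} &\approx 1.65 \\
\frac{44\ \text{s}}{1.41} &\approx 31\ \text{s}
\end{align*}
```

So even in the ideal bandwidth-bound case a single C2050 would land at roughly 31 secs, which makes the measured 49 secs look like a utilization problem rather than a hardware limit.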

That your current performance is lower indicates that you are somehow underutilizing the card. Perhaps the Fermi tuning guide can reveal the crux?

Thanks. Probably I am hitting the memory bandwidth limit. Some tuning can be done anyway, but my expectation was much higher, since Fermi has very high double-precision capability and also many more cores for the computation. Probably the Fermi upgrade is not worth the money. A lot of effort has been put into porting this code to GPUs. In the end, a few more CPUs can do the job. The code uses MPI for machine-to-machine communication and performs extremely well.

Thanks again.

I’m not sure how your statements correlate to my 2 questions but I can try replying anyways.

I’ve had no issues using the visual profiler on Linux. If you don’t have a graphical display on the cluster, you can use something like Windows with Xming (an X server), right?

Normally one uses fairly fixed block dimensions and handles the leftover threads past the end of the array with a simple if statement ( if( t_id < array_length) ).

How did you recompile your code?
If your code is double precision, you should get a performance speedup.

Yes, there seems to be some difficulty in understanding what you mean. Here is the piece of code for your information:

__global__ void Gpucalcginfo(CELLLINKS *clink, int MaxThds, int size) {
    int i, l, n;
    CELL *cinfo;

    i = threadIdx.x;        // thread index within the block
    l = blockIdx.x;         // block index within the grid
    n = i + MaxThds * l;    // global cell index (MaxThds is the block size)

    if (n < size) {         // skip the padding threads of the last block
        cinfo = clink->gcinfo[n];
        d_calc_gcell(n, cinfo, clink);
    }
}

int cudamain(CELLLINKS *clink) {
    int Ng, Np;
    int Ngridg, Ngridp;
    int MaxThds = 128;      // threads per block
    dim3 gthrds(MaxThds, 1);
    dim3 pthrds(MaxThds, 1);

    // First kernel: one thread per entry counted by gcount.
    Ng = clink->gcount;
    Ngridg = Ng / MaxThds;  // number of blocks, rounded up below
    if (Ng % MaxThds != 0) Ngridg++;
    dim3 grids(Ngridg, 1);
    Gpucalcginfo<<<grids, gthrds>>>(clink->GPUCelllink, MaxThds, Ng);
    cutilSafeCall(cudaThreadSynchronize());

    // Second kernel: one thread per entry counted by pcount.
    Np = clink->pcount;
    Ngridp = Np / MaxThds;  // number of blocks, rounded up below
    if (Np % MaxThds != 0) Ngridp++;
    dim3 pgrids(Ngridp, 1);
    Gpucalcpinfo<<<pgrids, pthrds>>>(clink->GPUCelllink, MaxThds, Np);
    cutilSafeCall(cudaThreadSynchronize());

    return 1;
}

My worry is that I am getting lower performance on the C2050 than on the C1060, even though the code is heavily double precision.

The code references some memory, but is mostly compute-intensive.

Actually, we also tried using a 3D grid (i.e. threadIdx.y, threadIdx.z, etc. with dimensions > 1), but it had no effect.

Thanks again

There’s your problem. 128*128 is a total of 16k threads.

Fermi supports 1536 threads per SM, with 14 SMs = 21,504 threads. gthreads should be at least 168.

Edit: scratch that, I misread it… make sure that Ngridg and Ngridp are at least 168 to fully utilize the card.
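The 168 comes from 1536 x 14 = 21,504 resident threads divided by 128 threads per block. A small sketch of the same calculation, with the limits either passed in or read from the device; the function names are made up for illustration, and registers or shared memory may lower the real number of resident threads:

```cuda
#include <cuda_runtime.h>

// Smallest number of blocks that can put the maximum number of resident
// threads on every SM: ceil(threads_per_sm * sm_count / block_size).
int min_blocks_to_fill(int threads_per_sm, int sm_count, int block_size) {
    return (threads_per_sm * sm_count + block_size - 1) / block_size;
}

// Hypothetical helper: same computation with the limits read from the device.
// Returns -1 if the device cannot be queried.
int min_blocks_for_device(int device, int block_size) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return min_blocks_to_fill(prop.maxThreadsPerMultiProcessor,
                              prop.multiProcessorCount, block_size);
}
```

`min_blocks_to_fill(1536, 14, 128)` gives 168 for the C2050. For a C1060 (1024 threads per SM, 30 SMs) the same formula gives 240 blocks.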

Also, if you used any integer mul24’s in d_calc_gcell switch them to normal multiplies. It might be helpful to see that code as well if you can post it.
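For background: on compute capability 1.x, `__mul24` was the fast integer multiply, whereas on Fermi the ordinary 32-bit multiply is the native one and `__mul24` is slower. If the same source has to serve the CPU, the C1060 and the C2050, one option is to hide the choice behind a macro. This is only a sketch; the macro name `IMUL` and the function `linear_index` are made up for illustration:

```cuda
// Use __mul24 only when compiling device code for pre-Fermi GPUs (sm_1x).
// Host code and sm_20+ device code get the plain multiply.
// Note: __mul24 only uses the low 24 bits of each operand.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 200)
#define IMUL(a, b) __mul24((a), (b))
#else
#define IMUL(a, b) ((a) * (b))
#endif

// Hypothetical index helper, usable from host and device code alike.
#ifdef __CUDACC__
__host__ __device__
#endif
inline int linear_index(int block, int block_size, int thread) {
    return IMUL(block, block_size) + thread;
}
```

Inside a kernel this would be called as `linear_index(blockIdx.x, blockDim.x, threadIdx.x)`. The results only match the plain multiply as long as both operands fit in 24 bits.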

Ngridg is always a very large number. Since it is a CFD code (in a sense a commercial one), we generally allow users to run anything from a few million cells to 100 million cells. Since we compile the same code for the CPU as well, we do not use very GPU-specific instructions.

For the maintainability of the code we just want to keep a single version. The calc functions are quite involved.

Thanks again
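With that many cells, one thing worth checking is the grid size itself. On compute capability 1.x and 2.x devices the x-dimension of a grid is limited to 65,535 blocks, so at 128 threads per block a one-dimensional launch covers at most about 8.4 million cells. A grid-stride loop makes the block count independent of the cell count. The sketch below is not the posted code; `scale_cells`, `run_scale_cells` and the per-cell work are placeholders:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread handles cells n, n + stride, n + 2*stride, ...
// so any grid size covers all `size` cells.
__global__ void scale_cells(double *cells, int size, double factor) {
    int stride = blockDim.x * gridDim.x;
    for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < size; n += stride)
        cells[n] *= factor;   // placeholder for the real per-cell work
}

// Hypothetical driver: fills `size` cells (size > 0) with 1.0, doubles them
// on the GPU with a fixed launch configuration, and returns how many came
// back as 2.0, or -1 on a CUDA error.
int run_scale_cells(int size, int blocks, int threads) {
    std::vector<double> host(size, 1.0);
    double *dev = NULL;
    size_t bytes = size * sizeof(double);
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return -1;
    cudaMemcpy(dev, &host[0], bytes, cudaMemcpyHostToDevice);

    scale_cells<<<blocks, threads>>>(dev, size, 2.0);
    cudaError_t err = cudaGetLastError();            // launch errors
    if (err == cudaSuccess)                          // copy back (synchronizes)
        err = cudaMemcpy(&host[0], dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return -1;

    int ok = 0;
    for (int i = 0; i < size; ++i) ok += (host[i] == 2.0);
    return ok;
}
```

For instance, `run_scale_cells(1000, 4, 128)` launches only 512 threads for 1000 cells, and every cell is still processed because each thread loops over more than one index.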

Interesting, does cinfo = clink->gcinfo[n]; refer to host memory?