Tesla C2050 performance comparison with C1060

No, it cannot; it should refer to a GPU address. Proper mapping to device memory is done before initiating cudamain. The code is running on a 400-GPU cluster in production mode, and it can be compiled exclusively for a CPU cluster with a single flag, so the code has been validated very well.
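For context, a minimal sketch of what I mean by mapping to device memory before the kernel launch; the signature of cudamain and the names d_cells/ncells here are only illustrative, not the real code:

// Illustrative sketch only: allocate and copy to device memory, then launch.
#include <cuda_runtime.h>

__global__ void cudamain(double *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i] *= 2.0;   // stand-in for the real flux/physics work
}

void run(const double *h_cells, int ncells)
{
    double *d_cells = 0;
    cudaMalloc((void **)&d_cells, ncells * sizeof(double));     // device address
    cudaMemcpy(d_cells, h_cells, ncells * sizeof(double),
               cudaMemcpyHostToDevice);                         // map host data onto it
    cudamain<<<(ncells + 127) / 128, 128>>>(d_cells, ncells);   // kernel only ever sees GPU addresses
    cudaThreadSynchronize();   // CUDA 3.x-era call; cudaDeviceSynchronize in later toolkits
    cudaFree(d_cells);
}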

Not quite the C2050, but my CFD work definitely benefited from the Fermi generation. Got roughly double the performance going from a GTX260 to GTX470, which is about what was expected.

Do you know the register and local memory usage for your kernel? If usage is low enough, I seem to remember that 128 threads per block is not enough to maximize occupancy either (there is an 8-block limit per SM). Try making your MaxThreads 192 (or 256 if you need a power of two).
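If you are not sure of the usage, you can also query it at runtime (besides --ptxas-options=-v); a rough sketch, with myKernel standing in for your kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x) { if (x) x[threadIdx.x] += 1.0f; }   // stand-in kernel

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    printf("shared mem per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}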

You have provided very little information about your program; maybe you are using the CUDA 2.1 toolkit or something like that. I hope you did not compile for compute capability 1.1 or 1.2, at least. There really are some obvious things that need to be done to port a CUDA program to Fermi; once they are done, one can talk about performance.
"For C1060 the code failed to go beyond 128, many GPU threads started missing to compute." That is strange; is it problem-specific?
One possibility is that the program is transfer-bound.

We are using the CUDA 3.0 stable release. As mentioned earlier, we are submitting millions of CUDA threads. Since it is CFD (Computational Fluid Dynamics), each thread needs to do a lot of work: computing its fluxes, applying boundary conditions, the turbulence model, and so on; a lot of computation to resolve the physics. When we increased the number of threads per block beyond 128, no errors were reported, but the computation was wrong. On analysis we found that not all of the threads were executed(!!!). We now have a mechanism to make sure that all threads are executed; otherwise the program stops and reports the problem (nobody can afford a wrong computation).
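For what it is worth, a minimal sketch of that kind of "did every thread actually run" check; this is only illustrative (the names compute, done, cells are made up), not our actual mechanism:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void compute(float *cells, unsigned char *done, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cells[i] += 1.0f;   // stand-in for the real flux / boundary / turbulence work
        done[i] = 1;        // mark this thread's cell as computed
    }
}

// d_done must be cudaMemset to 0 before the kernel launch.
int check_all_ran(const unsigned char *d_done, int n)
{
    unsigned char *h_done = (unsigned char *)malloc(n);
    cudaMemcpy(h_done, d_done, n, cudaMemcpyDeviceToHost);
    int missing = 0;
    for (int i = 0; i < n; ++i)
        if (!h_done[i]) ++missing;
    free(h_done);
    if (missing) printf("ERROR: %d cells were never computed\n", missing);
    return missing == 0;    // the caller aborts the run if this fails
}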

Right now I am not really worried about optimizing the code for a particular piece of hardware; that can be done later. The real benefit will come only from tuning the computational algorithms. By making such improvements we could get a 100% speed-up, for example by grouping cells of similar type to increase the data parallelism, and so on.
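As an illustration of the cell-grouping idea (the names cellType and buildLaunchOrder are hypothetical; the real data structures are more involved), sorting the cell indices by type before building the launch order keeps the threads of a warp on the same kind of cell:

#include <algorithm>
#include <vector>

// Hypothetical: cellType[i] identifies interior / boundary / wall cells, etc.
struct ByType {
    const std::vector<int> &t;
    ByType(const std::vector<int> &t_) : t(t_) {}
    bool operator()(int a, int b) const { return t[a] < t[b]; }
};

std::vector<int> buildLaunchOrder(const std::vector<int> &cellType)
{
    std::vector<int> order(cellType.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;

    // Group cells of the same type so that consecutive threads (and hence
    // whole warps) follow the same code path and touch similar data.
    std::stable_sort(order.begin(), order.end(), ByType(cellType));
    return order;
}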

My worry now is that a C2050, which has almost double the number of cores and eight times the double-precision capability, is performing worse than the C1060.

We are moving to CUDA 3.1; we will do all the tuning, turn ECC off(!), and so on, and then make a decision on the investment. We have just got a single C2050.

Thanks

For our application such things are of no use.

Thanks, good news. But I am expecting better performance, since Fermi has eight times the double-precision capability.

It sounds like your kernel uses too many registers to have more than 128 threads on a multiprocessor at a time (or uses too much shared memory with 128 threads). But that should show up when properly checking for errors. What was the runtime of the kernel when going beyond 128 threads? If it was extremely short, then the kernel probably did not launch, and error checking should show it.
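For reference, the launch-error check I mean looks roughly like this (CUDA 3.x style; cudaThreadSynchronize was the current call at the time, cudaDeviceSynchronize in later toolkits; myKernel is a placeholder):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(msg)                                          \
    do {                                                         \
        cudaError_t e = cudaGetLastError();                      \
        if (e != cudaSuccess) {                                  \
            printf("%s: %s\n", msg, cudaGetErrorString(e));      \
            exit(1);                                             \
        }                                                        \
    } while (0)

// Usage after the kernel launch:
//   myKernel<<<blocks, threads>>>(args);
//   CUDA_CHECK("launch");      // catches launch failures (e.g. too many registers or threads)
//   cudaThreadSynchronize();
//   CUDA_CHECK("execution");   // catches errors that occur while the kernel runs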

It works fine up to 128; beyond 128 it starts missing threads. The runtime was not too small, but it was lower (and I was happy), and then I noticed the error in the computation. Each thread has to do a lot of things; it has to solve the physics. I think I am making all the checks; of course, that is a belief. I have pasted part of the code elsewhere. All device-related calls are checked. Actually, I lost faith in CUDA after that, but we are managing with our own checking.

It is a sign of an error if the program depends on the number of threads per block. Do you use __syncthreads()? Really, it is either a compiler error or a program error. Does it work with many threads in debug mode?

Yes. But any other device-related calls take place only after the threads have finished.

By the way, you did not by mistake use switches like fast math, or compile for compute capability 1.0/1.1/1.2, in the old variant?

No, no such problems

Just curious, any progress?

Sounds like you have a bug somewhere. Also, from your performance comparison to the CPU, you are not utilizing the GPU properly.

As for the 128 threads: with CUDA, either all threads run or the kernel returns an error. If not all threads run, it sounds like you have some problem in your kernel. You should try running it with the memory checker (cuda-memcheck) as a start.

If you compile with --ptxas-options=-v what is the output?

By the way, do you have a lot of logic in your kernel? CUDA is a SIMD (i.e. vector-like) architecture: if one thread in a warp executes an instruction, all of them do. If you have a lot of if/else logic, then you are wasting quite a bit of computation.
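A toy example of the if/else cost (kernel and variable names made up): in the first kernel, each warp contains threads on both sides of the branch, so it executes both paths one after the other; restructuring so the branch condition is uniform across a warp avoids that.

#include <cuda_runtime.h>

// Divergent: odd and even threads of the same warp take different branches,
// so the warp executes both branches serially.
__global__ void divergent(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) a[i] = a[i] * 2.0f;
        else            a[i] = a[i] + 1.0f;
    }
}

// Less divergent: the condition is uniform across each 32-thread warp,
// so every warp follows a single path.
__global__ void uniformPerWarp(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0) a[i] = a[i] * 2.0f;
        else                   a[i] = a[i] + 1.0f;
    }
}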

Also, with Fermi quite a few things changed. The warp size remained the same, but shared memory changed (32 banks and two instructions per fetch instead of 16 banks and one); it runs two half-warps instead of four quarter-warps, and two half-warps from different warps run in parallel on the same cores. Memory loads/stores are served per full warp instead of per half-warp, if memory serves, which means full warps should be contiguous in memory instead of half-warps. There is a cache, but cache lines are always 128 bytes; there are no longer separate 32/64/128-byte loads/stores, etc.
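To illustrate the full-warp load/store point (kernel names are just examples): a warp reading 32 contiguous floats fits in a single 128-byte line, while a strided pattern makes the same warp touch many lines.

__global__ void contiguousCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // one 128-byte line per aligned warp of floats
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];   // each warp touches up to 32 different lines
                                          // (assumes in holds n * stride elements)
}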

Sorry for the delay in replying; I was trying different options. By the way, the code is running in production mode on a GPU-based cluster. As I mentioned, we have a cluster with 400 Tesla C1060 GPUs housed in 100 machines, which is planned to be upgraded to C2070s. We are aware of the SIMD architecture and made a lot of modifications in the code to make it data-parallel, to the extent possible. Since it is a CFD code, the kernel has to do a lot of work. The performance we got from the C1060 is not bad; in fact, it is better than most of the cases published by NVIDIA. Our worry is that we are losing performance when moving to a C2050, a demo GPU provided by NVIDIA.

We tried many options. We moved to version 3.1, but ptxas fails with a segmentation fault, so we went back to 3.0. We tried different values for the number of threads; it has no effect at all.

We switched ECC off; yes, some effect: now the C2050 is slightly faster than the C1060. We tried many streams simultaneously, no use. We increased the cache size by reducing shared memory (we have no use for shared memory); again, no effect. By the way, when we used a thread block size of 16 there were many missing threads. A segmentation fault from ptxas is not a good sign.
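For the record, by "increased the cache size by reducing shared memory" I mean the per-kernel cache preference; roughly like this, with solverKernel as a placeholder for the real kernel:

#include <cuda_runtime.h>

__global__ void solverKernel(float *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i] += 1.0f;   // stand-in for the real CFD kernel
}

void configureCache()
{
    // On Fermi the 64 KB per SM can be split 48 KB L1 / 16 KB shared memory
    // or 16 KB L1 / 48 KB shared memory. We do not use shared memory, so prefer L1.
    cudaFuncSetCacheConfig(solverKernel, cudaFuncCachePreferL1);
}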

I hope there is no bug in the code; that is just a belief, though, and who can be completely sure? We are trying everything possible. Again, the kernel has to do a lot of work and uses pointers heavily.

We will keep trying.

By the way, I had not provided any comparison with the CPU before; for reference, 2 CPUs (16 threads) take around 50 seconds, one C1060 takes 45 seconds, and 4 GPUs take 13 seconds.

Another idea just came up, regarding a bug I ran into some time ago (there was a thread around here somewhere). Are you running in exclusive mode by any chance?

I had issues with a second kernel on the same card crashing the first when in exclusive mode and with high register usage (64 was the limit, if memory serves). Some people reported the problem also with lower register counts.

If you don't use shared memory, then my guess is that you would also probably gain little from the cache. What about texture usage?
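If you want to try textures, the texture-reference API of that CUDA generation looks roughly like this (names are only illustrative); on Fermi it largely duplicates what the L1 cache already gives you for read-only data:

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> cellTex;    // file-scope texture reference

__global__ void useTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(cellTex, i) * 2.0f;  // cached read-only fetch
}

void launch(const float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, cellTex, d_in, n * sizeof(float));
    useTexture<<<(n + 127) / 128, 128>>>(d_out, n);
    cudaUnbindTexture(cellTex);
}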

If memory serves, the bandwidth of the C2050 is roughly 1.5× that of the C1060, so if your code is bandwidth-limited that would be your bottleneck.

My guess is that if more than 128 threads cause erroneous results, or some threads don't seem to run, you are overflowing a resource somewhere.

Thanks again. It is not in exclusive mode, and since I am testing, only a single job is run at a time. The missing threads can be avoided by using only 128 threads per block. Maybe I am hitting the memory bandwidth limit.

Cuda + MPI is essentially multiple jobs (assuming that is what you use to utilize all GPUs)

How do you map your kernels to the correct GPU?
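For example, do you do something like the usual rank-to-device mapping per MPI process? Just a sketch, assuming one process per GPU and the same number of GPUs on every node (selectDevice is a made-up name):

#include <mpi.h>
#include <cuda_runtime.h>

int selectDevice()
{
    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    int dev = rank % ndev;   // assumes ranks are packed one-per-GPU on each node
    cudaSetDevice(dev);      // must happen before any other CUDA call in this process
    return dev;
}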
