Tesla C2050 performance comparison with C1060

No, it cannot; it should refer to a GPU address. Proper mapping to device memory is done before initiating cudamain. The code is running on a 400-GPU cluster in production mode, and it can be compiled exclusively for a CPU cluster with a single flag, so the code has been validated very well.
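For context, a minimal sketch of what I mean by mapping to device memory before the kernel launch; the signature of cudamain and the names d_cells/ncells here are only illustrative, not the real code:

// Illustrative sketch only: allocate and copy to device memory, then launch.
#include <cuda_runtime.h>

__global__ void cudamain(double *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i] *= 2.0;   // stand-in for the real flux/physics work
}

void run(const double *h_cells, int ncells)
{
    double *d_cells = 0;
    cudaMalloc((void **)&d_cells, ncells * sizeof(double));     // device address
    cudaMemcpy(d_cells, h_cells, ncells * sizeof(double),
               cudaMemcpyHostToDevice);                         // map host data onto it
    cudamain<<<(ncells + 127) / 128, 128>>>(d_cells, ncells);   // kernel only ever sees GPU addresses
    cudaThreadSynchronize();   // CUDA 3.x-era call; cudaDeviceSynchronize in later toolkits
    cudaFree(d_cells);
}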

Not quite the C2050, but my CFD work definitely benefited from the Fermi generation. Got roughly double the performance going from a GTX260 to GTX470, which is about what was expected.

Do you know the register and local memory usage for your kernel? If usage is low enough, I seem to remember that 128 threads per block is not enough to maximize occupancy either (there is an 8-block limit per SM). Try making your MaxThreads 192 (or 256 if you need a power of two).
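If you are not sure of the usage, you can also query it at runtime (besides --ptxas-options=-v); a rough sketch, with myKernel standing in for your kernel:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *x) { if (x) x[threadIdx.x] += 1.0f; }   // stand-in kernel

int main()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);
    printf("registers per thread : %d\n", attr.numRegs);
    printf("local mem per thread : %zu bytes\n", attr.localSizeBytes);
    printf("shared mem per block : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    return 0;
}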

You have provided very little information about your program; maybe you are using the CUDA 2.1 toolkit or something like that. I hope you did not compile for compute capability 1.1 or 1.2, at least. There really are some obvious things that need to be done to port a CUDA program to Fermi; once they are done, one can talk about performance.
"For C1060 the code failed to go beyond 128, many GPU threads started missing to compute." That is strange; is it problem-specific?
One possibility is that the program is transfer-bound.

We are using the CUDA 3.0 stable release. As mentioned earlier, we are submitting millions of CUDA threads. Since it is CFD (Computational Fluid Dynamics), each thread needs to do a lot of work: computing its fluxes, applying boundary conditions, the turbulence model, and so on; a lot of computation to resolve the physics. When we increased the number of threads per block beyond 128, no errors were reported, but the computation was wrong. On analysis we found that not all of the threads were executed(!!!). We now have a mechanism to make sure that all threads are executed; otherwise the program stops and reports the problem (nobody can afford a wrong computation).
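For what it is worth, a minimal sketch of that kind of "did every thread actually run" check; this is only illustrative (the names compute, done, cells are made up), not our actual mechanism:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void compute(float *cells, unsigned char *done, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        cells[i] += 1.0f;   // stand-in for the real flux / boundary / turbulence work
        done[i] = 1;        // mark this thread's cell as computed
    }
}

// d_done must be cudaMemset to 0 before the kernel launch.
int check_all_ran(const unsigned char *d_done, int n)
{
    unsigned char *h_done = (unsigned char *)malloc(n);
    cudaMemcpy(h_done, d_done, n, cudaMemcpyDeviceToHost);
    int missing = 0;
    for (int i = 0; i < n; ++i)
        if (!h_done[i]) ++missing;
    free(h_done);
    if (missing) printf("ERROR: %d cells were never computed\n", missing);
    return missing == 0;    // the caller aborts the run if this fails
}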

Right now I am not really worried about optimizing the code for a particular piece of hardware; that can be done later. The real benefit will come only from tuning the computational algorithms. By making such improvements we could get a 100% speed-up, for example by grouping cells of similar type to increase the data parallelism, and so on.
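As an illustration of the cell-grouping idea (the names cellType and buildLaunchOrder are hypothetical; the real data structures are more involved), sorting the cell indices by type before building the launch order keeps the threads of a warp on the same kind of cell:

#include <algorithm>
#include <vector>

// Hypothetical: cellType[i] identifies interior / boundary / wall cells, etc.
struct ByType {
    const std::vector<int> &t;
    ByType(const std::vector<int> &t_) : t(t_) {}
    bool operator()(int a, int b) const { return t[a] < t[b]; }
};

std::vector<int> buildLaunchOrder(const std::vector<int> &cellType)
{
    std::vector<int> order(cellType.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;

    // Group cells of the same type so that consecutive threads (and hence
    // whole warps) follow the same code path and touch similar data.
    std::stable_sort(order.begin(), order.end(), ByType(cellType));
    return order;
}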

My worry now is that a C2050, which has almost double the number of cores and eight times the double-precision capability, is performing worse than the C1060.

We are moving to CUDA 3.1; we will do all the tuning, turn ECC off(!), and so on, and then make a decision on the investment. We have just got a single C2050.

Thanks

For our application such things are of no use.

Thanks, good news. But I am expecting better performance, since Fermi has eight times the double-precision capability.

It sounds like your kernel uses too many registers to have more than 128 threads on a multiprocessor at a time (or uses too much shared memory with 128 threads). But that should show up when properly checking for errors. What was the runtime of the kernel when going beyond 128 threads? If it was extremely short, then the kernel probably did not launch, and error checking should show it.
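For reference, the launch-error check I mean looks roughly like this (CUDA 3.x style; cudaThreadSynchronize was the current call at the time, cudaDeviceSynchronize in later toolkits; myKernel is a placeholder):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(msg)                                          \
    do {                                                         \
        cudaError_t e = cudaGetLastError();                      \
        if (e != cudaSuccess) {                                  \
            printf("%s: %s\n", msg, cudaGetErrorString(e));      \
            exit(1);                                             \
        }                                                        \
    } while (0)

// Usage after the kernel launch:
//   myKernel<<<blocks, threads>>>(args);
//   CUDA_CHECK("launch");      // catches launch failures (e.g. too many registers or threads)
//   cudaThreadSynchronize();
//   CUDA_CHECK("execution");   // catches errors that occur while the kernel runs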

It works fine up to 128; beyond 128 it starts missing threads. The runtime was not too small, but it was lower (and I was happy), and then I noticed the error in the computation. Each thread has to do a lot of things; it has to solve the physics. I think I am making all the checks; of course, that is a belief. I have pasted part of the code elsewhere. All device-related calls are checked. Actually, I lost faith in CUDA after that, but we are managing with our own checking.

It is a sign of an error if the program depends on the number of threads per block. Do you use __syncthreads()? Really, it is either a compiler error or a program error. Does it work with many threads in debug mode?

Yes. But any other device-related calls take place only after the threads have finished.

By the way, you did not by mistake use switches like fast math, or compile for compute capability 1.0/1.1/1.2, in the old variant?

No, no such problems

Just curious, any progress?

Sounds like you have a bug somewhere. Also, from your performance comparison to the CPU, you are not utilizing the GPU properly.

As for the 128 threads: with CUDA, either all threads run or the kernel returns an error. If not all threads run, it sounds like you have some problem in your kernel. You should try running it with the memory checker (cuda-memcheck) as a start.

If you compile with --ptxas-options=-v what is the output?

By the way, do you have a lot of logic in your kernel? CUDA is a SIMD (i.e. vector-like) architecture: if one thread in a warp executes an instruction, all of them do. If you have a lot of if/else logic, then you are wasting quite a bit of computation.
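A toy example of the if/else cost (kernel and variable names made up): in the first kernel, each warp contains threads on both sides of the branch, so it executes both paths one after the other; restructuring so the branch condition is uniform across a warp avoids that.

#include <cuda_runtime.h>

// Divergent: odd and even threads of the same warp take different branches,
// so the warp executes both branches serially.
__global__ void divergent(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) a[i] = a[i] * 2.0f;
        else            a[i] = a[i] + 1.0f;
    }
}

// Less divergent: the condition is uniform across each 32-thread warp,
// so every warp follows a single path.
__global__ void uniformPerWarp(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if ((i / 32) % 2 == 0) a[i] = a[i] * 2.0f;
        else                   a[i] = a[i] + 1.0f;
    }
}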

Also, with Fermi quite a few things changed. The warp size remained the same, but shared memory changed (32 banks and two instructions per fetch instead of 16 banks and one); it runs two half-warps instead of four quarter-warps, and two half-warps from different warps run in parallel on the same cores. Memory loads/stores are served per full warp instead of per half-warp, if memory serves, which means full warps should be contiguous in memory instead of half-warps. There is a cache, but cache lines are always 128 bytes; there are no longer separate 32/64/128-byte loads/stores, etc.
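To illustrate the full-warp load/store point (kernel names are just examples): a warp reading 32 contiguous floats fits in a single 128-byte line, while a strided pattern makes the same warp touch many lines.

__global__ void contiguousCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];            // one 128-byte line per aligned warp of floats
}

__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];   // each warp touches up to 32 different lines
                                          // (assumes in holds n * stride elements)
}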

Sorry for the delay in replying; I was trying different options. By the way, the code is running in production mode on a GPU-based cluster. As I mentioned, we have a cluster with 400 Tesla C1060 GPUs housed in 100 machines, which is planned to be upgraded to C2070s. We are aware of the SIMD architecture and made a lot of modifications in the code to make it data-parallel, to the extent possible. Since it is a CFD code, the kernel has to do a lot of work. The performance we got from the C1060 is not bad; in fact, it is better than most of the cases published by NVIDIA. Our worry is that we are losing performance when moving to a C2050, a demo GPU provided by NVIDIA.

We tried many options. We moved to version 3.1, but ptxas fails with a segmentation fault, so we went back to 3.0. We tried different values for the number of threads; it has no effect at all.

We switched ECC off; yes, some effect: now the C2050 is slightly faster than the C1060. We tried many streams simultaneously, no use. We increased the cache size by reducing shared memory (we have no use for shared memory); again, no effect. By the way, when we used a thread block size of 16 there were many missing threads. A segmentation fault from ptxas is not a good sign.
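For the record, by "increased the cache size by reducing shared memory" I mean the per-kernel cache preference; roughly like this, with solverKernel as a placeholder for the real kernel:

#include <cuda_runtime.h>

__global__ void solverKernel(float *cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) cells[i] += 1.0f;   // stand-in for the real CFD kernel
}

void configureCache()
{
    // On Fermi the 64 KB per SM can be split 48 KB L1 / 16 KB shared memory
    // or 16 KB L1 / 48 KB shared memory. We do not use shared memory, so prefer L1.
    cudaFuncSetCacheConfig(solverKernel, cudaFuncCachePreferL1);
}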

I hope there is no bug in the code; that is just a belief, though, and who can be completely sure? We are trying everything possible. Again, the kernel has to do a lot of work and uses pointers heavily.

We will keep trying.

By the way, I had not provided any comparison with the CPU before; for reference, 2 CPUs (16 threads) take around 50 seconds, one C1060 takes 45 seconds, and 4 GPUs take 13 seconds.

Another idea just came up, regarding a bug I ran into some time ago (there was a thread around here somewhere). Are you running in exclusive mode by any chance?

I had issues with a second kernel on the same card crashing the first when in exclusive mode and with high register usage (64 was the limit, if memory serves). Some people reported the problem also with lower register counts.

If you don't use shared memory, then my guess is that you would also probably gain little from the cache. What about texture usage?
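If you want to try textures, the texture-reference API of that CUDA generation looks roughly like this (names are only illustrative); on Fermi it largely duplicates what the L1 cache already gives you for read-only data:

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> cellTex;    // file-scope texture reference

__global__ void useTexture(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tex1Dfetch(cellTex, i) * 2.0f;  // cached read-only fetch
}

void launch(const float *d_in, float *d_out, int n)
{
    cudaBindTexture(0, cellTex, d_in, n * sizeof(float));
    useTexture<<<(n + 127) / 128, 128>>>(d_out, n);
    cudaUnbindTexture(cellTex);
}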

If memory serves, the bandwidth of the C2050 is roughly 1.5× that of the C1060, so if your code is bandwidth-limited that would be your bottleneck.

My guess is that if more than 128 threads cause erroneous results, or some threads don't seem to run, you are overflowing a resource somewhere.

Thanks again. It is not in exclusive mode, and since I am testing, only a single job is run at a time. The missing threads can be avoided by using only 128 threads per block. Maybe I am hitting the memory bandwidth limit.

Cuda + MPI is essentially multiple jobs (assuming that is what you use to utilize all GPUs)

How do you map your kernels to the correct GPU?
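For example, do you do something like the usual rank-to-device mapping per MPI process? Just a sketch, assuming one process per GPU and the same number of GPUs on every node (selectDevice is a made-up name):

#include <mpi.h>
#include <cuda_runtime.h>

int selectDevice()
{
    int rank = 0, ndev = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    int dev = rank % ndev;   // assumes ranks are packed one-per-GPU on each node
    cudaSetDevice(dev);      // must happen before any other CUDA call in this process
    return dev;
}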
