GTX 480 vs GTX 285, less MP more cores

Hi all,

I have a program which used to run on a GTX 285. Recently I got a new GTX 480 and measured the time the program needs on the GTX 285 and on the GTX 480. It turned out the GTX 285 is faster. I checked the specs of the two GPUs, and I think the smaller number of multiprocessors may be the reason the GTX 480 needs more time: the GTX 480 has 15 MPs with 480 cores, while the GTX 285 has 30 MPs with 240 cores.

The question is: how can I modify my program to make it run faster on the “fewer MPs, more cores” GTX 480? For example, would it be faster if I removed some if…else… control-flow statements? Is there any guide to this kind of performance tuning?
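
To show what I mean, here is a made-up sketch (not my actual code) of the kind of if…else… branch I am thinking of, and a branchless rewrite of it:

// Hypothetical kernel with a data-dependent branch: threads of the same
// warp that take different paths execute both paths one after the other.
__global__ void withBranch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (in[i] > 0.0f)
            out[i] = 2.0f * in[i];
        else
            out[i] = -in[i];
    }
}

// Same result without the data-dependent branch; the compiler can turn
// the conditional expression into a select/predicated instruction.
__global__ void withoutBranch(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i];
        out[i] = (x > 0.0f) ? 2.0f * x : -x;
    }
}

Would rewriting branches like the first kernel into something like the second actually help on the GTX 480?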

Best regards,
ning

The number of MPs won't be the problem here. I use a GTX 480, too, and had a 285 before. My app runs at almost double the speed, as you would expect (when it's not memory bound), because the number of CUDA cores has doubled. Perhaps you want to look into your shared memory accesses and the changed memory access behaviour on Fermi.
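
For example (just a sketch, not from your code): shared memory on Fermi has 32 banks instead of the 16 on GT200, and bank conflicts are now resolved across a whole warp instead of a half-warp, so it is worth re-checking your shared memory access patterns. The classic padding trick for column-wise accesses of a tile still applies:

#define TILE_DIM 32

// Hypothetical tile transpose, launched with 32x32-thread blocks on a
// square matrix whose width is a multiple of TILE_DIM. The "+ 1" padding
// keeps the column reads from all hitting the same shared memory bank.
__global__ void transposeTile(const float *in, float *out, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // coalesced load

    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store, bank-conflict-free read
}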

There are a lot of things to tweak and to check.

Is the warp size of 64 a problem for your code?

I find that tuning the block size is essential on the GTX 480. While a typical kernel on GT200 would only vary about 10% in performance from the worst block size to the best, the same kernels running on GF100 GPUs show 50+% variation.
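
If it helps, this is roughly how I check it (just a sketch; myKernel, d_in, d_out and n stand in for your own kernel and data):

// Sweep block sizes and time the kernel with CUDA events.
for (int block = 64; block <= 512; block += 32) {
    int grid = (n + block - 1) / block;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    myKernel<<<grid, block>>>(d_in, d_out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block size %3d: %.3f ms\n", block, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}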

OMG they killed Kenny! erm… I meant they changed the Warp Size?

Yeah, isn't Fermi's warp size 64?

No. 32, just like everything since the original G80.

Oops… Sorry bout the wrong info… Thanks for correcting, Avid…

phew :)

But I guess, since the SM now issues from two warps at once, you NEVER want a block with fewer than 64 threads (if only one block is active)?

Uhm, this will rarely happen. It would mean you have a problem that needs fewer than 64 threads. If so, and if I still had to do it on the GPU, it wouldn't matter if some threads are idle (unless you run other kernels in parallel)…

One possible bottleneck: GF100 has only 15 SMs, GT200 had 30. Both support up to 8 blocks per SM, so up to 240 concurrent blocks for GT200 vs. 120 for GF100.

So if you have many small blocks, try to make them larger; otherwise there are not enough active warps to hide instruction/memory latency.
In my application, increasing the block size from 64 to 96 threads improved performance on the GTX 480.
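
Rough numbers, assuming the 8-blocks-per-SM limit is what binds (and not registers or shared memory): with 64-thread blocks an SM holds at most 8 x 64 = 512 threads = 16 resident warps, while with 96-thread blocks it holds 8 x 96 = 768 threads = 24 warps, out of the 48 a GF100 SM supports. More resident warps per SM means more latency can be hidden.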

Additionally, GF100 needs twice as many active warps as GT200 to hide latency, since a warp's instruction is now issued over 2 cycles instead of 4, while the instruction latency stayed the same. (See the Fermi Tuning Guide PDF for details.)
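
Rough example (taking ~24 cycles as the ballpark arithmetic latency from the programming guide): on GT200 a warp instruction is issued over 4 cycles, so about 24 / 4 = 6 warps per SM are enough to cover it; on GF100 each scheduler issues a warp over 2 cycles, so you need about 24 / 2 = 12 warps per scheduler, roughly twice as many active warps for the same latency.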