GTX 470 Seems Slow... No Better than GTX 260?

Hi,

I have been programming in CUDA for about 6 months now, using a GTX 260 card (216 CUDA cores). Yesterday, I received my EVGA GTX 470 Fermi card. I expected my CUDA programs to run about twice as fast as on the GTX 260. However, the Fermi card seems to complete the processing tasks in exactly the same time as the GTX 260. Any thoughts on what the problem might be? EVGA Precision shows 99-100% GPU usage for the GTX 470 when my CUDA programs are running, and I have compiled with version 3.0 of the CUDA Toolkit (including selecting the 2.0 architecture)… I am puzzled!

Thanks.

Do you use textures? What is your block size?

I do not use textures. I typically launch 1000+ blocks, each with 512 threads.

With 512 threads per block, you probably have only a small number of registers available per thread. Check whether your program is bandwidth-bound or host-transfer-bound. It may just not be compute-bound. Think about making use of Fermi's larger shared memory or its cache.
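One way to check where the time goes is to time the host-to-device copy, the kernel, and the device-to-host copy separately with CUDA events. The following is only a rough sketch: scaleKernel, effectiveBandwidthGBs and printTimeBreakdown are made-up names, and the trivial kernel is a stand-in for your real one.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: one load, one multiply, one store.
__global__ void scaleKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * in[i];
}

// Effective bandwidth in GB/s for bytesMoved bytes transferred in ms milliseconds.
double effectiveBandwidthGBs(double bytesMoved, float ms)
{
    return bytesMoved / (ms * 1.0e6);
}

// Times the upload, the kernel, and the download separately and prints them.
// Returns 0 on success, 1 if allocation failed or a CUDA error was reported.
// n must be small enough that (n + 511) / 512 fits the device's grid limit.
int printTimeBreakdown(int n)
{
    size_t bytes = (size_t)n * sizeof(float);
    float *hIn = (float *)malloc(bytes);
    float *hOut = (float *)malloc(bytes);
    if (!hIn || !hOut) { free(hIn); free(hOut); return 1; }
    for (int i = 0; i < n; ++i) hIn[i] = 1.0f;

    float *dIn = 0, *dOut = 0;
    cudaMalloc((void **)&dIn, bytes);
    cudaMalloc((void **)&dOut, bytes);

    cudaEvent_t ev[4];
    for (int k = 0; k < 4; ++k) cudaEventCreate(&ev[k]);

    cudaEventRecord(ev[0], 0);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(ev[1], 0);
    scaleKernel<<<(n + 511) / 512, 512>>>(dIn, dOut, n);
    cudaEventRecord(ev[2], 0);
    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(ev[3], 0);
    cudaEventSynchronize(ev[3]);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        float h2d = 0.0f, kern = 0.0f, d2h = 0.0f;
        cudaEventElapsedTime(&h2d, ev[0], ev[1]);
        cudaEventElapsedTime(&kern, ev[1], ev[2]);
        cudaEventElapsedTime(&d2h, ev[2], ev[3]);
        printf("H2D copy: %.3f ms\n", h2d);
        // The kernel reads 'bytes' and writes 'bytes', so it moves 2 * bytes.
        printf("kernel:   %.3f ms (%.1f GB/s effective)\n",
               kern, effectiveBandwidthGBs(2.0 * (double)bytes, kern));
        printf("D2H copy: %.3f ms\n", d2h);
    }

    for (int k = 0; k < 4; ++k) cudaEventDestroy(ev[k]);
    cudaFree(dIn);
    cudaFree(dOut);
    free(hIn);
    free(hOut);
    return err == cudaSuccess ? 0 : 1;
}
```

Call printTimeBreakdown(1 << 24) from main. If the two copies dominate, a faster GPU will not help much. If the kernel's effective bandwidth is already close to what the card can deliver, the kernel is likely bandwidth-bound rather than compute-bound.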

This could be a lot of things; it depends on where the bottleneck is in your particular application. The compiler for Fermi is still being improved.

One thing that came up recently - some of the compiler defaults are different when compiling for sm_20. If you add “-ftz=true -prec-div=false -prec-sqrt=false” to the NVCC command line, does performance improve?
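For background: when targeting sm_20, nvcc defaults to IEEE-compliant single-precision division and square root and keeps denormals, whereas sm_1x targets used the faster approximate versions. A quick way to see how much that matters is a small test kernel dominated by those operations, built once with and once without the flags. This is a sketch only: divSqrtChain, divSqrtKernel, timeDivSqrt and the file name divsqrt.cu are all made up for illustration.

```cuda
#include <cstdio>
#include <cmath>
#include <cuda_runtime.h>

// Build twice and compare the reported times (file name is an assumption):
//   nvcc -arch=sm_20 divsqrt.cu
//   nvcc -arch=sm_20 -ftz=true -prec-div=false -prec-sqrt=false divsqrt.cu

// Hypothetical workload: a dependent chain of divisions and square roots.
__host__ __device__ float divSqrtChain(float x, float y, int iters)
{
    for (int k = 0; k < iters; ++k) x = sqrtf(x / y + 1.0f);
    return x;
}

__global__ void divSqrtKernel(float *out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = divSqrtChain((float)i, 3.0f, iters);
}

// Returns the kernel time in milliseconds, or -1.0f on a CUDA error.
float timeDivSqrt(int n, int iters)
{
    float *dOut = 0;
    if (cudaMalloc((void **)&dOut, (size_t)n * sizeof(float)) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    divSqrtKernel<<<(n + 511) / 512, 512>>>(dOut, n, iters);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dOut);
    return ms;
}
```

Call something like timeDivSqrt(1 << 22, 256) from main and print the result for each build. If the two builds differ a lot, the changed defaults probably account for part of the missing speedup in the real code as well.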

Btw, another possible reason for the lower performance is if your code is dominated by sqrt and divisions. Fermi has twice as many special function units per multiprocessor, but roughly half as many multiprocessors.
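To put rough numbers on that: the 216-core GTX 260 has 27 multiprocessors with 2 SFUs each (54 total), and the GTX 470 has 14 multiprocessors with 4 SFUs each (56 total), so the total SFU count is nearly identical. A sketch that prints this for whatever card is installed follows. The helper names sfusPerSM, totalSFUs and printSfuSummary are made up, and the per-SM table only covers compute capability 1.x and 2.0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// SFUs per multiprocessor for the two generations discussed here.
// Returns -1 for compute capabilities this sketch does not cover.
int sfusPerSM(int major, int minor)
{
    if (major == 1) return 2;                // compute 1.x, e.g. GTX 260
    if (major == 2 && minor == 0) return 4;  // compute 2.0, e.g. GTX 470
    return -1;
}

// Total SFUs on the device, or -1 if the compute capability is not covered.
int totalSFUs(int smCount, int major, int minor)
{
    int perSM = sfusPerSM(major, minor);
    return perSM < 0 ? -1 : smCount * perSM;
}

// Prints the SM count and total SFU count for device 0.
// Returns 0 on success, 1 if the device properties could not be read.
int printSfuSummary()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    printf("%s: %d SMs, compute %d.%d, %d SFUs total\n",
           prop.name, prop.multiProcessorCount, prop.major, prop.minor,
           totalSFUs(prop.multiProcessorCount, prop.major, prop.minor));
    return 0;
}
```

With nearly the same number of SFUs on both cards, code that spends most of its time in sqrt, division or other special functions should not be expected to run much faster on the GTX 470.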