It is just a bug in deviceQuery. The CUDA API doesn't return the number of cores per multiprocessor (MP), only the number of MPs. The deviceQuery version you are using seems to assume 32 cores per MP (which is correct for Compute 2.0 cards) rather than the 48 it should use for Compute 2.1 cards.
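To illustrate where the wrong number comes from, here is a minimal sketch of the kind of lookup a tool has to do itself. The function names coresPerMP and totalCores are made up for this post and are not the SDK's own code; in a real program the major, minor and MP count would come from the `major`, `minor` and `multiProcessorCount` fields of `cudaDeviceProp`. The MP count of 7 for the GTX 460 is simply 336 / 48.

```cpp
// Sketch only: hypothetical helper names, not taken from the SDK sources.
// Cores per multiprocessor (MP) by compute capability.
// Returns -1 for versions this table does not know about.
constexpr int coresPerMP(int major, int minor) {
    return (major == 1)               ? 8    // Compute 1.x
         : (major == 2 && minor == 0) ? 32   // Compute 2.0
         : (major == 2 && minor == 1) ? 48   // Compute 2.1
         : -1;                               // unknown: do not guess
}

// Total core count = cores per MP * number of MPs (the only value the API reports).
constexpr int totalCores(int major, int minor, int mpCount) {
    return coresPerMP(major, minor) < 0 ? -1
                                        : coresPerMP(major, minor) * mpCount;
}

// GTX 460 (Compute 2.1, 7 MPs): the correct count.
static_assert(totalCores(2, 1, 7) == 336, "7 MPs x 48 cores");
// Treating it as a Compute 2.0 part gives the number the buggy tool prints.
static_assert(totalCores(2, 0, 7) == 224, "7 MPs x 32 cores");
```

A table that falls back to the 2.0 entry for anything it doesn't recognise would print 224 instead of 336, which is what you are seeing.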
Thanks for a quick reply.
I see the point. I should have read the sources.
I think I have to look for another reason why the GTX 460
is relatively slower than older cards.
I guess I have to wait for SDK 3.2 for openSUSE 11.3.
It is quite plausible that in some cases your card behaves as if it had only 224 CUDA cores. Compute capability 2.1 devices have the (so far) unique property that, at any one time, their 336 cores are fed with instructions from only 224 threads. So to fully exploit the 336 cores, the scheduler has to be able to extract instruction-level parallelism from at least half of those threads at any time, i.e. find a second, independent instruction to issue for them.
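The arithmetic behind "at least half of the threads" can be written down as a toy model. This is only a simplification for illustration, not a description of the actual hardware scheduler: it assumes every issuing thread keeps one core busy and every thread with an independent second instruction keeps one more busy. The name coresFed is made up.

```cpp
// Toy issue model (an assumption for illustration, not a hardware spec).
// threadsIssuing : threads that issue an instruction this cycle
// threadsWithIlp : how many of those also issue a second, independent
//                  instruction (must be <= threadsIssuing)
// Returns the number of cores kept busy, capped at the cores available.
constexpr int coresFed(int cores, int threadsIssuing, int threadsWithIlp) {
    return (threadsIssuing + threadsWithIlp) < cores
         ? (threadsIssuing + threadsWithIlp)
         : cores;
}

// No instruction-level parallelism: the card looks like a 224-core part.
static_assert(coresFed(336, 224, 0) == 224, "no ILP");
// A quarter of the threads dual-issue: still short of full occupancy.
static_assert(coresFed(336, 224, 56) == 280, "some ILP");
// Half of the threads dual-issue: 224 + 112 = 336, all cores busy.
static_assert(coresFed(336, 224, 112) == 336, "enough ILP");
```

Under this model, a kernel whose instructions all depend on the previous one leaves a third of the cores idle, which would match the "224 cores" behaviour.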
Furthermore, it seems that the register file bandwidth is not large enough to support FMA instructions on independent data executing on all cores at the same time, so it is hard to achieve peak GFLOP/s values.
All of this depends on the specific code, so your findings may vary between kernels. Newer CUDA toolkits might also produce code optimized for 2.1 devices. Last time I checked, I got identical object files for -arch=sm_20 and -arch=sm_21. Still, be sure to use the -arch=sm_21 command line switch, so that you take advantage of improvements as soon as they make their way into a new toolkit release.