Cuda cores of GTX460

uesama · January 26, 2011, 11:37am

Dear forum,

I am running CUDA 3.2 on openSUSE 11.3 (x86_64) with two GTX 460 cards in SLI. The NVIDIA driver is the latest one, 260.19.36.

Things mostly work, except that two of the SDK examples do not compile properly.

nvidia-settings reports that the GTX 460 has 336 CUDA cores, while deviceQuery reports 224 cores for each card. I ran number-crunching benchmarks and compared the results with those from my old 9800 GTX+ (SLI) setup.

Considering the number of cores and the clock rates, my guess is that only 224 CUDA cores are enabled on my cards.

I asked the card vendor about this, but they have not replied. (BTW, the cards do not look like a reference model.)

Has any of you encountered this kind of problem? Thanks for your attention.

avidday · January 26, 2011, 12:38pm

It is just a bug in deviceQuery. The CUDA API doesn’t return the number of cores per multiprocessor (MP), only the number of MPs. The deviceQuery version you are using seems to assume that there are 32 cores per MP (which is correct for Compute 2.0 cards), rather than the 48 it should use for Compute 2.1 cards.
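To make the arithmetic concrete, here is a minimal host-only sketch of the lookup a deviceQuery-style tool has to do itself. The function names `cores_per_mp` and `total_cores` are hypothetical, not taken from the SDK source, and the table covers only the two compute capabilities mentioned in this thread. The MP count of 7 used in the comments is inferred from the reported figures (224 / 32 = 336 / 48 = 7). In a real program the three inputs would come from the `multiProcessorCount`, `major` and `minor` fields of `cudaDeviceProp`.

```cpp
#include <cassert>

// Cores per multiprocessor (MP), using only the figures quoted in this
// thread: 32 for compute capability 2.0, 48 for compute capability 2.1.
// Returns -1 for any other compute capability so the caller can report
// "unknown" instead of silently assuming a value.
int cores_per_mp(int major, int minor) {
    if (major == 2 && minor == 0) return 32;
    if (major == 2 && minor == 1) return 48;
    return -1;
}

// Total core count = number of MPs * cores per MP, or -1 if the
// compute capability is not in the table above.
int total_cores(int mp_count, int major, int minor) {
    assert(mp_count >= 0);
    const int per_mp = cores_per_mp(major, minor);
    return per_mp < 0 ? -1 : mp_count * per_mp;
}

// With 7 MPs (inferred, see the note above):
//   total_cores(7, 2, 1) gives 336, the figure nvidia-settings reports
//   total_cores(7, 2, 0) gives 224, the figure the buggy deviceQuery prints
```

Returning -1 for unknown devices is one possible design choice; it avoids the kind of silent default that produced the 224 figure here.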

uesama · January 27, 2011, 1:44am

Thanks for a quick reply.
I see the point; I should have read the sources.
I think I have to look for another reason why the GTX 460 is relatively slower than the older cards.
I guess I have to wait for SDK 3.2 for openSUSE 11.3.

tera · January 27, 2011, 2:31am

It is quite reasonable if you find that in some cases your card behaves as if it only had 224 CUDA cores. Compute capability 2.1 devices have the (so far) unique property that at any one time these 336 cores are fed with instructions from only 224 threads. So to fully exploit the 336 cores, the scheduler has to be able to extract instruction-level parallelism from at least half of the threads at any time.
Furthermore, it seems that the register file bandwidth is not large enough to support FMA instructions on independent data executing on all cores concurrently, so it is hard to achieve peak GFLOP/s values.
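The "at least half of the threads" figure follows directly from the two numbers above. This is a small sketch of the arithmetic only, not a model of the hardware scheduler. The function name `required_ilp_fraction` is hypothetical, and it assumes each issuing thread can supply at most one extra independent instruction per cycle.

```cpp
// Fraction of the issuing threads that must supply a second, independent
// instruction in the same cycle so that every core has work to do.
// Returns 0.0 when the threads already cover all cores (or on bad input).
// A result above 1.0 would mean one extra instruction per thread is not
// enough to fill the cores.
double required_ilp_fraction(int cores, int issuing_threads) {
    if (issuing_threads <= 0 || cores <= issuing_threads) return 0.0;
    return static_cast<double>(cores - issuing_threads) / issuing_threads;
}

// For the figures in this thread: 336 cores fed by 224 threads leaves
// 112 cores idle unless 112 of the 224 threads (a fraction of 0.5)
// each provide a second independent instruction.
```

For a device where the core count equals the number of issuing threads, such as the 224-core case, the function returns 0.0, meaning no instruction-level parallelism is needed to keep all cores busy.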

All of this depends on the specific code, so your findings may vary between kernels. Newer CUDA toolkits might also produce better-optimized code for 2.1 devices. Last time I checked, I got identical object files for -arch=sm_20 and -arch=sm_21. But be sure to use the -arch=sm_21 command line switch, so that you will take advantage of improvements as soon as they make their way into a new toolkit release.