Only 112 cores on Tesla C2070!

Dear List,

We have a Tesla C2070 running on Fedora 13
with the recent NVIDIA drivers for CUDA 3.2 (260.19.14).

When I run “deviceQuery”, I get the following info:

There is 1 device supporting CUDA

Device 0: “Tesla C2070”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1341587456 bytes
Number of multiprocessors: 14
Number of cores: 112
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

To my great surprise, only 112 cores are detected,
instead of the 448 I expected.
The total amount of global memory is also much less than the 6 GB expected.

It's more or less as if only 1/4 of the cores are detected.

How can I understand this?

It looks like you are running an old version of deviceQuery. The CUDA APIs only return the number of multiprocessors, not the number of cores, so the original deviceQuery code had 8 cores per multiprocessor hard-coded into it. For Fermi cards that is incorrect: there are either 32 or 48 cores per MP. That is where the factor of four comes from. You can safely ignore the discrepancy; it is not real.
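
For reference, here is a minimal sketch (not the actual SDK source) of how a fixed-up deviceQuery can derive the core count itself. The runtime really does report only the multiprocessor count, so the cores-per-MP mapping has to be maintained by hand; the coresPerMP helper below is a made-up name that just encodes the 8/32/48 figures above.

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical helper (not part of the CUDA API): cores per
 * multiprocessor as a function of compute capability. */
static int coresPerMP(int major, int minor)
{
    if (major == 1) return 8;                  /* Tesla architecture (sm_1x) */
    if (major == 2) return (minor == 0) ? 32   /* Fermi GF100 (sm_20)        */
                                        : 48;  /* Fermi GF104 (sm_21)        */
    return -1;                                 /* unknown architecture       */
}

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int perMP = coresPerMP(prop.major, prop.minor);
    printf("Device 0: \"%s\"\n", prop.name);
    printf("Multiprocessors: %d\n", prop.multiProcessorCount);
    if (perMP > 0)
        printf("CUDA cores: %d (%d per MP)\n",
               prop.multiProcessorCount * perMP, perMP);
    /* A C2070 (compute 2.0) reports 14 MPs, so 14 * 32 = 448 cores. */
    return 0;
}

Compiled with nvcc, on a C2070 this should print 14 * 32 = 448 cores rather than the hard-coded 14 * 8 = 112.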

Ok, thanks,

Indeed, I was using an old version of deviceQuery that I had wrapped for Python.

The updated version now works perfectly.

Yves

Did it also fix the memory discrepancy?