GeForce GTX 460 & CUDA 3.1 (What is deviceQuery reporting?)

Hello,

I recently upgraded my base system after a 5-year period (yay!). As part of the upgrade, I installed a GF104-based GeForce GTX 460 card.

While performing a shakedown of the system to check for proper functioning, I ran a variety of examples from the SDK, including the deviceQuery example. Running deviceQuery on my new system reports the following:

There are 2 devices supporting CUDA

Device 0: "GeForce GTX 460"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         2
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     7
  Number of cores:                               224
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.43 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No

Device 1: "GeForce 210"
  CUDA Driver Version:                           3.10
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         2
  Total amount of global memory:                 536150016 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  ... snip ...

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.10, CUDA Runtime Version = 3.10, NumDevs = 2, Device = GeForce GTX 460, Device = GeForce 210

PASSED

Press <Enter> to Quit...

According to the GeForce GTX 460 specifications (see GeForce GTX 460), the card is supposed to have 336 CUDA (SP) cores, yet deviceQuery reports only 224. The multiprocessor count of 7 does appear to be correct; based upon my research, the GF104 architecture contains 48 CUDA (SP) cores per multiprocessor, which would give 7 x 48 = 336 cores, whereas deviceQuery is evidently computing 7 x 32 = 224.

It is noted that v. 3.1 of the NVIDIA CUDA C Programming Guide does not list the GeForce GTX 460 in Appendix A as a CUDA-enabled GPU.

Is NVIDIA planning on updating the CUDA runtime for the GF104 (GeForce GTX 460) architecture? Also, can the current CUDA 3.1 runtime system be used for development with full access to the GF104's capabilities, i.e. all 48 cores per multiprocessor?

Thanks,

dpe

===

Intel Core i7 930 (2.8 GHz), 6GB DDR3 RAM (4.8 GT/s); GIGABYTE GeForce GTX 460 & EVGA GeForce 210; GIGABYTE GA-X58A-UD3R Motherboard; Windows 7 Pro, Ubuntu 10.04, & CentOS 5.5

deviceQuery does not actually count the number of CUDA cores. Instead, it counts the number of multiprocessors and multiplies by a scale factor of 8 for compute capability 1.x and 32 for compute capability 2.x. As you note, this is not correct for compute capability 2.1 devices, where the scale factor should be 48. Presumably, this patch to deviceQuery will be made for the CUDA 3.2 release. All the cores are present and available for use even in CUDA 3.1, though.
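For concreteness, here is a minimal sketch of the cores-per-multiprocessor lookup described above; the function name is hypothetical, not the actual SDK code:

    /* Hypothetical helper: cores per multiprocessor by compute capability.
       The 2.1 entry is the correction deviceQuery needs for GF104. */
    int coresPerMultiprocessor(int major, int minor)
    {
        if (major == 1)               return 8;   /* compute capability 1.x */
        if (major == 2 && minor == 0) return 32;  /* GF100: compute 2.0     */
        if (major == 2 && minor == 1) return 48;  /* GF104: compute 2.1     */
        return -1;                                /* unknown architecture   */
    }

With that correction, a GTX 460 would report 7 x 48 = 336 cores rather than 7 x 32 = 224.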

You can make use of the GF104 capabilities now, but you should be aware that utilization of the extra 16 CUDA cores on each multiprocessor depends on the instruction sequence in your kernel. Many kernels are not seeing the benefit, although some synthetic benchmarks do show all 48 CUDA cores are present and functioning. We are hoping that CUDA 3.2 will bring some compiler and driver improvements to allow better performance on compute 2.1.

Hi Seibert,

Thank you for your reply. It looks like deviceQuery relies on the shrUtils.h header file to determine the number of cores per multiprocessor. In my opinion, it would be best for the runtime system to report the number of cores via the cudaDeviceProp structure. It would also be helpful if this structure reported the type of chip architecture in use, i.e. GF100 or GF104.

I would be interested in hearing more about the synthetic benchmarks you are using. The GF100 and GF104 are definitely different architectures: the GF100 targets high level-of-detail (tessellation) performance, while the GF104 uses a very wide (48-core) SIMD multiprocessor. The workloads which perform best on these two architectures are going to be different. I have been thinking that the GF100 may be able to effectively address CAD models and complex geometry calculations, while the GF104 can just plow through lots and lots of volumetric data.

I would appreciate hearing any thoughts you may have concerning this.

Best,

dpe

Agreed. We have asked for such a field in the device properties structure, although until CUDA 3.2 comes out, we won’t know if it was added. In the meantime, you can infer the chip architecture from the compute capability values in the device properties structure: GF100 is compute capability 2.0, and GF104 is 2.1. Significant changes to the architecture get a new number, whereas minor updates do not (e.g., the compute capability 1.3 devices had two functionally identical chip revisions, so no version bump).
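A rough sketch of that check, assuming only the two chip families discussed in this thread:

    /* Sketch: map compute capability to chip family via the runtime API.
       Only the GF100/GF104 cases from this thread are handled here. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        int count = 0;
        cudaGetDeviceCount(&count);
        for (int dev = 0; dev < count; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);
            const char *family = "unknown";
            if (prop.major == 2 && prop.minor == 0)      family = "GF100";
            else if (prop.major == 2 && prop.minor == 1) family = "GF104";
            printf("Device %d: %s (compute %d.%d, %d multiprocessors)\n",
                   dev, family, prop.major, prop.minor,
                   prop.multiProcessorCount);
        }
        return 0;
    }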

I’ll have to dig through the forum to find the benchmarks where someone did finally get full utilization; most of the reports you find are of the negative case.

However, the basic difference between GF100 and GF104 is the way the instruction dispatchers work. You should think of all CUDA multiprocessors as 32-wide SIMD devices, regardless of the number of CUDA cores. Compute capability 1.x can dispatch one SIMD instruction every four clock cycles. Compute capability 2.0 can dispatch two instructions (from different warps) every two cycles. Compute capability 2.1 is more nuanced: it can dispatch two instructions from different warps every two cycles, but can additionally dispatch another instruction from one of those warps as long as it is independent. The GF104 is, in effect, a superscalar GPU, exploiting instruction-level parallelism within a warp to avoid having a third full dispatcher.

Unfortunately, superscalar processors depend to a greater extent on smart programmers and compilers producing an instruction stream that maximizes instruction-level parallelism. I think this is why people are having trouble getting full performance out of the GTX 460 at the moment.
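To make the instruction-level parallelism point concrete, here are two toy kernels (illustrative only; what the compiler actually emits may differ):

    /* Toy example: contrasting dependent and independent instruction streams.
       On compute 2.1, the extra dispatch slot can only co-issue an instruction
       from a warp if it is independent of the one already issued. */
    __global__ void kernel_serial(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float a = in[i];
        a = a * a;   /* each multiply depends on the previous result, */
        a = a * a;   /* so there is nothing independent to co-issue   */
        a = a * a;
        out[i] = a;
    }

    __global__ void kernel_ilp(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float x = in[i];
        float a = x * 2.0f;   /* these three multiplies are mutually  */
        float b = x * 3.0f;   /* independent, giving the superscalar  */
        float c = x * 4.0f;   /* dispatcher something extra to issue  */
        out[i] = a + b + c;
    }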

Apparently, the bottleneck is register file bandwidth more than instruction level parallelism.
