Number of registers vs. different architectures

Dear all,

I have a code (all variables are defined as single-precision float) that runs on three different GPUs:

  1. G210M (laptop) (capability 1.3)
  2. GTX 460 (workstation) (capability 2.1)
  3. Tesla C2070 (workstation at work) (capability 2.0)

The code is compiled for each GPU with the appropriate nvcc flag (-arch sm_13, -arch sm_20) and with the flag -Xptxas="-v", in order to see the number of registers per thread. Why is this number not the same?
Results from the compiler:

  • G210M: 29 registers per thread
  • GTX 460: 37 registers per thread
  • Tesla C2070: 55 registers per thread

Since the code is the same, I would expect the same number of registers per thread, even though the GPUs are different.

Thank you.

Compute capability 1.x and 2.x are two quite different architectures, so it is not surprising they use different numbers of registers. Compute capability 2.0 and 2.1 have no user-visible changes though, so the different numbers seem strange. Are you running different CUDA versions on the two machines?

The CUDA versions are the same; the drivers are different. Sorry, but why is the register count on 2.x almost double the count on 1.x?
Is it because of nvcc optimizations for performance on 2.x that the number of registers doubles?
Thank you.

Michele

Are you compiling for 32 or 64 bit? On 64 bit, pointers take up two registers. Also 2.x introduced an ABI for function calls etc. that unfortunately increases register use quite a lot.

In general, the increase in register use from the use of the ABI should be minor (often just one register). Prior to CUDA 4.0 there were still a few rough edges regarding the use of the ABI which in some cases drove up the register usage significantly. As far as I know, all known issues of this nature were fixed for CUDA 4.0 at the latest.

As has been pointed out above, one should expect to see differences in register usage based on host platform and GPU architecture.

The sm_2x architecture made various changes compared to the sm_1x architecture that tend to increase the use of general purpose registers. In particular it is more of a strict load-store architecture, and it removed separate address registers. Also, with sm_2x IEEE-compliant single-precision division and square root became the default implementations. The subroutines that implement this functionality require additional registers over the approximate implementations used on sm_1x architecture GPUs. Note that programmers can opt out of the IEEE-compliant single-precision division and square root by use of the compiler switches -prec-div=false and -prec-sqrt=false.

In terms of host platform, it has always been the case that on 64-bit host platforms both the host and the device side code use 64-bit pointers, and both use 64-bit “long” on host platforms where “long” is a 64-bit type. This provides storage consistency across the host-device boundary, which is useful for passing structs, for example. Consequently, more registers may be required for device code when the build target is a 64-bit platform, compared to the equivalent 32-bit platform. For sm_1x devices, the compiler applies various optimizations based on the fact that no sm_1x GPU provides more than 4GB of storage, keeping the increase in register use between 32-bit and 64-bit platforms to a minimum. Since there are sm_2x devices with more than 4GB of memory, most of these optimizations are no longer applicable on sm_2x devices. On a 64-bit platform, code compiled for sm_2x therefore tends to use more registers than the same code built for sm_1x. The increase in register usage can be significant if many pointers are used in the code.
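To put rough numbers on the pointer-width effect, here is a small host-side C++ sketch (plain C++, no GPU needed). It assumes 32-bit general purpose registers, so a 64-bit pointer occupies two of them, as mentioned earlier in the thread. The names regs_for_bytes, regs_for, live_pointer_regs and KernelArgs are made up for this illustration and are not part of CUDA. It is only back-of-the-envelope arithmetic: the real count depends on how many values the compiler keeps live at the same time.

```cpp
#include <cstddef>

// Number of 32-bit registers needed to hold a value of `bytes` bytes.
// Assumption for this sketch: general purpose registers are 32 bits wide.
constexpr std::size_t regs_for_bytes(std::size_t bytes) {
    return (bytes + 3) / 4;  // round up to whole registers
}

// Same question, asked for a C++ type as laid out on the build platform.
template <typename T>
constexpr std::size_t regs_for() {
    return regs_for_bytes(sizeof(T));
}

// Hypothetical struct passed from host to device. Its layout follows the
// host platform: 64-bit pointers on a 64-bit build, and a 64-bit long
// wherever the host's long is 64 bits wide.
struct KernelArgs {
    const float* in;
    float*       out;
    long         n;
};

// Registers occupied if `pointer_count` pointers of `pointer_bytes` bytes
// each are all live at once (a worst-case estimate, not a compiler model).
constexpr std::size_t live_pointer_regs(std::size_t pointer_count,
                                        std::size_t pointer_bytes) {
    return pointer_count * regs_for_bytes(pointer_bytes);
}

// Six live pointers: 6 registers with 32-bit pointers, 12 with 64-bit ones.
static_assert(live_pointer_regs(6, 4) == 6,  "32-bit pointers");
static_assert(live_pointer_regs(6, 8) == 12, "64-bit pointers");
```

On a 64-bit build, regs_for<float*>() evaluates to 2 while regs_for<float>() stays at 1, which is why pointer-heavy kernels see the largest increase.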

Thanks njuffa.
Your explanation is clear and exhaustive.

Thanks.