I need the pow function for double values, e.g. pow(double a, int b), in a kernel, but this function needs 24 registers according to the output from the Visual Profiler.
This heavy register usage results in lower occupancy. Are there any alternatives? The intrinsic function __powf(.,.) does not work for me because of its lower accuracy.
Note that the standard power function is actually defined pow(double a, double b), which gives a hint as to why it requires so many registers. Raising a number to a non-integer power is a transcendental function, and thus has a fairly complex implementation.
If your application only requires integer exponents, then you could implement the power function yourself. The simplest approach would be a loop with repeated multiplication, or possibly a switch statement if the maximum exponent is a small number. For larger exponents, you might want to use a more efficient technique such as exponentiation by squaring; a sketch follows below.
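For illustration, here is a minimal sketch of exponentiation by squaring as a device helper; the function name ipow and the choice of an unsigned exponent are mine, not part of any CUDA library:

// Hedged sketch: iterative exponentiation by squaring for a non-negative
// integer exponent. Uses O(log b) multiplications instead of O(b).
__device__ double ipow (double a, unsigned int b)
{
    double r = 1.0;
    while (b > 0) {
        if (b & 1) {      // if the low bit of the exponent is set,
            r *= a;       // fold the current power of a into the result
        }
        a *= a;           // square the base
        b >>= 1;          // move to the next exponent bit
    }
    return r;
}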
pow() is an overloaded function that already has a pow(double,int) variant, so there is no need to roll your own. If the integer exponent is very small, for example 2 or 3, I would recommend explicit multiplication for best performance. For very large integer exponents the accuracy of the pow(double,int) variant will tend to be lower than the accuracy of the pow(double,double) variant.
The high cost of the pow(double,double) variant in terms of register usage and execution time is a consequence of the excellent accuracy guaranteed across the entire input domain in addition to the handling of numerous special cases according to C99 specifications.
The use of specialized exponentiation functions such as exp(), exp2(), exp10(), sqrt(), rsqrt(), cbrt(), and rcbrt() instead of pow() is advised [where applicable] for performance reasons; in some cases this may also result in slightly better accuracy.
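For illustration, a few such substitutions collected in one place (a hedged sketch; the wrapper function and variable names are mine):

// Hedged examples of replacing pow() with specialized CUDA math functions
// where the exponent permits, plus explicit multiplication for a small
// integer exponent.
__device__ void pow_alternatives (double x, double *out)
{
    out[0] = sqrt (x);       // instead of pow (x, 0.5)
    out[1] = rsqrt (x);      // instead of pow (x, -0.5)
    out[2] = cbrt (x);       // instead of pow (x, 1.0 / 3.0)
    out[3] = exp2 (x);       // instead of pow (2.0, x)
    out[4] = x * x * x;      // instead of pow (x, 3.0)
}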
[later:]
A quick look at the implementation of pow(double,int) shows that it includes a reciprocal computation which is called when the exponent is negative. This would increase the overall register usage for the general case. However, when pow(double,int) is called with a positive integer exponent that is a compile-time constant, the call to the reciprocal routine should be optimized out, and the resulting code should require fewer than 24 registers. I will take a look at the generated code when I get a chance.
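A sketch of the structure being described (my own paraphrase, not the actual CUDA library code):

// Hedged sketch: raise x to |n| by squaring, and apply a reciprocal only
// when the exponent is negative. With a positive compile-time constant
// exponent, the compiler can remove the reciprocal branch entirely.
__device__ double pow_di (double x, int n)
{
    unsigned int e = (n < 0) ? 0u - (unsigned int)n : (unsigned int)n;
    double r = 1.0;
    while (e > 0) {
        if (e & 1) r *= x;
        x *= x;
        e >>= 1;
    }
    return (n < 0) ? 1.0 / r : r;
}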
I am not in front of a CUDA-capable machine at the moment; I will look into these kernels on Monday. Are your observations with respect to the CUDA 4.0 toolchain? What platform are you on? Note that the DoubleIntConstant case does not match the scenario I mentioned, where the exponent is a compile-time constant, since “tidx” is not known until run time. So the kernels testDoubleIntArbitrary and testDoubleIntConstant exercise the same code path. An example of the case I referred to would be pow(x,9), where x is a double-precision variable.
CUDA Device #0
Major revision number: 2
Minor revision number: 1
Name: GeForce GT 425M
Total global memory: 1008271360 bytes
Total shared memory per block: 49152 bytes
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647 bytes
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1120000 kHz
Total constant memory: 65536 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: No
Kernel concurrent execution: Yes
Here is what I am seeing with CUDA 4.0 on WinXP64 when compiling for sm_21. I used the following two kernels, and established baseline register usage by simply multiplying the arguments.
// Parameter struct inferred from the usage below (res, argx, argy, n);
// the exact member types in the original code are an assumption.
struct powParams {
    double *res;
    double *argx;
    double *argy;
    int n;
};

__global__ void pow_main (struct powParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;
    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];       // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7.0);            // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]);  // exponent variable
#endif
    }
}
// Parameter struct inferred from the usage below; argy holds the integer
// exponents. The exact member types in the original code are an assumption.
struct powintParams {
    double *res;
    double *argx;
    int *argy;
    int n;
};

__global__ void powint_main (struct powintParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;
    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];       // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7);              // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]);  // exponent variable
#endif
    }
}
The register usage reported by PTXAS via -Xptxas -v is as shown below. I also checked the disassembly to see which registers are actually being used.
Based on this I would consider the register usage of pow() quite reasonable. The register usage would be somewhat lower for all cases when compiling for a 32-bit platform, but I don’t have one to try.
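For reference, the per-kernel register report can be produced with a command along these lines (the file name is hypothetical):

nvcc -arch=sm_21 -Xptxas -v -c pow_regs.cu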
[later:]
Removed the comment about R9 and R11 being unused in the “powint” baseline case. They are used as part of the register pairs R8:R9 and R10:R11, each of which holds a 64-bit operand.