Register usage of pow()

Hi,

I need the pow() function for double values, e.g. pow(double a, int b), in a kernel, but this function needs 24 registers according to the output of the Visual Profiler.

This heavy register usage results in lower occupancy. Are there any alternatives? The intrinsic function __powf(.,.) does not work for me because of its lower accuracy.

thanks in advance!

Note that the standard power function is actually defined as pow(double a, double b), which gives a hint as to why it requires so many registers. Raising a number to a non-integer power is a transcendental function, and thus has a fairly complex implementation.
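For intuition: for a positive base, pow reduces to an exponential and a logarithm. A deliberately naive sketch (this is not the actual CUDA math library implementation, which uses extended-precision intermediates and special-case handling):

__device__ double naive_pow (double a, double b)
{
    /* illustration only: assumes a > 0; the rounding error of log()
       gets amplified by the multiplication with b, which is why the
       library version has to work much harder for full accuracy */
    return exp (b * log (a));
}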

If your application only requires integer exponents, then you should implement the pow() function yourself. The simplest approach would be a while loop, or possibly a switch statement with repeated multiplication if the maximum exponent is a small number. For larger exponents, you might want to use a more efficient technique like repeated squaring, as sketched below.
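Repeated squaring needs only O(log b) multiplications. A minimal sketch; the name ipow and the restriction to non-negative exponents are assumptions here, not from the original post:

__device__ double ipow (double a, int b)
{
    double r = 1.0;
    while (b > 0) {
        if (b & 1) r *= a;  /* fold in the factor for the current bit */
        a *= a;             /* square the base */
        b >>= 1;
    }
    return r;               /* a^b in O(log b) multiplications */
}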

pow() is an overloaded function that already has a pow(double,int) variant, so there is no need to roll your own. If the integer exponent is very small, for example 2 or 3, I would recommend explicit multiplication for best performance. For very large integer exponents the accuracy of the pow(double,int) variant will tend to be lower than the accuracy of the pow(double,double) variant.
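For instance, for a cube:

double x3 = x * x * x;   /* instead of pow (x, 3) */

Two multiplies, with no call overhead or special-case checks.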

The high cost of the pow(double,double) variant in terms of register usage and execution time is a consequence of the excellent accuracy guaranteed across the entire input domain in addition to the handling of numerous special cases according to C99 specifications.

The use of specialized exponentiation functions such as exp(), exp2(), exp10(), sqrt(), rsqrt(), cbrt(), and rcbrt() instead of pow() is advised [where applicable] for performance reasons; in some cases this may also result in slightly better accuracy.
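For example, the following substitutions all use standard CUDA math functions:

sqrt (x);    /* instead of pow (x,  0.5)     */
rsqrt (x);   /* instead of pow (x, -0.5)     */
cbrt (x);    /* instead of pow (x,  1.0/3.0) */
exp2 (x);    /* instead of pow (2.0, x)      */
exp10 (x);   /* instead of pow (10.0, x)     */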

[later:]
A quick look at the implementation of pow(double,int) shows that it includes a reciprocal computation which is called when the exponent is negative. This would increase the overall register usage for the general case. However, when pow(double,int) is called with a positive integer exponent that is a compile-time constant, the call to the reciprocal routine should be optimized out, and the resulting code should require fewer than 24 registers. I will take a look at the generated code when I get a chance.
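To illustrate the distinction, a hypothetical pair of kernels (names and signatures are mine, not from the original post):

__global__ void pow_const_exp (const double *in, double *out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = pow (in[i], 9);  /* positive compile-time constant:
                                     the reciprocal path can be
                                     optimized out */
    }
}

__global__ void pow_var_exp (const double *in, double *out, int e, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = pow (in[i], e);  /* run-time exponent: the reciprocal
                                     path must be kept in case e < 0 */
    }
}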

In my tests with sm_21:

pow(double,int) with an arbitrary exponent: 12 registers
pow(double,int) with an exponent that is a large compile-time constant: 10 registers
pow(double,double) with an arbitrary exponent: 20 registers
pow(double,double) with an exponent that is a large compile-time constant: 18 registers
__powf(double,int) (effectively __powf(float,int)) with an arbitrary exponent: 6 registers

@all: many thanks for your good and professional answers!

Today I got some useful new information.

@hamster143: thanks for your test, I’ll test it too and post my results

So, I have now tested the pow function with different settings, i.e. with the following kernels, compiled for sm_21 as well as sm_20:

dim3 dimBlock (256, 1);
dim3 dimGrid  (2, 128);

testDoubleIntArbitrary<<<dimGrid, dimBlock>>>(50);
testDoubleIntConstant<<<dimGrid, dimBlock>>>();
testDoubleDoubleArbitrary<<<dimGrid, dimBlock>>>(50);
testDoubleDoubleConstant<<<dimGrid, dimBlock>>>();

__global__ void testDoubleIntArbitrary (int e)
{
    const int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    pow (3.0, e);
}

__global__ void testDoubleIntConstant ()
{
    const int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    pow (3.0, tidx);
}

__global__ void testDoubleDoubleArbitrary (double e)
{
    const int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    pow (3.0, e);
}

__global__ void testDoubleDoubleConstant ()
{
    const double tidx = blockDim.x * blockIdx.x + threadIdx.x;
    pow (3.0, tidx);
}

The profiler output shows that each kernel needs 36 registers, which is very strange.

The DoubleDouble versions also generate a lot of local loads and local stores.

Please see the attached images


I am not in front of a CUDA-capable machine at the moment; I will look into these kernels on Monday. Are your observations with respect to the CUDA 4.0 toolchain? What platform are you on? Note that the DoubleIntConstant case does not match the scenario I mentioned, where the exponent is a compile-time constant, since “tidx” is not known until run time. So the kernels testDoubleIntArbitrary and testDoubleIntConstant exercise the same code path. An example of the case I referred to would be pow(x,9), where x is a double-precision variable.
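A kernel that actually exercises that path might look like the sketch below (the kernel name is made up). Note that the result is stored; in the kernels above the result is unused, so the compiler may eliminate the pow() call entirely, which could distort the measurements:

__global__ void testDoubleIntTrueConstant (double *out, double x)
{
    const int tidx = blockDim.x * blockIdx.x + threadIdx.x;
    out[tidx] = pow (x, 9);  /* exponent is a genuine compile-time constant */
}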

I have the following setup:

Windows 7
CUDA 4.0
Visual Studio 2008

I have the following graphics card:

CUDA Device #0
Major revision number: 2
Minor revision number: 1
Name: GeForce GT 425M
Total global memory: 1008271360 bytes
Total shared memory per block: 49152 bytes
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647 bytes
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1120000 kHz
Total constant memory: 65536 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: No
Kernel concurrent execution: Yes

thanks for your effort!

Here is what I am seeing with CUDA 4.0 on WinXP64 when compiling for sm_21. I used the following two kernels, and established baseline register usage by simply multiplying the arguments.

/* Assumed parameter block layout; the original post does not show
   this definition. */
struct powParams {
    double *argx;   /* base operands */
    double *argy;   /* exponent operands */
    double *res;    /* results */
    int     n;      /* element count */
};

__global__ void pow_main (struct powParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;

    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];      // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7.0);           // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]); // exponent variable
#endif
    }
}

/* Assumed parameter block layout; the original post does not show
   this definition. */
struct powintParams {
    double *argx;   /* base operands */
    int    *argy;   /* integer exponent operands */
    double *res;    /* results */
    int     n;      /* element count */
};

__global__ void powint_main (struct powintParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;

    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];      // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7);             // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]); // exponent variable
#endif
    }
}

The register usage reported by PTXAS via -Xptxas -v is as shown below. I also checked the disassembly to see which registers are actually being used.
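For reference, output like the numbers below can be obtained with a compile command along these lines (the file name is a placeholder):

nvcc -arch=sm_21 -c -Xptxas -v pow_test.cu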

pow(double,int)
----------------------------------------------
baseline:                         17 registers
exponent compile-time constant:   17 registers
exponent variable:                24 registers

pow(double,double)
----------------------------------------------
baseline:                         15 registers
exponent compile-time constant:   24 registers
exponent variable:                28 registers

Based on this I would consider the register usage of pow() quite reasonable. The register usage would be somewhat lower for all cases when compiling for a 32-bit platform, but I don’t have one to try.

[later:]

Removed the comment about R9 and R11 being unused in the “powint” baseline case. They are used as parts of the register pairs R8:R9 and R10:R11, each of which holds a 64-bit operand.