I need the pow function for double values, e.g. pow(double a, int b), in a kernel, but this function needs 24 registers according to the output from the Visual Profiler.
This heavy register usage results in lower occupancy. Are there any alternatives? The intrinsic function __powf(.,.) does not work for me because of its lower accuracy.
Note that the standard power function is actually defined pow(double a, double b), which gives a hint as to why it requires so many registers. Raising a number to a non-integer power is a transcendental function, and thus has a fairly complex implementation.
If your application only requires integer exponents, then you could implement the power function yourself. The simplest approach would be a loop with repeated multiplication, or possibly a switch statement if the maximum exponent is a small number. For larger exponents, you might want to use a more efficient technique such as exponentiation by squaring; a sketch follows below.
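For illustration, here is a minimal sketch of exponentiation by squaring as a device helper; the function name ipow and the choice of an unsigned exponent are mine, not part of any CUDA library:

// Hedged sketch: iterative exponentiation by squaring for a non-negative
// integer exponent. Uses O(log b) multiplications instead of O(b).
__device__ double ipow (double a, unsigned int b)
{
    double r = 1.0;
    while (b > 0) {
        if (b & 1) {      // if the low bit of the exponent is set,
            r *= a;       // fold the current power of a into the result
        }
        a *= a;           // square the base
        b >>= 1;          // move to the next exponent bit
    }
    return r;
}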
pow() is an overloaded function that already has a pow(double,int) variant, so there is no need to roll your own. If the integer exponent is very small, for example 2 or 3, I would recommend explicit multiplication for best performance. For very large integer exponents the accuracy of the pow(double,int) variant will tend to be lower than the accuracy of the pow(double,double) variant.
The high cost of the pow(double,double) variant in terms of register usage and execution time is a consequence of the excellent accuracy guaranteed across the entire input domain in addition to the handling of numerous special cases according to C99 specifications.
The use of specialized exponentiation functions such as exp(), exp2(), exp10(), sqrt(), rsqrt(), cbrt(), and rcbrt() instead of pow() is advised [where applicable] for performance reasons; in some cases this may also result in slightly better accuracy.
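For illustration, a few such substitutions collected in one place (a hedged sketch; the wrapper function and variable names are mine):

// Hedged examples of replacing pow() with specialized CUDA math functions
// where the exponent permits, plus explicit multiplication for a small
// integer exponent.
__device__ void pow_alternatives (double x, double *out)
{
    out[0] = sqrt (x);       // instead of pow (x, 0.5)
    out[1] = rsqrt (x);      // instead of pow (x, -0.5)
    out[2] = cbrt (x);       // instead of pow (x, 1.0 / 3.0)
    out[3] = exp2 (x);       // instead of pow (2.0, x)
    out[4] = x * x * x;      // instead of pow (x, 3.0)
}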
[later:]
A quick look at the implementation of pow(double,int) shows that it includes a reciprocal computation which is called when the exponent is negative. This would increase the overall register usage for the general case. However, when pow(double,int) is called with a positive integer exponent that is a compile-time constant, the call to the reciprocal routine should be optimized out, and the resulting code should require fewer than 24 registers. I will take a look at the generated code when I get a chance.
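A sketch of the structure being described (my own paraphrase, not the actual CUDA library code):

// Hedged sketch: raise x to |n| by squaring, and apply a reciprocal only
// when the exponent is negative. With a positive compile-time constant
// exponent, the compiler can remove the reciprocal branch entirely.
__device__ double pow_di (double x, int n)
{
    unsigned int e = (n < 0) ? 0u - (unsigned int)n : (unsigned int)n;
    double r = 1.0;
    while (e > 0) {
        if (e & 1) r *= x;
        x *= x;
        e >>= 1;
    }
    return (n < 0) ? 1.0 / r : r;
}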
I am not in front of a CUDA-capable machine at the moment; I will look into these kernels on Monday. Are your observations with respect to the CUDA 4.0 toolchain? What platform are you on? Note that the DoubleIntConstant case does not match the scenario I mentioned, where the exponent is a compile-time constant, since “tidx” is not known until run time. So the kernels testDoubleIntArbitrary and testDoubleIntConstant exercise the same code path. An example of the case I referred to would be pow(x,9), where x is a double-precision variable.
CUDA Device #0
Major revision number: 2
Minor revision number: 1
Name: GeForce GT 425M
Total global memory: 1008271360 bytes
Total shared memory per block: 49152 bytes
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647 bytes
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 65535
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 1120000 kHz
Total constant memory: 65536 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes
Number of multiprocessors: 2
Kernel execution timeout: No
Kernel concurrent execution: Yes
Here is what I am seeing with CUDA 4.0 on WinXP64 when compiling for sm_21. I used the following two kernels, and established baseline register usage by simply multiplying the arguments.
// Parameter struct inferred from the usage below (res, argx, argy, n);
// the exact member types in the original code are an assumption.
struct powParams {
    double *res;
    double *argx;
    double *argy;
    int n;
};

__global__ void pow_main (struct powParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;
    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];       // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7.0);            // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]);  // exponent variable
#endif
    }
}
// Parameter struct inferred from the usage below; argy holds the integer
// exponents. The exact member types in the original code are an assumption.
struct powintParams {
    double *res;
    double *argx;
    int *argy;
    int n;
};

__global__ void powint_main (struct powintParams parms)
{
    int i;
    int totalThreads = gridDim.x * blockDim.x;
    int ctaStart = blockDim.x * blockIdx.x;
    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
#if 0
        parms.res[i] = parms.argx[i] * parms.argy[i];       // baseline
#elif 0
        parms.res[i] = pow (parms.argx[i], 7);              // exponent compile-time constant
#else
        parms.res[i] = pow (parms.argx[i], parms.argy[i]);  // exponent variable
#endif
    }
}
The register usage reported by PTXAS via -Xptxas -v is as shown below. I also checked the disassembly to see which registers are actually being used.
Based on this I would consider the register usage of pow() quite reasonable. The register usage would be somewhat lower for all cases when compiling for a 32-bit platform, but I don’t have one to try.
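For reference, the per-kernel register report can be produced with a command along these lines (the file name is hypothetical):

nvcc -arch=sm_21 -Xptxas -v -c pow_regs.cu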
[later:]
Removed the comment about R9 and R11 being unused in the “powint” baseline case. They are used as part of the register pairs R8:R9 and R10:R11, each of which holds a 64-bit operand.