Problem with ^(1/4)

I want to calculate the fourth root of all my elements, but I get very strange results.

T = sqrt(T);

works OK, but

T = sqrt(T);
T = sqrt(T);

or

T = pow(T,0.25f);

does not work at all. Do I have to use double precision for these calculations or can I solve it in some other way?

Try powf for single precision numbers.

They should work fine.

But you’re also using double precision calls (but not variables) now. You may want to use sqrtf() and powf(). But you should still be getting the fourth root even with the double precision calls (the arguments would be cast up to double and the results cast back down to float).
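The point about float variables passed to double precision calls can be checked on the host. The following is a minimal sketch in plain C-style C++, not code from the original poster; the function names are made up for illustration, and it assumes the host math library behaves comparably to the device one for ordinary inputs.

```cpp
#include <math.h>

/* Single-precision calls throughout. */
float fourth_root_sqrtf(float x) { return sqrtf(sqrtf(x)); }

/* Double-precision calls on a float value: x is promoted to double
   and the result is rounded back to float on return. */
float fourth_root_sqrt(float x) { return (float)sqrt(sqrt((double)x)); }

/* Single-precision pow with exponent 1/4. */
float fourth_root_powf(float x) { return powf(x, 0.25f); }

/* Largest relative disagreement between the three variants for one input,
   using the double-precision variant as the reference. */
float max_rel_diff(float x)
{
    float a = fourth_root_sqrtf(x);
    float b = fourth_root_sqrt(x);
    float c = fourth_root_powf(x);
    float d1 = fabsf(a - b) / b;
    float d2 = fabsf(c - b) / b;
    return d1 > d2 ? d1 : d2;
}
```

If max_rel_diff() comes back tiny for representative positive inputs, the choice between the float and double variants is unlikely to explain wildly wrong results.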

So what’s your “strange result” that “does not work at all”? It’s unlikely that those two methods you list would fail so badly, especially if a single sqrt() is OK.

No difference if I use sqrtf or powf instead. With the two nested sqrt calls, all my values at the end of the algorithm, after converting to uint16, are either 0 or 65536.
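One way to rule out the final conversion step is to make it explicit. This is only a sketch with a hypothetical helper name (to_uint16_clamped), assuming the values are meant to land in the uint16 range 0 to 65535; the poster's actual conversion code is not shown in the thread.

```cpp
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: convert a float to uint16_t with explicit clamping.
   Values at or below 0 and NaN map to 0; values at or above 65535 map to 65535. */
uint16_t to_uint16_clamped(float v)
{
    /* NaN fails every comparison, so !(v > 0) catches NaN as well as v <= 0. */
    if (!(v > 0.0f)) return 0;
    if (v >= 65535.0f) return 65535;
    return (uint16_t)(v + 0.5f);  /* round to nearest */
}
```

If every output saturates at one end or the other, the float values going into the conversion are already far out of range, NaN, or wrongly scaled, which points at an earlier step rather than at the root itself.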

So your only way of judging whether these functions work is to look at some uint16 conversion or cast of the results? This

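#include <stdio.h>

// Repeatedly take the fourth root of a value and print each result.
__global__ void kernel(void)
{
	float val = 1.2345e6f;
	float exponent = 0.25f;

	for(int i=0; i<4; i++) {
		val = powf(val,exponent);
		printf("%d: %f\n", i, val);
	}
}

int main(void)
{
	// Launch a single thread, then wait so the device printf output is flushed.
	kernel<<<1,1>>>();
	cudaDeviceSynchronize();
	return 0;
}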

does exactly what one would expect:

avidday@cuda:~$ nvcc -arch=sm_20 -Xptxas="-v" 
ptxas info    : Compiling entry function '_Z6kernelv' for 'sm_20'
ptxas info    : Function properties for _Z6kernelv
    16 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 17 registers, 32 bytes cmem[0], 4 bytes cmem[14], 36 bytes cmem[16]
avidday@cuda:~$ ./a.out 
0: 33.332874
1: 2.402803
2: 1.245029
3: 1.056319

Have you considered that something else might not be correct in the code, perhaps?

Yes of course, but since it seems related to the double square root I wanted to check it first.

Introducing sqrt and pow, especially multiple repetitions of the double precision versions, will greatly increase the register count of the compiled kernel. Are you really sure that the kernel is actually running? If you haven’t adjusted your block size, the kernel could be failing to launch.

Problem solved; I made a mistake prior to the sqrt calls.

In terms of performance, you’d want to use nested calls to sqrt() instead of pow() to compute ^(1/4). pow() is quite expensive since it needs to handle many special cases and also needs additional arithmetic operations to ensure good accuracy across all combinations of arguments.
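The nested-sqrt approach generalizes to any root whose order is a power of two. Below is a minimal host-side sketch; root_pow2f is a made-up name, and the assert is only a convenience for catching negative inputs, which the thread does not discuss.

```cpp
#include <assert.h>
#include <math.h>

/* Hypothetical helper: x^(1/2^k) computed with k nested sqrtf calls.
   k = 2 gives the fourth root, k = 4 the sixteenth root. */
float root_pow2f(float x, int k)
{
    assert(x >= 0.0f && k >= 0);
    for (int i = 0; i < k; i++) {
        x = sqrtf(x);
    }
    return x;
}
```

For example, root_pow2f(T, 2) stands in for powf(T, 0.25f). For a fixed, small k one would normally just write the nested calls out, as in sqrtf(sqrtf(T)).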

Performance is not the major issue in this case. Now I want to calculate ^(1/16), but 32-bit floats seem insufficient for this.

What do you mean by “insufficient”? Accuracy? Range? What magnitude numbers are you trying to compute with?
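To put numbers on the accuracy question, one can compare a single-precision sixteenth root against a double-precision reference. This is a rough host-side sketch with illustrative names (root16f, root16_rel_error), not a statement about what the poster's kernel does.

```cpp
#include <math.h>

/* Sixteenth root in single precision: four nested square roots. */
float root16f(float x) { return sqrtf(sqrtf(sqrtf(sqrtf(x)))); }

/* Relative error of root16f against a double-precision pow reference.
   Intended for x > 0. */
double root16_rel_error(float x)
{
    double ref = pow((double)x, 1.0 / 16.0);
    return fabs((double)root16f(x) - ref) / ref;
}
```

For positive inputs one would expect the relative error to stay near the float rounding level (machine epsilon is about 1.19e-7). Range should not be a problem either, since the sixteenth root of any normal positive float lies roughly between 1/256 and 256. If the results still look wrong, the magnitude of the inputs and the scaling applied before the uint16 conversion are the next things to check.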