But you’re also using double-precision calls (though not double-precision variables) now. You may want to use sqrtf() and powf() instead. Even so, you should still be getting the fourth root with the double-precision calls: the float arguments would be promoted to double and the results converted back down to float.
So what’s your “strange result” that “does not work at all”? It’s unlikely that those two methods you list would fail so badly, especially if a single sqrt() is OK.
It makes no difference if I use sqrtf or powf instead. With a double sqrt, all my values at the end of the algorithm, after converting to uint16, are either 0 or 65536.
Introducing sqrt and pow, especially repeated calls to the double-precision versions, will greatly increase the register count of the compiled kernel. Are you sure that the kernel is actually running? If you haven’t adjusted your block size, the kernel could be failing to launch for lack of registers.
In terms of performance, you’d want to use nested calls to sqrt() instead of pow() to compute x^(1/4). pow() is quite expensive since it needs to handle many special cases and also needs additional arithmetic operations to ensure good accuracy across all combinations of arguments.