__powf(x,y) gives nan

Hello, I have a program with the following lines in it:

int propDirS;
cuSingleComplex kz;

...reading value of kz from global memory...

for(propDirS=0;propDirS<2;++propDirS){
...
kz = kz * __powf(__int2float_rd(-1), __int2float_rz(propDirS));
...
}

In this case I get kz = nan. If I use the powf() function instead of the intrinsic __powf(), I get the correct result. Where is the problem?

Thanks in advance,

Dalibor

What is the point of raising -1 to a power?

The __powf function is implemented as exp2f(y * __log2f(x)).
It will give a NaN for a negative x, because __log2f(x) returns NaN when x < 0.
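
As a rough illustration (this is an assumed test kernel, not code from the thread), the difference can be seen directly: __powf() goes through __log2f(), which returns NaN for any negative input, while the full powf() handles a negative base with an integer exponent:

#include <cstdio>

// Assumed test kernel, for illustration only: compares the fast intrinsic
// against the full-precision powf() for a negative base.
__global__ void pow_compare(float x, float y)
{
    float fast = __powf(x, y);   // computed as exp2f(y * __log2f(x)) -> NaN for x < 0
    float full = powf(x, y);     // powf handles negative x with an integer y
    printf("__powf = %f, powf = %f\n", fast, full);
}

int main()
{
    pow_compare<<<1, 1>>>(-1.0f, 1.0f);   // prints __powf = nan, powf = -1.000000
    cudaDeviceSynchronize();
    return 0;
}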

Hi Lev,
I just wanted to use this function in my code to modify a variable inside a loop.

Dalibor

Thanks, that's the reason!

Dalibor

If I understand the intentions of your code correctly, you have a variable propDirS that can take the value 0 or 1. When propDirS is 0, the variable kz should remain unchanged; when it is 1, kz should be negated. If this interpretation is correct, why not use the straightforward code:

kz = propDirS ? -kz : kz;

If the multiplication is required because of some desired side-effect (for example, performing a flush to zero operation on kz to get rid of denormals that may be present in the input), you could use

kz = kz * (1.0f - 2.0f * propDirS);

I didn’t know cuComplex supports an overloaded multiplication operator that multiplies a float with a cuSingleComplex. I would have thought that a component-wise operation is required.
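
A hypothetical sketch of what such a component-wise overload could look like (the actual struct and operators in the original code were written by a colleague and may well differ):

// Hypothetical sketch of a component-wise float * cuSingleComplex overload.
struct cuSingleComplex { float r; float i; };

__host__ __device__ inline cuSingleComplex operator*(const cuSingleComplex& a, float s)
{
    cuSingleComplex c;
    c.r = a.r * s;   // scale real part
    c.i = a.i * s;   // scale imaginary part
    return c;
}

__host__ __device__ inline cuSingleComplex operator*(float s, const cuSingleComplex& a)
{
    return a * s;    // same operation, other operand order
}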

I just wanted to avoid the if statement, but it was not necessary. I was curious about the nan result, which was unexpected for me. Thank you for suggesting the expression

kz = kz * (1.0f - 2.0f * propDirS);

cuSingleComplex is a struct that was written by a colleague of mine. There are also functions and operators that perform operations on this struct with numbers of other types.

Thanks,

Dalibor

It seems I mistook cuSingleComplex for the corresponding data type from the CUDA header file cuComplex.h. If you use your own struct, make sure you use the appropriate align directive to get the benefit of wide loads.

Small, local if-statements are not something CUDA programmers should worry about, and I would not advise manually replacing a potential branch with clever arithmetic expressions. This makes code harder to understand and falls under the general topic of “premature optimization”. Unless there is an important reason to do otherwise, I would recommend writing CUDA code in a clear, natural style.

The GPU hardware offers predicated execution for almost all instructions, provides “select”-type instructions (the direct equivalent of the ternary operator in C/C++), and other optimizations such as uniform branches.

The compiler will most likely turn a simple assignment via the ternary operator into a select-type instruction. It usually compiles very small if-statements into an inline sequence of predicated instructions, and larger if-statements into a combination of predicated instructions and uniform branches.
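
As a rough illustration (assumed kernel, not from the thread), a conditional assignment like the one below would typically be lowered to a select instruction (selp in PTX) rather than a divergent branch:

// Assumed illustration: the ternary assignment below typically compiles to a
// select-type instruction (selp in PTX) instead of a divergent branch.
__global__ void negate_if(float* kz, const int* propDirS, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float v = kz[tid];
        kz[tid] = propDirS[tid] ? -v : v;   // select, not a branch
    }
}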

By wide loads, do you mean coalesced memory access? Yes, I have taken care of that. Thanks for the advice.

If I understand it correctly, predicated execution is a way to get rid of thread divergence. For example, if there is an if statement that would obviously cause thread divergence, the compiler will transform this statement into an inline sequence of predicated instructions. I thought there was no way to avoid thread divergence for divergent if statements.

By wide load I mean that you would want a cuSingleComplex to be loaded with one 64-bit load, not two 32-bit loads. Since data on the GPU must be naturally aligned, your struct needs to be 8-byte aligned. One way of accomplishing that is to map the complex data to a built-in CUDA type such as a float2 or double2, which are already aligned. If you use a completely custom struct, you would want to use the align attribute, like so:

typedef struct __align__(8) { float re; float im; } cuSingleComplex;
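
For comparison, a minimal sketch (names assumed) of the float2 mapping mentioned above; float2 is a built-in type that is already 8-byte aligned, so each element is read with a single 64-bit load:

// Illustrative alternative: store the complex value in the built-in float2
// (.x = real part, .y = imaginary part), which is already 8-byte aligned.
__global__ void copy_complex(const float2* in, float2* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = in[tid];   // one 64-bit load and one 64-bit store per element
    }
}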

Divergent execution can have a negative performance impact on GPUs. However, the compiler does a pretty good job of using the mechanisms provided by the hardware to mitigate the issue, so programmers should not go out of their way to avoid branches. Use the profiler to guide your optimization effort. In many cases you will find that things other than branch divergence are the bottleneck in the code.

Thanks for recommending the profiler. May I use the align directive in this way?

struct __align__(8) cuSingleComplex {
    float r;
    float i;
};