How to declare a __host__ __fmul_rn? __device__ and __host__ defines?

I’d like to declare a host version of __fmul_rn to test some functions that should work both on the host and the device. Are there preprocessor defines that can be used to determine whether host or device code is being compiled? Is there another way to do this?

EDIT: In general, I’m looking for a way to have functions that are both host and device functions but do slightly different things depending on whether they are host or device functions. (e.g. a * b on the host and __fmul_rn(a, b) on the device).

I don’t think you can do that, in principle. Your function is either declared __host__ or __device__, and that’s it. Whether you compile in Release or Emu mode, you’ll still get errors like “can’t call a __device__ function from a __host__ function”, etc. You’ll need two functions. (A related question: can you declare two functions with the same name but different qualifiers, and have the compiler automatically pick the right one?)

Now if you’re talking about macros… those aren’t checked by the compiler as being either device or host. But their contents also can’t be #ifdef’d per target to do anything fancy, afaik.

Consider this example function:

__device__ __host__ float len(float x, float y) {
    return sqrtf(x * x + y * y);
}

That’s perfectly legal, but you might get an unwanted FMAD in the device function. Instead, I’d like to be able to write something like this:

#ifdef DEVICE
#define MUL(x, y) __fmul_rn((x), (y))
#else
#define MUL(x, y) ((x) * (y))
#endif

__device__ __host__ float len(float x, float y) {
    return sqrtf(MUL(x, x) + MUL(y, y));
}

Me too! I’ve just run into the very same problem. Did you find any workaround?

In particular, I have an overloaded operator that is currently device-only, because it uses __fmul_rn.

However, I have an urgent need to make this operator also work on the host. So I would love to generate different code depending on whether the function is executing on the host or on the device.

Is there any way to do that?

Overloading the operator twice, once as __host__ and once as __device__, is rejected by the compiler. It says the operator has already been overloaded.
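
(Roughly what I tried, with a made-up float3 operator standing in for my actual one:)

__host__ float3 operator*(float3 a, float b) {
    return make_float3(a.x * b, a.y * b, a.z * b);
}

/* Rejected: the compiler treats this as a redefinition of the operator above. */
__device__ float3 operator*(float3 a, float b) {
    return make_float3(__fmul_rn(a.x, b), __fmul_rn(a.y, b), __fmul_rn(a.z, b));
}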

Christian

No, I never found a work around. I’ve requested it a couple times in the feature request threads, but no response has been forthcoming so far.

Hi,

If you only need this for the floating-point multiply, I guess you can try to “hack” the compiler (well, this is not a proper solution in all cases, but…)

The point is that normally a * b generates a ‘mul.f32’ instruction (i.e. without a rounding modifier), and the PTX manual says:

“A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer”

In other words, it can be fused into a single mad.rn…

To avoid this, you can alter the nvopencc compiler so that it expands a single-precision floating-point multiplication to, say, mul.rn.f32 (i.e. with an explicit rounding mode); then no such optimization will take place in the ‘ptxas’ phase.

Alternatively, you could preprocess the generated PTX sources (prior to assembly) to replace all occurrences of ‘mul.f32’ with ‘mul.rn.f32’. I do not know whether this is easy to do, but it is probably worth a try.

Well, generally I want the fused multiply-add: more GFLOPS is good.

It is just in the division and multiplication code of the “double single” floating-point emulation that FMA hurts accuracy more than elsewhere.
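
For context, the sensitive spot is the error-free product step (Dekker-style splitting) inside the double-single multiply. Here is a rough device-only sketch of that step, just to show where the correctly rounded multiply matters (the function and variable names are made up, and the 4097.0f split constant assumes binary32):

__device__ void two_prod(float a, float b, float *hi, float *lo) {
    /* Rounded-to-nearest product; must not be contracted into the SP MAD,
       whose multiply is rounded toward zero. */
    float p = __fmul_rn(a, b);
    /* Split a and b into high/low halves (split = 2^12 + 1 for binary32). */
    float c = __fmul_rn(4097.0f, a);
    float a_hi = c - (c - a);
    float a_lo = a - a_hi;
    float d = __fmul_rn(4097.0f, b);
    float b_hi = d - (d - b);
    float b_lo = b - b_hi;
    /* Recover the rounding error of p; every multiply here must stay a plain
       round-to-nearest mul.rn, otherwise the correction term is wrong. */
    *lo = ((__fmul_rn(a_hi, b_hi) - p) + __fmul_rn(a_hi, b_lo)
           + __fmul_rn(a_lo, b_hi)) + __fmul_rn(a_lo, b_lo);
    *hi = p;
}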

You can have a look at how it is done in the CUDA headers in the cuda/include directory. The __CUDABE__ macro is defined when the code is compiled by the CUDA back-end, so you can use #ifdef __CUDABE__ to select the correct path.

It’s ugly and likely not supported, but much easier than hacking nvopencc…
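
For example, a minimal sketch along those lines, using the len example from the first post (this assumes the older toolchain where __CUDABE__ is defined during the device-code pass, as in the shipped headers; the MUL name is just for illustration):

#ifdef __CUDABE__
/* Device-code pass: force an explicitly rounded multiply, which ptxas will not contract into a MAD. */
#define MUL(x, y) __fmul_rn((x), (y))
#else
/* Host pass: plain multiplication. */
#define MUL(x, y) ((x) * (y))
#endif

__device__ __host__ float len(float x, float y) {
    return sqrtf(MUL(x, x) + MUL(y, y));
}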

Could you do me a favor? :)

Please, don’t call this an FMA… The FMA operator is precisely defined in the IEEE 754-2008 standard and is something quite different. From an arithmetic viewpoint, the Tesla SP MAD is definitely an unfused multiply-add (and further the multiplication is rounded toward zero). Whether it’s implemented with 1 or 2 instructions in hardware is an implementation detail…

So I suggest calling this operator an UMA instead, as in the Sun and Fujitsu product literature…

Thanks. ;)

Yes, good idea. I like UMA very much. Better than NRFMAWIRTZ (not really fused multiply add with intermediate rounding to zero)


That’s a real good one, Christian :)

+1 from me ;)