I’d like to declare a host version of __fmul_rn to test some functions that should work both on the host and the device. Are there preprocessor defines that can be used to determine whether host or device code is being compiled? Is there another way to do this?
EDIT: In general, I’m looking for a way to have functions that are both host and device functions but do slightly different things depending on whether they are host or device functions. (e.g. a * b on the host and __fmul_rn(a, b) on the device).
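Roughly what I have in mind is a single function usable from both sides (mul_rn is just an illustrative name; as written this only works in device code, since __fmul_rn has no host implementation):

    // What I'm after, roughly -- one function for both host and device.
    // As written it only compiles for the device, because __fmul_rn
    // does not exist on the host:
    __host__ __device__ float mul_rn(float a, float b)
    {
        return __fmul_rn(a, b);  // OK on the device, unresolved on the host
    }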
I don’t think you can do that, as a matter of principle. Your function is declared either __host__ or __device__, and that’s it. Whether you compile in Release or Emulation mode, you’ll still get errors like “can’t call a __device__ function from __host__ code”. You’ll need two functions. (A related question: can you declare two functions with the same name but different qualifiers, and have the compiler automatically pick the right one?)
Now if you’re talking about macros… those aren’t checked by the compiler for being device or host. But their expansion also can’t contain #ifdefs to do fancy things, afaik.
If you only need this for floating-point multiply, I guess you can try to “hack” the compiler (well, this is not a proper solution in all cases, but…).
The point is that normally a * b generates a ‘mul.f32’ assembly instruction (i.e. without a rounding modifier), and the PTX manual says that:
“A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer”
In other words, it can be fused into a single mad.rn…
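To make that concrete, here is an illustration at the source level (sm_1x behavior; the exact assembly is of course up to ptxas):

    float r = a * b + c;            // ptxas may contract this into a single
                                    // mad.f32 (truncated intermediate product)
    float s = __fmul_rn(a, b) + c;  // explicit rounding mode: stays a
                                    // mul.rn.f32 followed by an add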
To avoid this you could alter the nvopencc compiler so that it expands a single-precision floating-point multiplication to, say, mul.rn.f32 (i.e. with an explicit rounding mode); then no such optimization will take place in the ‘ptxas’ phase.
Or, alternatively, you could preprocess the generated PTX sources (prior to assembly by ptxas) to replace all occurrences of ‘mul.f32’ with ‘mul.rn.f32’. I don’t know whether this is easily realizable, but it probably makes sense to try — see the sketch below.
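A minimal sketch of that text replacement, assuming you keep the intermediate .ptx file around (e.g. via nvcc -ptx or --keep); how to wire it into your build is up to you:

    #include <fstream>
    #include <sstream>
    #include <string>

    // Minimal PTX rewriter: replaces every 'mul.f32' (no rounding
    // modifier) with 'mul.rn.f32' so ptxas can no longer contract the
    // multiply into a mad. Rewrites the given .ptx file in place.
    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;

        std::ifstream in(argv[1]);
        std::stringstream buf;
        buf << in.rdbuf();
        in.close();
        std::string ptx = buf.str();

        const std::string from = "mul.f32", to = "mul.rn.f32";
        for (std::string::size_type pos = 0;
             (pos = ptx.find(from, pos)) != std::string::npos;
             pos += to.size())
            ptx.replace(pos, from.size(), to);

        std::ofstream out(argv[1]);
        out << ptx;
        return 0;
    }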
You can have a look at how it is done in the CUDA headers in the cuda/include directory. The __CUDABE__ macro is defined when the code is compiled by the CUDA back-end, so you can use #ifdef __CUDABE__ to select the correct path.
It’s ugly and likely not supported, but much easier than hacking nvopencc…
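Something along these lines (a sketch only — mul_rn is an illustrative name, and __CUDABE__ is an internal macro that may change between toolkit versions):

    // The __CUDABE__ trick: one function, two code paths.
    __host__ __device__ float mul_rn(float a, float b)
    {
    #ifdef __CUDABE__
        // Device compilation pass: force round-to-nearest-even,
        // which also prevents contraction into a mad.
        return __fmul_rn(a, b);
    #else
        // Host compilation pass: plain multiply.
        return a * b;
    #endif
    }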
Could you do me a favor? :)
Please don’t call this an FMA… The FMA operator is precisely defined in the IEEE 754-2008 standard and is something quite different. From an arithmetic viewpoint, the Tesla SP MAD is definitely an unfused multiply-add (and, further, the multiplication is rounded toward zero). Whether it’s implemented with 1 or 2 instructions in hardware is an implementation detail…
So I suggest calling this operator a UMA instead, as in the Sun and Fujitsu product literature…