I’d like to declare a host version of __fmul_rn to test some functions that should work both on the host and the device. Are there preprocessor defines that can be used to determine whether host or device code is being compiled? Is there another way to do this?
EDIT: In general, I’m looking for a way to have functions that are both host and device functions but do slightly different things depending on whether they are host or device functions. (e.g. a * b on the host and __fmul_rn(a, b) on the device).
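Roughly what I have in mind is a single function usable from both sides (mul_rn is just an illustrative name; as written this only works in device code, since __fmul_rn has no host implementation):

    // What I'm after, roughly -- one function for both host and device.
    // As written it only compiles for the device, because __fmul_rn
    // does not exist on the host:
    __host__ __device__ float mul_rn(float a, float b)
    {
        return __fmul_rn(a, b);  // OK on the device, unresolved on the host
    }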
I don’t think you can do that, as a matter of principle. Your function is declared either __host__ or __device__, and that’s it. Whether you compile in Release or Emulation mode, you’ll still get errors like “can’t call a __device__ function from __host__ code”. You’ll need two functions. (A related question: can you declare two functions with the same name but different qualifiers, and have the compiler automatically pick the right one?)
Now if you’re talking about macros… those aren’t checked by the compiler for being device or host. But their expansion also can’t contain #ifdefs to do fancy things, afaik.
If you only need this for floating-point multiply, I guess you can try to “hack” the compiler (well, this is not a proper solution in all cases, but…).
The point is that normally a * b generates a ‘mul.f32’ assembly instruction (i.e. without a rounding modifier), and the PTX manual says that:
“A mul instruction with no rounding modifier defaults to round-to-nearest-even and may be optimized aggressively by the code optimizer”
In other words, it can be fused into a single mad.rn…
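To make that concrete, here is an illustration at the source level (sm_1x behavior; the exact assembly is of course up to ptxas):

    float r = a * b + c;            // ptxas may contract this into a single
                                    // mad.f32 (truncated intermediate product)
    float s = __fmul_rn(a, b) + c;  // explicit rounding mode: stays a
                                    // mul.rn.f32 followed by an add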
To avoid this you could alter the nvopencc compiler so that it expands a single-precision floating-point multiplication to, say, mul.rn.f32 (i.e. with an explicit rounding mode); then no such optimization will take place in the ‘ptxas’ phase.
Or, alternatively, you could preprocess the generated PTX sources (prior to assembly by ptxas) to replace all occurrences of ‘mul.f32’ with ‘mul.rn.f32’. I don’t know whether this is easily realizable, but it probably makes sense to try — see the sketch below.
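A minimal sketch of that text replacement, assuming you keep the intermediate .ptx file around (e.g. via nvcc -ptx or --keep); how to wire it into your build is up to you:

    #include <fstream>
    #include <sstream>
    #include <string>

    // Minimal PTX rewriter: replaces every 'mul.f32' (no rounding
    // modifier) with 'mul.rn.f32' so ptxas can no longer contract the
    // multiply into a mad. Rewrites the given .ptx file in place.
    int main(int argc, char** argv)
    {
        if (argc < 2) return 1;

        std::ifstream in(argv[1]);
        std::stringstream buf;
        buf << in.rdbuf();
        in.close();
        std::string ptx = buf.str();

        const std::string from = "mul.f32", to = "mul.rn.f32";
        for (std::string::size_type pos = 0;
             (pos = ptx.find(from, pos)) != std::string::npos;
             pos += to.size())
            ptx.replace(pos, from.size(), to);

        std::ofstream out(argv[1]);
        out << ptx;
        return 0;
    }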
You can have a look at how it is done in the CUDA headers in the cuda/include directory. The __CUDABE__ macro is defined when the code is compiled by the CUDA back-end, so you can use #ifdef __CUDABE__ to select the correct path.
It’s ugly and likely not supported, but much easier than hacking nvopencc…
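Something along these lines (a sketch only — mul_rn is an illustrative name, and __CUDABE__ is an internal macro that may change between toolkit versions):

    // The __CUDABE__ trick: one function, two code paths.
    __host__ __device__ float mul_rn(float a, float b)
    {
    #ifdef __CUDABE__
        // Device compilation pass: force round-to-nearest-even,
        // which also prevents contraction into a mad.
        return __fmul_rn(a, b);
    #else
        // Host compilation pass: plain multiply.
        return a * b;
    #endif
    }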
Could you do me a favor? :)
Please don’t call this an FMA… The FMA operator is precisely defined in the IEEE 754-2008 standard and is something quite different. From an arithmetic viewpoint, the Tesla SP MAD is definitely an unfused multiply-add (and, further, the multiplication is rounded toward zero). Whether it’s implemented with 1 or 2 instructions in hardware is an implementation detail…
So I suggest calling this operator a UMA instead, as in the Sun and Fujitsu product literature…