nvopencc backend with 'addc' intrinsics: new builtins for the 32-bit CUDA compiler under Linux

Hi,

let me introduce my experimental nvopencc compiler with addc intrinsics support ;)
ATTENTION: this is for 32-bit Linux users !!

New features are summarized as follows:

  • all 32-bit integer add instructions (signed/unsigned) are replaced by their “carry-out” versions, i.e. add.u32/s32 become add.cc.u32/s32
    (otherwise no carry flag is produced, so a following add-with-carry would have nothing to consume)
  • two new builtins are added: __addc( a, b ) and __uaddc( a, b ) for signed and unsigned addition-with-carry respectively - they are mapped directly to the addc.cc.s32/u32 instructions
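
To illustrate what the carry chain buys you, here is a rough host-side C++ sketch of a 128-bit addition built from 32-bit limbs. It is only a model: `CarryFlag`, `add_cc`, `addc_cc` and `add128` are hypothetical names made up for this example, and the carry is kept in an explicit variable, whereas on the GPU it lives in a hardware condition-code flag. In a kernel built with this compiler, the first step would presumably be a plain `+` (emitted as add.cc.u32) and each following step a call to __uaddc( a, b ).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for the carry flag that
// add.cc / addc.cc keep in a hardware condition-code register.
struct CarryFlag { uint32_t cf = 0; };

// Models add.cc.u32: computes a + b, ignores carry-in, records carry-out.
inline uint32_t add_cc(CarryFlag& f, uint32_t a, uint32_t b) {
    uint32_t s = a + b;            // wraps modulo 2^32
    f.cf = (s < a) ? 1u : 0u;      // wrap-around means a carry was produced
    return s;
}

// Models addc.cc.u32: computes a + b + carry-in, records carry-out.
inline uint32_t addc_cc(CarryFlag& f, uint32_t a, uint32_t b) {
    uint64_t wide = static_cast<uint64_t>(a) + b + f.cf;
    f.cf = static_cast<uint32_t>(wide >> 32);
    return static_cast<uint32_t>(wide);
}

// 128-bit addition on four 32-bit limbs, least significant limb first.
inline void add128(const uint32_t a[4], const uint32_t b[4], uint32_t out[4]) {
    CarryFlag f;
    out[0] = add_cc(f, a[0], b[0]);        // starts the carry chain
    for (int i = 1; i < 4; ++i)
        out[i] = addc_cc(f, a[i], b[i]);   // propagates the carry upward
}

// Small usage check: (2^64 - 1) + 1 must carry through two limbs.
inline void check_add128() {
    const uint32_t a[4] = {0xFFFFFFFFu, 0xFFFFFFFFu, 0u, 0u};
    const uint32_t b[4] = {1u, 0u, 0u, 0u};
    uint32_t r[4];
    add128(a, b, r);
    assert(r[0] == 0u && r[1] == 0u && r[2] == 1u && r[3] == 0u);
}
```

Without a carry-out on the first addition, each limb would need an extra comparison to recover the carry, which is why the plain adds are replaced as well.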

I also wanted to add parallel reductions, but managing global/shared memory pointers involves some subtleties which I do not quite
understand yet, so maybe this will be added later.

Although I tested this compiler with my kernels and it works well, please beware that these features are still fully experimental,
so if you want to try it out, do it at your own risk!

There is also one major drawback: for some reason the open64 trunk version does not expand floating-point divisions,
so attempting to use a floating-point division triggers an assertion,
something like: “Floating-point division is not yet implemented…”
On the other hand, the GPU does not have native division, so it was implemented as a slow local subroutine…

Anyway, if you want to try this out, installation is very simple: unpack the NVOPENCC archive,
copy ‘be’ (compiler back-end), ‘gfec’ (gcc front-end) and ‘inline’ into /path/to/cuda/open64/lib,
copy ‘nvopencc’ into /path/to/cuda/open64/bin, and include ‘ext_intrinsics.h’ from your code or from some library file.

One last step: add ‘/path/to/cuda/open64/lib’ to your PATH variable (I haven’t completely figured out how nvcc locates the different compilation
phases, so this workaround is required). Detailed instructions can also be found in the archive.

Suggestions/comments are welcome!

Interesting project!

Well, actually it does have native floating-point division, just not integer division. It’s a sequence of two instructions, rcp then mul. But if you use PTX, you can use the PTX divide instruction, right? Or are you producing cubin directly?
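
Just to make the reciprocal-then-multiply idea concrete, here is a tiny host-side C++ sketch. The function name `div_rcp_mul` is made up for this example, and `1.0f / b` only stands in for the hardware reciprocal, so this shows the shape of the sequence rather than what the compiler actually emits.

```cpp
#include <cassert>

// Division expressed as reciprocal followed by multiply.
// Hypothetical helper: "1.0f / b" models the reciprocal step on the host.
inline float div_rcp_mul(float a, float b) {
    float r = 1.0f / b;   // reciprocal step
    return a * r;         // multiply step
}

// Small usage check with values that are exact in binary floating point.
inline void check_div_rcp_mul() {
    assert(div_rcp_mul(6.0f, 2.0f) == 3.0f);
    assert(div_rcp_mul(1.0f, 4.0f) == 0.25f);
}
```

Because the reciprocal is rounded before the multiply, the result can differ slightly from a correctly rounded a / b for general inputs.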

ok, you are right, I was misled by integer division…

Yes, like the normal nvopencc, it outputs PTX first, so in principle it should simply emit div.f32/f64.

But the point is that I don’t know the correct way to expand expressions with floating-point divisions in open64 terms: for some reason this part was not implemented in the trunk version. I can only do it by analogy with the already existing routines, and that involves a bit of luck & intuition ;)