I’m trying to hand-code, in PTX, a function for my CUDA project. It needs to add two numbers of arbitrary precision (with carry) using the addc.cc assembler operation (page 41, PTX ISA Documentation 1.2), since I can’t see this operation exposed at any level higher than PTX.
From my limited understanding of PTX so far, one of the best ways to do this is to write a skeleton device function that implements a basic add, generate the corresponding PTX file with nvcc, hand-edit it to change the relevant add operations to addc.cc, and then proceed with the compilation.
Surely there must be an easier way to do this? Is there no way to access add with carry in higher-level CUDA?
We’ve been groaning about inline assembly since forever.
Is adding two numbers all your kernel does? If so, it’s not too hard to write a PTX kernel and link it in. There’s an automatic run-time linking facility (i.e., the device code repository) that I explained here: http://forums.nvidia.com/index.php?act=ST&f=71&t=44562 and that is mentioned in the docs. That was a year ago, but hopefully it still applies. If this is part of a larger kernel, then it’d probably be too much to rewrite and maintain it as assembly. You’ll just have to juggle uint64s, I’d guess.
P.S. If you write your own PTX kernel, start from scratch. Look at nvcc -ptx output to get a feel for how it’s done, but then do it yourself. If you start with compiler-generated code you’ll have a mess to work with.
Sadly, no. I need to write a device function that will be called several times from a big kernel to add two arbitrarily sized numbers together (e.g., two 512-bit numbers). I wish to implement this by breaking the addition down into 32-bit pieces and using add with carry.
Any other ideas? I’m pretty sure using PTX in the first place is going to be a pain. I can no longer work with the comfortable make clean;make;execute routine.
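For reference, the 32-bit breakdown described above can be sketched without PTX by detecting the carry with an unsigned comparison instead of addc.cc. This is only a sketch under assumptions: the name add_limbs_cmp, the HD macro, and the least-significant-limb-first layout are made up for illustration, and nothing here guarantees which instructions the compiler emits.

```cpp
#include <cstdint>

// Compiles as plain C++ on the host, or as a device function under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: r = a + b over n 32-bit limbs, least significant
// limb first. Returns the carry out of the top limb (0 or 1).
HD inline uint32_t add_limbs_cmp(uint32_t *r, const uint32_t *a,
                                 const uint32_t *b, int n)
{
    uint32_t carry = 0;
    for (int i = 0; i < n; ++i) {
        uint32_t s  = a[i] + b[i];   // wraps modulo 2^32
        uint32_t c1 = (s < a[i]);    // wrapped, so a carry was produced
        uint32_t t  = s + carry;     // fold in the carry from the limb below
        uint32_t c2 = (t < s);       // wrapped again only if s was 0xFFFFFFFF
        r[i] = t;
        carry = c1 | c2;             // c1 and c2 cannot both be 1
    }
    return carry;
}
```

A 512-bit add would call this with n = 16. Compared with a true add-with-carry chain, it spends a few extra compare instructions per limb.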
Oh come on. You can put make clean, make, execute and any other build steps into a shell script and even save yourself typing.
Anyway, emulating the carry bit using 64-bit integers should work out OK. The compiler will probably use the add-with-carry instruction anyway, and you’re looking at only maybe a 2x slowdown. That shouldn’t be bad; it’s a fast operation anyway.
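A minimal sketch of that 64-bit emulation, assuming the numbers are stored as arrays of 32-bit limbs with the least significant limb first. The name add_limbs_u64 and the HD macro are hypothetical, and whether the compiler actually turns this into add-with-carry instructions is not something the code can promise.

```cpp
#include <cstdint>

// Compiles as plain C++ on the host, or as a device function under nvcc.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: r = a + b over n 32-bit limbs, least significant
// limb first. The carry lives in the upper half of a 64-bit accumulator.
// Returns the carry out of the top limb (0 or 1).
HD inline uint32_t add_limbs_u64(uint32_t *r, const uint32_t *a,
                                 const uint32_t *b, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; ++i) {
        // acc holds the incoming carry (0 or 1); the sum fits in 33 bits.
        acc += (uint64_t)a[i] + b[i];
        r[i] = (uint32_t)acc;   // low 32 bits are the result limb
        acc >>= 32;             // high bit becomes the carry for the next limb
    }
    return (uint32_t)acc;
}
```

For two 512-bit numbers, n would be 16, and a nonzero return value signals overflow past the top limb.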