Issue using inline PTX functions, with address operands, in CUDA application - Any help much appreciated!

Hello all! I am hard at work on a CUDA-based application that uses inline PTX assembly for optimization (which is what I’m doing now: coding for speed). The program, which is in C, uses preprocessor #if/#else/#endif statements to section off code for different compute architectures. My machine has a GeForce GTX 1060, which supports compute capability 6.1, but when I put an asm() statement with PTX in it for sm_61, it sometimes doesn’t run because the check for that CUDA version fails, and execution falls back to my #else branch for sm_52 or lower. Sometimes, but not always.

If I force the PTX code for sm_61 by removing the #if check of the CUDA version, or by specifying only 6.1 on the command line, ptxas exits with this error (but ONLY if I have operands that are addresses in that asm() block):

ptxas cudakerns.compute_61.ptx, line 15101; fatal : Parsing error near ‘&’: syntax error
ptxas fatal : Ptx assembly aborted due to errors

with inline-PTX code like this:

asm volatile ( "{"
" lop3.b32 &output, &e, &d, &c, 0x96;"
" lop3.b32 [&output+4], [&e+4], [&d+4], [&c+4], 0x96;"
" lop3.b32 &output, &output, &b, &a, 0x96;"
" lop3.b32 [&output+4], [&output+4], [&b+4], [&a+4], 0x96;"
"}" );

I set “volatile” because I thought the compilers were messing with my code, but the same issue exists without it. You’ll notice my operands are addresses of C variables in the same scope. You’ll also notice I’m using offsets, here, to read/write to/from operands and the next 32 bits in the uint64_t. This is what I think is my fastest solution for handling these 64-bit values with two 32-bit instructions (also tried using vector types like uint2 but it was sort of clumsy.)

The problem persists if I put square brackets around the operands that are addresses… I’ve tried a dozen variations on this code without success. I also tried it with optimization levels 1-3 in ptxas.exe.

I was told that some instructions, if their operands are addresses, need the data at that address copied into a register first or the code won’t function. No source on this, however, so I can’t confirm if it’s actually true. How inconsistently my code runs if I use address operators instead of %0, %1, %2 and the operands after the colons makes me wonder what the rules really are on this. The PTX ISA documents were of limited help in this area: the only time they show an address being used, it’s %txid, and they never explain it. It doesn’t seem to be a C variable or an address. I’m hoping someone here can shed light on what might be causing these two weird behaviors:

  • the misbehavior of PTX code with addresses as operands and
  • why CUDA version checks might be failing inconsistently like this.

Thanks very much for any help! I’m continuing to research this issue, but I really hope someone here can come to the rescue if I come up short. :)

Oh, I should also mention: this very similar function DOES compile without the syntax error. Trying to verify that it is actually running/not being inhibited by CUDA version checks, but seems fine… Yet the one in the OP, no luck. I can’t seem to spot any syntactical difference that is tripping ptxas up. Thanks.

__device__ __forceinline__
uint64_t altFunc( uint64_t a, uint64_t b, uint64_t c )
{
printf ("CUDA__ARCH ");

#if __CUDA_ARCH__ >= 500 && CUDA_VERSION >= 7050
uint64_t output = 0;
asm( "{"
" lop3.b32 [&output], [&a], [&b], [&c], 0xD2;"
" lop3.b32 [&output+4], [&a+4], [&b+4], [&c+4], 0xD2;"
"}" );
return output;
#else
return a ^ ((~b) & c);
#endif
}

Since you choose the code with the compile-time preprocessor, the code chosen depends on the compilation flags rather than on your actual hardware. So you should explicitly compile for 6.1 in order to get this code enabled. Fortunately, nvcc can compile for multiple targets, and the best matching code for the GPU model is then selected automatically at run time. Check the docs or wait for other answers :)
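The compile-time nature of the check can be made visible with a small sketch. The kernel name `report_arch` and the helper `compiled_arch_in_use` are hypothetical, introduced only for illustration; the point is that `__CUDA_ARCH__` is fixed when each device image is built, so the value reported depends on the `-gencode` flags passed to nvcc, not only on the installed card.

```cuda
// Build for both targets, for example:
//   nvcc -gencode arch=compute_52,code=sm_52 \
//        -gencode arch=compute_61,code=sm_61 arch_check.cu
#include <cuda_runtime.h>

// Records which architecture branch the running device code was built for.
__global__ void report_arch(int *out)
{
#ifdef __CUDA_ARCH__
    *out = __CUDA_ARCH__;   // e.g. 520 or 610, fixed at compile time
#else
    *out = 0;               // host compilation pass; never runs on the GPU
#endif
}

// Hypothetical helper: returns the __CUDA_ARCH__ value of the code image
// selected for the current GPU, or -1 on any CUDA error.
int compiled_arch_in_use()
{
    int *d_out = nullptr;
    int h_out = -1;
    if (cudaMalloc(&d_out, sizeof(int)) != cudaSuccess) return -1;
    report_arch<<<1, 1>>>(d_out);
    cudaError_t err = cudaGetLastError();   // catches "no kernel image" etc.
    if (err == cudaSuccess)
        err = cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return (err == cudaSuccess) ? h_out : -1;
}
```

If the binary contains only an sm_52 image, a compute 6.1 card would be expected to run that image and take the `#else` path of any `__CUDA_ARCH__ >= 610` guard.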

Thanks for the response! To clarify, I am using command line Israelites to compile for 61 and 52 for compatibility reasons. Going to use CUDA functions to detect hardware capability. Puzzled why syntax errors seem to come up for one version and not the other. Also checking what version lop3 actually needs.
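For the runtime side, a minimal sketch of querying the hardware capability with `cudaGetDeviceProperties` could look like the following. The helper name `device_compute_capability` and its return convention are assumptions made for this example. Note that this reports what the card supports; it does not change which `#if __CUDA_ARCH__` branch was compiled into the kernels.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: compute capability as major*10 + minor
// (61 for compute 6.1), or -1 if the query fails.
int device_compute_capability(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return -1;
    return prop.major * 10 + prop.minor;
}
```

A host-side caller could compare the result against 61 before launching a kernel variant that was built only for sm_61.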

Parameters not Israelites. Thanks autocorrect

Have you read the inline PTX documentation?

Because your asm functions don’t appear to be adhering to the recommended format at all.

I did, otherwise I’d have had no idea how to get started with this. The area I’m asking about doesn’t appear to be covered: specifically, using addresses rather than C variables as operands. The examples NVIDIA gives are generally in the format asm( "my instructions here" : output (=) and output/input (+) operands : input operands ); whereas I am omitting the operand lists after the instruction string, and the %n placeholders that refer to an operand within the quotes, in favor of specifying addresses directly. Is that what you were referring to?..
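For comparison, here is a sketch of the five-input XOR written in the documented operand format. The text inside the quotes is passed to ptxas as-is, so a C expression such as `&output` is not valid there; values enter and leave through the constraint lists, and `lop3.b32` works on 32-bit register operands. The names `xor5`, `xor5_kernel` and `xor5_on_device` are hypothetical, and the sketch assumes a target of sm_50 or newer for the PTX path.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// XOR of five 64-bit values using lop3.b32 on register operands.
__device__ __forceinline__
uint64_t xor5(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 500
    uint64_t output;
    asm("{\n\t"
        ".reg .b32 a_lo, a_hi, b_lo, b_hi, c_lo, c_hi;\n\t"
        ".reg .b32 d_lo, d_hi, e_lo, e_hi, o_lo, o_hi;\n\t"
        "mov.b64 {a_lo, a_hi}, %1;\n\t"   // split each input into 32-bit halves
        "mov.b64 {b_lo, b_hi}, %2;\n\t"
        "mov.b64 {c_lo, c_hi}, %3;\n\t"
        "mov.b64 {d_lo, d_hi}, %4;\n\t"
        "mov.b64 {e_lo, e_hi}, %5;\n\t"
        "lop3.b32 o_lo, e_lo, d_lo, c_lo, 0x96;\n\t"   // 0x96 = three-input XOR
        "lop3.b32 o_hi, e_hi, d_hi, c_hi, 0x96;\n\t"
        "lop3.b32 o_lo, o_lo, b_lo, a_lo, 0x96;\n\t"
        "lop3.b32 o_hi, o_hi, b_hi, a_hi, 0x96;\n\t"
        "mov.b64 %0, {o_lo, o_hi};\n\t"   // repack the halves into the result
        "}"
        : "=l"(output)
        : "l"(a), "l"(b), "l"(c), "l"(d), "l"(e));
    return output;
#else
    return a ^ b ^ c ^ d ^ e;             // plain C fallback for older targets
#endif
}

__global__ void xor5_kernel(uint64_t a, uint64_t b, uint64_t c,
                            uint64_t d, uint64_t e, uint64_t *out)
{
    *out = xor5(a, b, c, d, e);
}

// Hypothetical host wrapper: runs xor5 on the GPU; returns false on CUDA error.
bool xor5_on_device(uint64_t a, uint64_t b, uint64_t c,
                    uint64_t d, uint64_t e, uint64_t *result)
{
    uint64_t *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(uint64_t)) != cudaSuccess) return false;
    xor5_kernel<<<1, 1>>>(a, b, c, d, e, d_out);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(result, d_out, sizeof(uint64_t), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Keeping the operands in registers also avoids taking the address of a local variable, which generally forces the compiler to place that variable in local memory.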

I’ll read over the document again, but I’m really not seeing what I am doing wrong- which is why I asked. Thanks.

Oh, if you were referring to the second function I posted, I did not write that one. I just had some input on it while trying to figure out the correct syntax, so I can’t say much about it. But it does work as intended, addresses, offsets and all. Why the first one doesn’t also work, and gives that syntax error about &, is alien to me.

Apparently the “correct” way (I had to ask someone; I could not find it in the 75+ pages of NVIDIA documentation I’ve read trying to understand PTX better) is to put something like

… : "l"(&myCVar) );

But trying to do an offset in the PTX like so: [%0+4] … for four bytes offset doesn’t seem to work. I repeat: the second function I posted DOES work, addresses, offsets and all even though I can’t seem to find documentation supporting that formatting…
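A hedged sketch of where a `[%n+4]` operand can be used: PTX accepts bracketed address operands with an immediate offset on memory instructions such as `ld` and `st`, while `lop3` takes register operands, so the data has to be loaded first. The helper names below are hypothetical, and the sketch assumes 64-bit pointers (hence the "l" constraint).

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Loads the upper 32 bits of a 64-bit value through its address.
// ld without a state space uses generic addressing.
__device__ __forceinline__ uint32_t load_high_word(const uint64_t *p)
{
    uint32_t hi;
    asm volatile("ld.u32 %0, [%1+4];" : "=r"(hi) : "l"(p) : "memory");
    return hi;
}

__global__ void high_word_kernel(const uint64_t *in, uint32_t *out)
{
    *out = load_high_word(in);
}

// Hypothetical host wrapper: returns false on any CUDA error.
bool high_word_on_device(uint64_t value, uint32_t *result)
{
    uint64_t *d_in = nullptr;
    uint32_t *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(uint64_t)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(uint32_t)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaError_t err = cudaMemcpy(d_in, &value, sizeof(uint64_t),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        high_word_kernel<<<1, 1>>>(d_in, d_out);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(result, d_out, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

As for the second function compiling: one possible explanation, not confirmed anywhere in this thread, is that its `#if` guard evaluated to false in that build, so the asm block was never handed to ptxas.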