Hello all! I am hard at work on a CUDA-based application that uses inline PTX assembly for optimization (which is what I'm doing now: coding for speed). The program, which is in C, uses preprocessor #if/#else/#endif blocks to section off code for different compute architectures. My machine has a GeForce GTX 1060, which supports compute capability 6.1, but whenever I put an asm() statement containing PTX for sm_61 in the code, it often doesn't run: the check for that CUDA version fails and the build falls through to my #else branch for sm_52 or lower. This happens sometimes, but not always.
If I force the PTX code for sm_61, either by removing the #if check of the CUDA version or by specifying only 6.1 on the command line, ptxas exits with this error (but ONLY if I have operands that are addresses in that asm() block):
ptxas cudakerns.compute_61.ptx, line 15101; fatal : Parsing error near '&': syntax error
ptxas fatal : Ptx assembly aborted due to errors
cudakerns.cu
with inline-PTX code like this:
asm volatile ( "{"
" lop3.b32 &output, &e, &d, &c, 0x96;"
" lop3.b32 [&output+4], [&e+4], [&d+4], [&c+4], 0x96;"
" lop3.b32 &output, &output, &b, &a, 0x96;"
" lop3.b32 [&output+4], [&output+4], [&b+4], [&a+4], 0x96;"
"}" );
I set "volatile" because I thought the compilers were messing with my code, but the same issue exists without it. You'll notice my operands are addresses of C variables in the same scope. You'll also notice I'm using +4 offsets to read and write the upper 32 bits of each uint64_t. I think this is my fastest way to handle these 64-bit values with two 32-bit instructions (I also tried vector types like uint2, but that was rather clumsy).
The problem persists if I put square brackets [ ] around the operands that are addresses. I've tried a dozen variations on this code without success. I also tried optimization levels 1-3 in ptxas.exe.
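For comparison, here is a minimal sketch of the constraint-based form, where C variables are passed through the operand lists after the colons rather than named inside the string. As far as I understand it, the asm() string is emitted into the PTX file as-is apart from the %0, %1, ... substitutions, so a token like &output reaches ptxas unchanged, which would fit the "Parsing error near '&'" message. The helper names xor3_32 and xor5_64 are hypothetical, the host fallback and the __CUDACC__ shim are only there so the sketch also builds with a plain C++ compiler, and I'm assuming the 0x96 lookup value is meant as a three-way XOR (0xF0 ^ 0xCC ^ 0xAA = 0x96).

```cpp
#include <cstdint>

// Shim (assumption): nvcc defines __CUDACC__; a plain host compiler does not,
// so the CUDA function qualifiers are defined away for host-only builds.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: 32-bit three-way XOR, r = a ^ b ^ c.
__host__ __device__ inline uint32_t xor3_32(uint32_t a, uint32_t b, uint32_t c)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 500)
    uint32_t r;
    // "=r" is a 32-bit output register, "r" a 32-bit input register.
    // The compiler replaces %0..%3 with PTX register names, so no C
    // identifiers or '&' ever appear in the generated PTX.
    asm("lop3.b32 %0, %1, %2, %3, 0x96;"
        : "=r"(r)
        : "r"(a), "r"(b), "r"(c));
    return r;
#else
    // Host pass (or a device target older than sm_50): plain C fallback.
    return a ^ b ^ c;
#endif
}

// Hypothetical helper mirroring the four lop3 lines above:
// output = e ^ d ^ c, then output = output ^ b ^ a, done per 32-bit half.
__host__ __device__ inline uint64_t xor5_64(uint64_t a, uint64_t b, uint64_t c,
                                            uint64_t d, uint64_t e)
{
    uint32_t lo = xor3_32((uint32_t)e, (uint32_t)d, (uint32_t)c);
    uint32_t hi = xor3_32((uint32_t)(e >> 32), (uint32_t)(d >> 32),
                          (uint32_t)(c >> 32));
    lo = xor3_32(lo, (uint32_t)b, (uint32_t)a);
    hi = xor3_32(hi, (uint32_t)(b >> 32), (uint32_t)(a >> 32));
    return ((uint64_t)hi << 32) | lo;
}
```

Inside a kernel this would be called as output = xor5_64(a, b, c, d, e); the halves are split with ordinary shifts and casts, leaving register allocation to the compiler instead of addressing the upper word with +4.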
I was told that some instructions, if their operands are addresses, need the data at that address copied into a register first or the code won't function. I have no source for this, however, so I can't confirm whether it's actually true. How inconsistently my code runs when I use address operators instead of %0, %1, %2 and the operand lists after the colons makes me wonder what the rules really are. The PTX ISA documents were of limited help here: the only time they show something like an address being used, it's %tid.x, and they never explain it. It doesn't seem to be a C variable or an address. I'm hoping someone here can shed some light on what might be causing these two weird behaviors:
- the misbehavior of PTX code with addresses as operands and
- why CUDA version checks might be failing inconsistently like this.
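On the second point, here is a small sketch of the kind of check involved, in case it helps anyone reproduce the behavior. It relies on two things I'm fairly sure of: __CUDA_ARCH__ is an integer such as 610 for compute 6.1 (not 6.1 or 61), and it is only defined while device code is being compiled, so the host pass and any pass for a lower architecture take the #else branch. Whether that explains my inconsistent results is only a guess. The helper names compiled_arch and use_sm61_path are hypothetical, and the __CUDACC__ shim is only there so the sketch also builds with a plain C++ compiler.

```cpp
// Shim (assumption): nvcc defines __CUDACC__; a plain host compiler does not.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical helper: reports which architecture the current
// compilation pass is targeting.
__host__ __device__ inline int compiled_arch()
{
#if defined(__CUDA_ARCH__)
    return __CUDA_ARCH__;   // e.g. 610 in the compute_61 pass, 520 for compute_52
#else
    return 0;               // host pass: __CUDA_ARCH__ is not defined here
#endif
}

// Hypothetical helper: true only in a device pass targeting sm_61 or newer.
__host__ __device__ inline bool use_sm61_path()
{
    return compiled_arch() >= 610;
}
```

When a file is built for several targets at once, each target gets its own pass with its own __CUDA_ARCH__ value, so the same #if can come out differently in each pass.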
Thanks very much for any help! I’m continuing to research this issue, but I really hope someone here can come to the rescue if I come up short. :)