MOV Confusion

With reference to the MOV instruction, “9.7.8.3. Data Movement and Conversion Instructions: mov - Move vector-to-scalar (pack) or scalar-to-vector (unpack).” in PTX ISA v6.5.

In an effort to directly use the upper and lower bytes in a uint16_t, I am attempting the following:

uint16_t a[128] = {....};
uint16_t b,c;
uint32_t d[8];

static __device__ __forceinline__ uint16_t unpack(uint16_t a) {
            uint16_t b;
            asm("mov.b16 {x,y}, %1 ;" : "=h"(b) : "h"(a));
            return b;
}


b = a[15];
c = unpack(b);
d[0] = d[1] ^ c.x;
d[1] = d[2] ^ c.y;

and at this point the last two lines throw up, “expression must have class type” errors.
I’ve tried using uchar2 for c and in the appropriate places in unpack(), but that doesn’t seem to be legal.

Have I made a basic syntax error here, or is this not possible?

I realise I can use:

d[0] = d[1] ^ (b >> 8);
d[1] = d[2] ^ (b & 0xFF);

but I’m trying to see if using MOV is faster. I am very much new to C/C++ if that’s any excuse for a glaring error here…

Thanks.

To the best of my knowledge, GPU hardware provides no direct way to access the individual bytes in a 32-bit register. You can extract them with PRMT (byte permute), BFE (bit-field extract), or shift & mask.
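As a sketch of the PRMT route (my addition, not from the thread): CUDA exposes the byte-permute instruction through the `__byte_perm()` intrinsic. Each selector nibble picks one byte from the 64-bit pair {y, x}; with y = 0, selector nibble 4 yields a zero byte, so a single byte of the halfword can be isolated. This needs a GPU to run.

```cuda
#include <cstdio>
#include <cstdint>

// Sketch: extracting the two bytes of a uint16_t with the PRMT-backed
// __byte_perm(x, y, s) intrinsic. Selector nibble i chooses the source
// byte for output byte i; bytes 0-3 come from x, bytes 4-7 from y.
__global__ void extract_bytes (uint16_t hw)
{
    uint32_t w  = hw;                          // zero-extended into a 32-bit register
    uint32_t lo = __byte_perm (w, 0, 0x4440);  // {0, 0, 0, byte 0 of w}
    uint32_t hi = __byte_perm (w, 0, 0x4441);  // {0, 0, 0, byte 1 of w}
    printf ("lo = %02x, hi = %02x\n", lo, hi); // for 0xaa55: lo = 55, hi = aa
}

int main (void)
{
    extract_bytes<<<1,1>>>(0xaa55);
    cudaDeviceSynchronize();
    return 0;
}
```

Whether this beats plain shift & mask is doubtful, as noted above; all three extraction methods tend to have the same throughput.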

In your example code, ‘x’ and ‘y’ are undefined. At minimum you would have to declare them as .u8 variables, if that is supported. Bindings for inline assembly always start with %0 for the first bound object in text order.

Thanks. I’m just about to go out, so I don’t have time to carry on right now, but I will observe that the compiler (Visual Studio) isn’t giving me any errors related to the inline asm, and going by the instruction description I’d have thought that x and y were defined as the upper and lower bytes of the operand. Otherwise I’m not sure what this instruction actually does.

Byte permute, BFE, and shift & mask all give the same performance, so I wanted to give this a try.

Regards.

You can certainly use uchar2 to do what you want, but if you look at the generated code, it will likely show shift and mask being used to extract the bytes, as in this example built for sm_61.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

__global__  void kernel (uint16_t halfword)
{
    union {
        uint16_t hw;
        uchar2 bytes;
    } cvt;
    cvt.hw = halfword ^ 0x1357;
    printf ("the low  byte is: %02x\n", cvt.bytes.x);
    printf ("the high byte is: %02x\n", cvt.bytes.y);
}

int main (void)
{
    kernel<<<1,1>>>(0xaa55);
    cudaDeviceSynchronize();
    return EXIT_SUCCESS;
}
/*0078*/                   LOP32I.AND R0, R3, 0xff;        // extract low byte

/*0088*/         {         MOV R6, R16;
/*0090*/                   STL [R17], R0;        }
/*0098*/                   MOV R7, R2;

/*00a8*/                   SHR.U32 R18, R8, 0x8;           // extract high byte (step 1)
/*00b0*/                   JCAL 0x0;
/*00b8*/                   LOP32I.AND R0, R18, 0xffff;     // extract high byte (step 2)

/*00c8*/         {         MOV R6, R16;
/*00d0*/                   STL [R17], R0;        }
/*00d8*/                   MOV R7, R2;

Thanks again for your effort. It looks like this will be worse.

Unsurprisingly, I’ve had no luck building an asm() inline that uses the MOV instruction. I guess it’s my oversimplistic interpretation of the description, “For bit-size types, mov may be used to pack vector elements into a scalar register or unpack sub-fields of a scalar register into a vector.”

The image I get is that it does a conversion from one to the other. Am I wrong in thinking a vector register allows direct access to its components, x and y?

Thinking that the destination may need to be an existing vector type, if I try using uchar2 inside the asm, the compiler tells me that “asm operands can only be scalar”, which is somewhat perplexing, as the instruction is dealing with a scalar and a vector. Perhaps this means the instruction cannot be used inline?

It’s certainly been a learning experience. Looks like I’ll just have to take the 30% throughput improvement I’ve made to this point.

There is no constraint for a byte-sized operand, because the GPU does not have byte-sized (sub-)registers. See section 1.1.2 of the linked document for the available constraint types: https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html

If there were a byte constraint, you could do this:

asm ("mov.b16 {%0,%1},%2;\n\t" : "=b"(lo), "=b"(hi) : "h"(halfword));

MOV to or from operand pairs is probably better thought of as a re-interpretation than a conversion. In that sense it is equivalent to the union approach I showed earlier.

There are no vector registers per se in the GPU. All registers are 32 bit, but there is usually a way to access high and low halves (as I recall the details differ by architecture).
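Although there is no byte constraint, the pack/unpack form of mov does work inline at 16-bit granularity, since "h" is a valid constraint. A minimal sketch (my addition, not from the thread) splitting a 32-bit register into its two 16-bit halves:

```cuda
#include <cstdio>
#include <cstdint>

// Sketch: mov.b32 with a vector destination unpacks a 32-bit scalar
// into two 16-bit halves; the first vector element receives the
// least significant half.
static __device__ __forceinline__ void unpack32 (uint32_t w,
                                                 uint16_t &lo,
                                                 uint16_t &hi)
{
    asm ("mov.b32 {%0, %1}, %2;" : "=h"(lo), "=h"(hi) : "r"(w));
}

__global__ void kernel (uint32_t w)
{
    uint16_t lo, hi;
    unpack32 (w, lo, hi);
    printf ("lo = %04x, hi = %04x\n", lo, hi); // for 0xdeadbeef: lo = beef, hi = dead
}

int main (void)
{
    kernel<<<1,1>>>(0xdeadbeef);
    cudaDeviceSynchronize();
    return 0;
}
```

This works because 16-bit halves of a register are addressable, whereas individual bytes are not.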

64-bit operands occupy two registers. When operated on by instructions that work on 64-bit operands, the two registers must be a pair, with the low-order bits in the even-numbered register and the high-order bits in the odd-numbered register.

That makes complete sense, and it’s a shame this isn’t an option. I do appreciate that sub-32-bit integer logic is probably quite niche in the scheme of things, and looking at Turing, it seems to be getting more so (although I’ve not yet done testing there). I note that throughputs for integer add, sub, AND, OR, XOR, and presumably LOP3 have halved compared to Pascal, and BFE/BFI are now multiple instructions.

I guess they need the transistors for the RTX stuff…

Depending on what you are doing, you might want to look into processing four bytes in parallel using full 32-bit registers. CUDA already provides some device intrinsics for such work.
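For instance (my sketch, assuming the standard per-byte SIMD intrinsics such as __vadd4 and __vmaxu4 that operate on packed bytes in a 32-bit register):

```cuda
#include <cstdio>
#include <cstdint>

// Sketch: four bytes processed per 32-bit register using CUDA's
// byte-wise SIMD intrinsics. __vadd4 adds per byte with no carry
// between byte lanes; __vmaxu4 takes the per-byte unsigned maximum.
__global__ void kernel (void)
{
    uint32_t a = 0x01020304;
    uint32_t b = 0x10203040;
    printf ("__vadd4  = %08x\n", __vadd4  (a, b)); // 0x11223344
    printf ("__vmaxu4 = %08x\n", __vmaxu4 (a, b)); // 0x10203040
}

int main (void)
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Plain bitwise operators (^, &, |) already operate on all four bytes at once, so for pure XOR work the packing itself is the only change needed.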

The basic design philosophy of GPUs is very close to classical RISC, in that they try to make do with a small set of simple instructions. This allows many more cores to be incorporated per SM, or a much larger number of SMs.

I have no insight into how many percent of Turing’s die area are dedicated to RTX units, but note that the principal heavy-duty compute platform at this point in time is still Volta.