Inline PTX and Mov

I’m trying to utilise the mov instruction, in its pack/unpack mode: PTX ISA :: CUDA Toolkit Documentation

The write-up states, “…write vector register d with the unpacked values from scalar register a.”

Compiling the following:

__device__ __inline__ uchar2 mov(uint16_t srcA) {
    uchar2 ret;
    asm ("mov.b16 %0, %1;" : "=h"(ret) : "h"(srcA));
    return ret;
}

brings the error message: “an asm operand must have scalar type”.

Is this incorrect PTX syntax, a bug, or is this instruction mode not available via PTX?

uchar2 is a vector type. This works for me:

#include <stdint.h>
__device__ __inline__ ushort mov(uint16_t srcA) {
    ushort ret;
    asm ("mov.b16 %0, %1;" : "=h"(ret) : "h"(srcA));
    return ret;
}

If you want your original function prototype, this also seems to work:

#include <stdint.h>
__device__ __inline__ uchar2 mov(uint16_t srcA) {
    ushort ret1;
    asm ("mov.b16 %0, %1;" : "=h"(ret1) : "h"(srcA));
    uchar2 ret = *reinterpret_cast<uchar2 *>(&ret1);
    return ret;
}

Yes, that is the purpose of the instruction, to convert from scalar to vector, or am I completely missing something here? (Written prior to the second example above.)

The purpose of the mov instruction is to copy data between registers.

Vector types can be used in source or destination when packing or unpacking of the vector is requested.

I think what you are asking about here is type conversion.

It seems that the proximal source of the error is here:

Only scalar variables are allowed as asm operands. Specifically aggregates like ‘struct’ type variables are not allowed
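
The pack/unpack forms themselves remain usable from inline PTX as long as every operand you bind is a scalar. Something along these lines (my own sketch, not taken from the documentation) should compile:

#include <stdint.h>
// pack two 16-bit scalars into one 32-bit scalar: mov.b32 d, {a, b};
__device__ __inline__ uint32_t pack16x2(uint16_t lo, uint16_t hi) {
    uint32_t d;
    asm ("mov.b32 %0, {%1, %2};" : "=r"(d) : "h"(lo), "h"(hi));
    return d;
}
// unpack one 32-bit scalar into two 16-bit scalars: mov.b32 {a, b}, d;
__device__ __inline__ void unpack16x2(uint32_t d, uint16_t &lo, uint16_t &hi) {
    asm ("mov.b32 {%0, %1}, %2;" : "=h"(lo), "=h"(hi) : "r"(d));
}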

Thanks for all that, Robert. So it seems to be an inline PTX limitation, as the documentation entry I linked in the OP implies (to me, anyway) that a straightforward conversion from vector to scalar, or vice versa, is possible.

All of the examples I see in your doc link show pack/unpack examples, not conversion from vector to scalar.

I’m clearly not getting some concept here (not that I’m expecting you to spend more time on it). The statement, “write vector register d with the unpacked values from scalar register a.”, seems like a conversion to me.

Nomenclature may differ between programmers. When simply moving bits between variables of different types, I would call that re-interpretation, not conversion. So uint32_t x; __int_as_float(x) is a re-interpretation, while uint32_t x; float(x) is a conversion. Below is a (not very useful) worked example using different flavors of mov.

The problem in the thread-starting post is with the binding, i.e. the "=h". Binding a symbolic operand in inline PTX to a CUDA HLL variable requires that variable to be of scalar type, but ret is of vector type uchar2. I am reasonably sure this is a limitation of the inline-assembly-operand binding mechanism common to clang, gcc, etc. Robert Crovella’s first post in the thread already showed how to get around this limitation.

/* Multiply the four 16-bit lanes of 'a' by the corresponding lanes of 'b',
   producing four 32-bit products packed into a ulonglong2. Uses mov.b64 and
   mov.b32 with braced operand lists to unpack and re-pack the data. */
__device__ ulonglong2 foo (unsigned long long a, unsigned long long b)
{
    ulonglong2 res;

    asm ("{\n\t"
         ".reg .u32       alo, ahi, blo, bhi, r0, r1, r2, r3;\n\t"
         ".reg .u16       a0, a1, a2, a3, b0, b1, b2, b3;\n\t"
         "mov.b64         {alo,ahi}, %2;\n\t"
         "mov.b64         {blo,bhi}, %3;\n\t"
         "mov.b32         {a0,a1}, alo;\n\t"
         "mov.b32         {a2,a3}, ahi;\n\t"
         "mov.b32         {b0,b1}, blo;\n\t"
         "mov.b32         {b2,b3}, bhi;\n\t"
         "mul.wide.u16    r0, a0, b0;\n\t"
         "mul.wide.u16    r1, a1, b1;\n\t"
         "mul.wide.u16    r2, a2, b2;\n\t"
         "mul.wide.u16    r3, a3, b3;\n\t"
         "mov.b64         %0, {r0,r1};\n\t"
         "mov.b64         %1, {r2,r3};\n\t"
         "}"
         : "=l"(res.x), "=l"(res.y)
         : "l"(a), "l"(b));

    return res;
}
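
To make the re-interpretation/conversion distinction concrete, here is one more trivial sketch (using __uint_as_float, the unsigned counterpart of __int_as_float):

#include <stdint.h>
__global__ void reinterpret_vs_convert (uint32_t x, float *out)
{
    out[0] = __uint_as_float (x); // re-interpretation: the 32 bits of x viewed as a float
    out[1] = float (x);           // conversion: the value of x rounded to the nearest float
}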

Yes, I used the word conversion incorrectly. It’s not a type conversion.

Thanks again to you both, in particular Norbert for the example. I was meaning scalar → vector conversion.

I was reading the doc under the impression that the compiler recognised the input/output types and the hardware did a “black box” operation in one instruction. Clearly not.

I personally would not make assumptions about hardware behavior (SASS) based on PTX. I view that as a common error. I studied the SASS for a bunch of the examples I worked through, and they were all quite efficient.

PTX does not represent the hardware.

Here is an example:


$ cat t1912.cu
#include <stdint.h>
__device__ __inline__ uchar2 mov(uint16_t srcA) {
    ushort ret1;
    asm ("mov.b16 %0, %1;" : "=h"(ret1) : "h"(srcA));
    uchar2 ret = *reinterpret_cast<uchar2 *>(&ret1);
    return ret;
}
__device__ uchar2 dest;
__global__ void k(uint16_t src){
    dest = mov(src);
}
$ nvcc -c t1912.cu -arch=sm_52
$ cuobjdump -sass t1912.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_52
                Function : _Z1kt
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                      /* 0x001c4400fe0007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;             /* 0x4c98078000870001 */
        /*0010*/         {         MOV32I R2, 0x0 ;                   /* 0x010000000007f002 */
        /*0018*/                   LDC.U16 R0, c[0x0][0x140]         }
                                                                      /* 0xef9200001407ff00 */
                                                                      /* 0x001fbc01fe2007ff */
        /*0028*/                   MOV32I R3, 0x0 ;                   /* 0x010000000007f003 */
        /*0030*/                   STG.E.U16 [R2], R0 ;               /* 0xeeda200000070200 */
        /*0038*/                   NOP ;                              /* 0x50b0000000070f00 */
                                                                      /* 0x001ffc00ffe007eb */
        /*0048*/                   NOP ;                              /* 0x50b0000000070f00 */
        /*0050*/                   EXIT ;                             /* 0xe30000000007000f */
        /*0058*/                   BRA 0x58 ;                         /* 0xe2400fffff87000f */
                                                                      /* 0x001f8000fc0007e0 */
        /*0068*/                   NOP;                               /* 0x50b0000000070f00 */
        /*0070*/                   NOP;                               /* 0x50b0000000070f00 */
        /*0078*/                   NOP;                               /* 0x50b0000000070f00 */
                ..........



Fatbin ptx code:
================
arch = sm_52
code version = [7,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

If we acknowledge that a load and store are necessary, then the compiler generated 0 extra instructions related to the “conversion”.

The whole point of PTX-level bit moves being used for re-interpretation is that they are usually removed entirely when ptxas applies optimizations. At GPU hardware level a register is a register is a register. But for a programming language environment with typed variables we need some way of expressing re-interpretation. CUDA intrinsics and PTX mov are still a lot more obvious than having to use a call to memcpy() for a totally standard-conforming re-interpretation in standard C++.
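
For completeness, the memcpy() flavor of such a helper might look like this (a sketch; the name as_uchar2 is mine). In my experience the copy is removed entirely by the compiler, just like the PTX-level mov:

#include <stdint.h>
#include <cstring>
__device__ __inline__ uchar2 as_uchar2 (uint16_t srcA)
{
    uchar2 ret;
    memcpy (&ret, &srcA, sizeof ret);  // standard-conforming re-interpretation of the two bytes
    return ret;
}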

Yes. I’ve just convinced myself of that and learnt quite a bit in the process :-)