Is the instruction "mov.b32 {r,g,b,a},%r1;" even supported??

cheinger · July 16, 2016, 1:04pm

Hey experts!

I am having trouble with splitting an integer into 4 uint8_ts.

I can’t seem to get this to work. It just says Arguments mismatch for instruction ‘mov’ but I can’t find the constraint character for a uint8_t or b8.

U32 t = 0x00112233;
U32 a, b, c, d;
asm("mov.b32 {%0,%1,%2,%3}, %4; " : "=r"(a), "=r"(b), "=r"(c), "=r"(d) : "r"(t));

Any thoughts?

njuffa · July 16, 2016, 3:33pm

See: [url]Missing CUDA inline PTX constraint letter for 8 bit variables in order to disable L1 cache for 8 bit variable (bool) - Stack Overflow

scottgray · July 16, 2016, 5:50pm

If you provided a little context about what you’re trying to do I could probably suggest a way around the ptx limitation.

For example, you could use bfe (bit field extract) to unpack each element into integer values (this is likely what the mov instruction would get converted to anyway if it worked since you don’t need sign extension.)

Or you could look at the scalar video instructions (vmad, vadd, vabsdiff, etc) and perhaps do what you need to do in place.

njuffa · July 16, 2016, 9:03pm

Instead of using the video instructions at PTX level, I would suggest using CUDA’s SIMD intrinsics instead, otherwise you could find yourself in a world of hurt when working on a non-Kepler platform.

scottgray · July 17, 2016, 1:05am

I think the scalar video ptx instructions are fine. It’s the SIMD video instructions you want to avoid (vadd2, vadd4, etc). Maxwell and Pascal have native support for VADD, VMAD, VABSDIFF, VMNMX, and VSET.

And of course sm_61 now has the dp4a and dp2a instructions as well.

njuffa · July 17, 2016, 2:54am

Since it seemed to me that the OP was interested in packed-byte operation I assumed interest would be in the SIMD video instructions. When or where are the scalar video instructions useful, in your experience? I have never had a need to use them, nor did I come across a customer use case where these seemed beneficial.

The utility of the SIMD video instructions for various image processing and bioscience applications is immediately obvious, and I wasn’t happy when I learned that NVIDIA decided to largely rip them out post Kepler (I think support for byte-wise absolute differences was retained).

For just re-interpreting the data in two different ways, a union of ‘unsigned int’ and ‘uchar4’ should work well enough, I don’t see the need to drop to PTX inline assembly for that.

scottgray · July 17, 2016, 8:12pm

njuffa’s point about the union is a good way to avoid having to use explicit bfe instructions. Just let the compiler work that out for you. My thinking tends to be biased to the low level implementation side of things.

As far as scalar video instructions… they can be handy for doing things in fixed point math. But vmad is actually pretty useful for multiplying two integers when you know the operands are within 16 bits. On Maxwell/Pascal The compiler typically generates at least 3 XMADs for an integer multiple of any integer data types. You can use something like this to get that down to a single instruction and reduced register pressure:

__device__ __forceinline__ int mad16(int a, int b, int c)
{
    int res;
    asm("vmad.s32.u32.u32 %0, %1.h0, %2.h0, %3;" : "=r"(res) : "r"(a), "r"(b), "r"(c));
    return res;
}

It would be nice if the compiler were a little smarter about using the minimal number of XMADs for integer multiplies of various data types.

njuffa · July 17, 2016, 10:48pm

Thanks for the insight into the utility of scalar video instructions.

The number of integer multiply variants (at PTX level) that need to be emulated by XMAD sequences on Maxwell and following architectures is quite high, around twenty if I recall correctly. For wider data types, one can usually construct multiple variants with different throughput, latency, and register pressure making it difficult to pick one that is “optimal”.

Although I was not directly involved, as I had grown tired of cooking up emulation sequences, I am under the impression that considerable effort went into the construction of fast integer multiply emulation code for Maxwell, which however does not mean better code cannot be found. Optimization is very much an iterative process, with additional improvements possible at each iteration.

If you know of any emulation sequences that appear particularly deficient, or have ideas for improved variants, I would suggest filing RFEs to get those improvements into PTXAS for future versions of CUDA.

scottgray · July 18, 2016, 4:16am

I basically just want variants of these two (the first can be achieved with vmad):

signed and unsigned 16bit * 16bit:

XMAD.U16.U16 d, a, b, c;
XMAD.S16.S16 d, a, b, c;

signed and unsigned 16bit (a) * 32bit (b):

XMAD.U16.U16     d, a, b,    c;
XMAD.U16.U16.PSL d, a, b.H1, d;

XMAD.S16.S16     d, a, b,    c;
XMAD.S16.S16.PSL d, a, b.H1, d;

This second one is particularly handy for pointer offset arithmetic. It requires no additional registers and is 1 instruction shorter than the standard 3 instructions used.

Topic		Replies	Views
Missing CUDA inline PTX constraint letter for 8 bit CUDA Programming and Performance	6	1450	November 25, 2015
MOV Confusion CUDA Programming and Performance	7	973	February 6, 2020
Generating XMAD{.X,.CC} by PTX CUDA Programming and Performance	4	1075	February 12, 2019
XMAD meaning CUDA Programming and Performance	17	6931	April 10, 2017
CUDA intrinsics? CUDA Programming and Performance	7	3529	November 16, 2017
The instructions of VABSDIFF4 increases in CUDA9.2 (VOLTA) CUDA Programming and Performance	8	1633	June 1, 2018
Peak Performance of integer operation CUDA Programming and Performance	3	2858	May 11, 2017
Inline PTX and Mov CUDA Programming and Performance	13	1021	November 2, 2021
A question about calculation of integer (or short integer) and float data CUDA Programming and Performance	8	3319	April 4, 2014
long-integer multiplication: mul.wide.u64 and mul.wide.u128 CUDA Programming and Performance	31	7605	January 2, 2018

Is the instruction "mov.b32 {r,g,b,a},%r1;" even supported??

Related topics