Inline assembly set.gtu reports "Unexpected instruction types specified for 'set'"

I tried this:

#include "cuda_runtime.h"
#include "cuda_fp16.h"
#include <stdio.h>
//#include <cuda.h>
#include <stdint.h>
#include <mma.h>
#include <iostream>

static inline __device__ void set_gtu_test() {
    printf("Before hello world!!!");
    uint16_t a = 0x0011;
    uint16_t x = a;
    uint16_t mask;
    // This is the line that ptxas rejects.
    asm volatile("set.gtu %0, %1, 0;" : "=h"(mask) : "h"(x));
    printf("Hello world!");
}

__global__ void call_set_gtu_test() {
    set_gtu_test();
}

int main() {
    call_set_gtu_test<<<1, 1>>>();
    // Wait for the kernel so that device-side printf output is flushed.
    cudaDeviceSynchronize();
    return 0;
}

with the following compilation command:

/usr/local/cuda-11.4/bin/nvcc -I"/usr/local/cuda/include" --gpu-architecture=compute_87 --gpu-code=sm_87 -o test_CUDA_inline_assembly_set_gtu_simple --use_fast_math --expt-relaxed-constexpr --expt-extended-lambda -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__

on a machine with several DGX A100 cards installed, but I only got the following error:

ptxas /tmp/tmpxft_0003f2bc_00000000-6_test_CUDA_inline_assembly_set_gtu_simple.ptx, line 53; error   : Unexpected instruction types specified for 'set'
ptxas fatal   : Ptx assembly aborted due to errors

What could be the cause? Thanks in advance!

A guess here, but "--gpu-architecture=compute_87 --gpu-code=sm_87" appears to target the Jetson and Drive AGX Orin systems.

For an Ampere A100, use "--gpu-architecture=compute_80 --gpu-code=sm_80".
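If it is unclear which architecture a machine needs, one option is to ask the CUDA runtime for each device's compute capability. This is only a minimal sketch: the helper name arch_flag is made up for illustration, and it assumes the CUDA runtime is installed and at least one device is visible.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of any CUDA API): build the value that
// would be passed to --gpu-code from a compute capability.
static std::string arch_flag(int major, int minor) {
    return "sm_" + std::to_string(major) + std::to_string(minor);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("No CUDA device visible\n");
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // prop.major / prop.minor hold the compute capability of the device.
        printf("device %d: %s -> %s\n", dev, prop.name,
               arch_flag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

An A100 reports compute capability 8.0, so the helper would yield sm_80 for it.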

In the PTX manual, the specification of the set instruction states:

.dtype = { .u32, .s32, .f32 };

So a 16-bit type for the destination operand (result) is not supported.
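For comparison, here is a sketch of a set instruction that spells out both types, using a 32-bit destination (.u32) and a float source (.f32). It is not the code from the question: the function names and the host helper set_gtu_mask are hypothetical, and the float source type was picked only because .gtu is an unordered floating-point comparison.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdint>

// set.CmpOp.dtype.stype with CmpOp = gtu, dtype = u32, stype = f32.
// With an integer dtype the result is 0xFFFFFFFF when the comparison
// holds (x > 0, or x is NaN because gtu is unordered) and 0 otherwise.
__device__ uint32_t set_gtu_f32(float x) {
    uint32_t mask;
    asm volatile("set.gtu.u32.f32 %0, %1, %2;"
                 : "=r"(mask)
                 : "f"(x), "f"(0.0f));
    return mask;
}

__global__ void set_gtu_kernel(float x, uint32_t* out) {
    *out = set_gtu_f32(x);
}

// Hypothetical host helper: run the kernel for one value, return the mask.
// Error checking is omitted to keep the sketch short.
uint32_t set_gtu_mask(float x) {
    uint32_t* d_out = nullptr;
    uint32_t h_out = 0;
    cudaMalloc(&d_out, sizeof(uint32_t));
    set_gtu_kernel<<<1, 1>>>(x, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    printf("x =  1.0f -> 0x%08x\n", set_gtu_mask(1.0f));
    printf("x = -1.0f -> 0x%08x\n", set_gtu_mask(-1.0f));
    return 0;
}
```

Note the "=r" constraint: the destination is a 32-bit register, which matches the .u32 dtype, whereas "=h" in the question binds a 16-bit register.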

In fact, you can find very similar code:

static inline __device__ uint16_t clamp_to_zero(uint16_t x) {
    uint16_t mask;
    asm volatile("set.gtu %0, %1, 0;" : "=h"(mask) : "h"(x));
    return mask & x;
}

from here.
It compiles OK, but when run it produces no output at all…

I don’t get any indication that that code compiles OK. It does not compile OK for me.

Later: It's not obvious to me that set.gtu is a syntactically correct PTX instruction. From what I can tell, at a minimum it requires both an stype and a dtype.

Yes, I acknowledge some NVIDIA code has it. I can “sort of” get that particular function to compile if I don’t use it. But if I attempt to use it, I get the same PTX error reported in the original post in this thread. So I don’t believe it is actually valid, and my guess is that codebase never uses that function (although I have not grepped for it.) Or else there is something else I don’t understand. FWIW the other usage in that file seems to be formatted better/differently.

In any event, the error seems to point directly to this issue. The closest I can come that compiles is:

static inline __device__ uint16_t clamp_to_zero(uint16_t x) {
    uint32_t mask;
    asm volatile(" %0, %1, 0;" : "=r"(mask) : "h"(x));
    return mask & x;
}
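A different route that avoids inline PTX altogether is to use the half-precision comparison intrinsics from cuda_fp16.h. The sketch below is a swapped-in technique, not the code from the NVIDIA source. It assumes that x carries the raw bits of an fp16 value, that the intent is to keep x when it is greater than zero (or NaN) and return 0 otherwise, and that the code is compiled for compute capability 5.3 or higher (for example sm_80). The names clamp_to_zero_intrinsic and clamp_bits are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <cstdio>
#include <cstdint>

// Reinterpret the 16 bits as __half and compare against zero with the
// unordered greater-than intrinsic (true if h > 0 or h is NaN).
static inline __device__ uint16_t clamp_to_zero_intrinsic(uint16_t x) {
    const __half h = __ushort_as_half(x);
    const __half zero = __float2half(0.0f);
    return __hgtu(h, zero) ? x : static_cast<uint16_t>(0);
}

__global__ void clamp_kernel(uint16_t x, uint16_t* out) {
    *out = clamp_to_zero_intrinsic(x);
}

// Hypothetical host helper: run the kernel for one bit pattern.
// Error checking is omitted to keep the sketch short.
uint16_t clamp_bits(uint16_t x) {
    uint16_t* d_out = nullptr;
    uint16_t h_out = 0;
    cudaMalloc(&d_out, sizeof(uint16_t));
    clamp_kernel<<<1, 1>>>(x, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(uint16_t), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main() {
    // 0x3C00 is 1.0 and 0xBC00 is -1.0 in fp16.
    printf("0x3C00 -> 0x%04x\n", clamp_bits(0x3C00));
    printf("0xBC00 -> 0x%04x\n", clamp_bits(0xBC00));
    return 0;
}
```

Whether this matches the behavior intended by the original function depends on the assumptions above, so treat it as a starting point rather than a drop-in replacement.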