CUDA 4.0 asm operand constraints for 8/16-bit types

For CUDA 4.0, what are the inline asm register constraint strings for 8-bit and 16-bit types? In previous versions, you would use “r” for any operands 32-bits or smaller. For example, you used to be able to:

// Pointer operand constraints for inlining PTX assembly

#if defined(_WIN64) || defined(__LP64__)

	// 64-bit pointer operand constraint for inlined asm

	#define _ASM_PTR_ "l"

#else

	// 32-bit pointer operand constraint for inlined asm

	#define _ASM_PTR_ "r"

#endif

__device__ __forceinline__ void Ld(unsigned short &val, unsigned short* d_ptr, size_t offset) {

	asm("ld.global.cg.u16 %0, [%1];" : "=r"(val) : _ASM_PTR_(d_ptr + offset));

}

But now cudafe complains about the “r” modifier for the unsigned short “val”:

error: asm operand type size(2) does not match type/size implied by constraint 'r'

Thanks,

dgm4d

There is a new PDF in the doc directory in inline assembly. According to it, .u16 registers use the constraint “h”.

Thanks! “h” works perfectly for 16-bit shorts. Hoping that there was just a document omission for 8-bit chars, I tried all 42 lower/upper single-character register constraints, but sadly there doesn’t seem to be anything for the 8-bit types.

Casting seems to work for the time being (since everything’s 32-bit registers anyway), even if it’s a little brittle-looking:

__device__ __forceinline__ void Ld(unsigned char &val, unsigned char* d_ptr, size_t offset) {

	asm("ld.global.cg.u8 %0, [%1];" : "=r"(*reinterpret_cast<unsigned int*>(&val)) : _ASM_PTR_(d_ptr + offset));

}

The code seems to fail on Warp Misaligned Address error. Weird, because u8 is supposed to have no alignment constraints.

Any ideas?