uncached read ptx failure

I have the following simple inline PTX snippet, which should write two values to global memory without caching the data.

If both INT_NO_CACHE and SHORT_NO_CACHE are defined, the code does not work (the shorts are not written at all).
However, if only one of them is defined, the code works perfectly well.

Merging the two inline assembly pieces into a single one does not solve the problem.
Any ideas?

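__device__ inline void _writeNoCache(offset_t* dst_uint, ushort* dst_ushort, uint val_uint, ushort val_ushort)
{
#ifdef INT_NO_CACHE
	// 32-bit store with the .cs cache operator
	asm volatile(
	".reg .u64 addr;\n\t"
	"mov.u64 addr, %0; \n\t"
	"st.global.cs.u32 [addr], %1; \n\t"
	::"l"(dst_uint), "r"(val_uint));
#else
	*dst_uint = val_uint;     // assumed fallback: ordinary store
#endif

#ifdef SHORT_NO_CACHE
	// 16-bit store with the .cs cache operator
	asm volatile(
	".reg .u64 addr2;\n\t"
	"mov.u64 addr2, %0; \n\t"
	"st.global.cs.u16 [addr2], %1; \n\t"
	::"l"(dst_ushort), "h"(val_ushort));
#else
	*dst_ushort = val_ushort; // assumed fallback: ordinary store
#endif
}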
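
For anyone who wants to reproduce this, here is a minimal sketch of a variant that might be worth trying, plus a small host-side check. It is not a confirmed fix. It assumes both pointers refer to global memory (for example, from cudaMalloc), uses the pointer operand directly as the address so that no .reg declarations are needed, and adds a "memory" clobber so the compiler knows the statement writes memory. The names writeNoCacheSketch, writeKernel, and countMismatches are hypothetical and not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant of the helper: both stores always use st.global.cs.
// Assumption: dst_uint and dst_ushort point to global memory.
__device__ inline void writeNoCacheSketch(unsigned int* dst_uint,
                                          unsigned short* dst_ushort,
                                          unsigned int val_uint,
                                          unsigned short val_ushort)
{
    // The 64-bit pointer operand (%0) is used directly as the address.
    // "memory" marks the statement as one that modifies memory.
    asm volatile("st.global.cs.u32 [%0], %1;"
                 :: "l"(dst_uint), "r"(val_uint) : "memory");
    asm volatile("st.global.cs.u16 [%0], %1;"
                 :: "l"(dst_ushort), "h"(val_ushort) : "memory");
}

// Each thread writes i to the uint array and i + 1 to the ushort array.
__global__ void writeKernel(unsigned int* out_uint, unsigned short* out_ushort, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        writeNoCacheSketch(out_uint + i, out_ushort + i,
                           (unsigned int)i, (unsigned short)(i + 1));
}

// Runs the kernel and returns the number of elements that do not hold the
// expected value, or -1 if a CUDA runtime call fails.
int countMismatches(int n)
{
    unsigned int* d_uint = nullptr;
    unsigned short* d_ushort = nullptr;
    if (cudaMalloc(&d_uint, n * sizeof(unsigned int)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_ushort, n * sizeof(unsigned short)) != cudaSuccess) {
        cudaFree(d_uint);
        return -1;
    }
    // Fill both buffers with 0xFF bytes so that missing writes are visible.
    cudaMemset(d_uint, 0xFF, n * sizeof(unsigned int));
    cudaMemset(d_ushort, 0xFF, n * sizeof(unsigned short));

    const int threads = 64;
    writeKernel<<<(n + threads - 1) / threads, threads>>>(d_uint, d_ushort, n);

    std::vector<unsigned int> h_uint(n);
    std::vector<unsigned short> h_ushort(n);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_uint.data(), d_uint, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_ushort.data(), d_ushort, n * sizeof(unsigned short),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_uint);
    cudaFree(d_ushort);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (h_uint[i] != (unsigned int)i || h_ushort[i] != (unsigned short)(i + 1))
            ++bad;
    return bad;
}
```

Calling countMismatches(64) from main and printing the result shows whether any element was left untouched. If the shorts are still missing with this variant, comparing the generated PTX (nvcc -ptx) for the one-macro and two-macro builds may help narrow down where the second store disappears.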