Inline PTX problem Probably an easy fix?

Greetings,

I’m trying to do some experimentation with the various load and store cache option exposed at the PTX level which aren’t exposed at the C level. Can someone tell me what I’m doing wrong in the following code snippet, where I’m trying to a cache streaming store of a float4?

__device__  inline void store_streaming_float4(float4* addr, float x, float y, float z, float w)

{

  asm("st.cs.global.v4.f32 [%0+0], {%1, %2, %3, %4};" : "=l"(addr): "f"(x), "f"(y), "f"(z), "f"(w));

}

// call streaming store

store_streaming_float4(out+i, x, y, z, w);

// replaces this

out[i] = make_float4(x, y, z, w);

Here, out is a float4 pointer, and x, y, z, w are floats. When I use the inline ptx code instead of the original, the code produces segmentation fault. This is likely an easy fix, I imagine I’m doing something dumb.

Thanks.

try: “=r”(addr)

I don’t think this is the problem. Quoting the inline PTX guide:

"h" = .u16 reg

"r" = .u32 reg

"l" = .u64 reg

"f" = .f32 reg

"d" = .f64 reg

I’m compiling in 64-bit mode, so my addr pointer should use a 64-bit constraint, hence why I use “l” and not “r”.

Maybe you could try doing it in 32-bit mode first to isolate the problem? It appears to me that the problem is just inside the square bracket. Or it could be some problem with addr. Maybe you could check the cuobjdump output to take a look at how ptxas handles this series of instructions.

[font=“Courier New”]addr[/font] is input to the inline asm, not output. So the correct constraint would be

__device__  inline void store_streaming_float4(float4* addr, float x, float y, float z, float w)

{

  asm("st.cs.global.v4.f32 [%0+0], {%1, %2, %3, %4};" :: "l"(addr), "f"(x), "f"(y), "f"(z), "f"(w));

}

Thank you. Works fine now.