I’m trying to do some experimentation with the various load and store cache option exposed at the PTX level which aren’t exposed at the C level. Can someone tell me what I’m doing wrong in the following code snippet, where I’m trying to a cache streaming store of a float4?
__device__ inline void store_streaming_float4(float4* addr, float x, float y, float z, float w)
{
asm("st.cs.global.v4.f32 [%0+0], {%1, %2, %3, %4};" : "=l"(addr): "f"(x), "f"(y), "f"(z), "f"(w));
}
// call streaming store
store_streaming_float4(out+i, x, y, z, w);
// replaces this
out[i] = make_float4(x, y, z, w);
Here, out is a float4 pointer, and x, y, z, w are floats. When I use the inline ptx code instead of the original, the code produces segmentation fault. This is likely an easy fix, I imagine I’m doing something dumb.
Maybe you could try doing it in 32-bit mode first to isolate the problem? It appears to me that the problem is just inside the square bracket. Or it could be some problem with addr. Maybe you could check the cuobjdump output to take a look at how ptxas handles this series of instructions.