Be warned, the volatile trick sometimes backfires.
cuComplex *g_vectorptr; // points to a location in global memory
int index = threadIdx.x;
g_vectorptr[index] = make_cuComplex(1.0f, 2.0f);
volatile int index2 = threadIdx.x;
g_vectorptr[index2] = make_cuComplex(1.0f, 2.0f);
The first write will be coalesced (a single st.global.v2.f32 store per element); the second creates two individual float stores (st.global.f32), which breaks coalescing. It took me a couple of hours to figure out why my code was running “suboptimally”.
The same happens with float2, float4, and likely all other vector types.
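If you want to check this on your own toolchain, here is a minimal sketch that puts the two variants into separate kernels so their generated code can be compared side by side. The kernel names and the host helper stores_match are made up for illustration, and whether the store actually gets split probably depends on compiler version and target architecture, so treat it as a test case rather than a guaranteed repro.

```cuda
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <vector>

// Plain index: the compiler is free to emit one 64-bit vector store per element.
__global__ void store_plain(cuComplex *g_vectorptr)
{
    int index = threadIdx.x;
    g_vectorptr[index] = make_cuComplex(1.0f, 2.0f);
}

// Volatile index: the variant reported above to fall back to two scalar stores.
__global__ void store_volatile_index(cuComplex *g_vectorptr)
{
    volatile int index2 = threadIdx.x;
    g_vectorptr[index2] = make_cuComplex(1.0f, 2.0f);
}

// Hypothetical host helper: runs both kernels on one block of n threads
// (1 <= n <= 1024) and checks that they wrote identical values.
// The results are the same either way; only the emitted stores differ.
bool stores_match(int n)
{
    cuComplex *d_a = nullptr, *d_b = nullptr;
    size_t bytes = n * sizeof(cuComplex);
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return false; }

    store_plain<<<1, n>>>(d_a);
    store_volatile_index<<<1, n>>>(d_b);

    // cudaMemcpy waits for the preceding kernels on the default stream.
    std::vector<cuComplex> a(n), b(n);
    bool ok = cudaMemcpy(a.data(), d_a, bytes, cudaMemcpyDeviceToHost) == cudaSuccess
           && cudaMemcpy(b.data(), d_b, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_a);
    cudaFree(d_b);

    for (int i = 0; ok && i < n; ++i)
        ok = a[i].x == 1.0f && a[i].y == 2.0f
          && b[i].x == 1.0f && b[i].y == 2.0f;
    return ok;
}
```

Compile the file with nvcc -ptx and look at the store instructions in each kernel: st.global.v2.f32 means the write stayed vectorized, a pair of st.global.f32 means it was split.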