I have seen that there is an LDS.128 instruction. Does anybody know how to write C code that makes the compiler generate this instruction? Thanks!
If your code loads double2, float4, or int4 data from shared memory, you should see an LDS.128 instruction being generated (provided your code uses all components of the vector; otherwise the compiler may optimize the code to use narrower loads).
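As a minimal sketch of the above (the kernel name, tile size, and launch configuration are made up for illustration), the following stages float4 data in shared memory and then has each thread read a neighbor's element, using all four components. Whether the compiler actually emits a 128-bit shared load depends on the target architecture, compiler version, and optimization level, so treat this as something to check in the SASS rather than a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 128  // illustrative block size, not taken from the original post

// Hypothetical kernel: each thread stores one float4 to shared memory,
// then loads a neighbor's float4 and sums all four components.
__global__ void sumFloat4(const float4 *in, float *out)
{
    __shared__ float4 tile[N];            // float4 is 16-byte aligned

    int t = threadIdx.x;
    tile[t] = in[t];                      // vector store to shared memory
    __syncthreads();

    // Reading another thread's element forces a real shared-memory load;
    // using x, y, z, and w keeps the compiler from narrowing it.
    float4 v = tile[(t + 1) % N];
    out[t] = v.x + v.y + v.z + v.w;
}

int main()
{
    float4 hIn[N];
    float  hOut[N];
    for (int i = 0; i < N; ++i)
        hIn[i] = make_float4((float)i, (float)i, (float)i, (float)i);

    float4 *dIn;
    float  *dOut;
    cudaMalloc(&dIn,  N * sizeof(float4));
    cudaMalloc(&dOut, N * sizeof(float));
    cudaMemcpy(dIn, hIn, N * sizeof(float4), cudaMemcpyHostToDevice);

    sumFloat4<<<1, N>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Each output should equal 4 * index of the neighboring element.
    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (hOut[i] != 4.0f * (float)((i + 1) % N)) ++errors;
    printf("mismatches: %d\n", errors);

    cudaFree(dIn);
    cudaFree(dOut);
    return errors != 0;
}
```

To see what was generated, compile with nvcc and dump the machine code with cuobjdump -sass on the resulting binary, then look for the shared-memory load in the kernel. The exact mnemonic can differ between GPU architectures (for example, it may carry additional modifiers alongside the .128 width).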