How to use vector loads in C for CUDA?

How should one write C for CUDA code so that the compiler emits vector load instructions? For instance, I tried

__shared__ float2 smem[N];
. . .
float2 x = smem[j];

or

__shared__ float smem[2*N];
. . .
float x[2];

x[0] = smem[2*j];
x[1] = smem[2*j+1];

But in both cases I get a pair of ld.shared.f32 PTX instructions instead of a single ld.shared.v2.f32.

Is there any way to do it?
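
One approach that is often suggested for the second variant is to give the float array 8-byte alignment and read it through a float2 pointer, so that each pair becomes a single 8-byte access. Below is a minimal sketch of that idea, not a guaranteed fix: the kernel name pairSum, the value of N, and the rotate-by-one indexing are all made up for illustration, and whether the compiler actually emits ld.shared.v2.f32 depends on the compiler version and target architecture.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 128  // hypothetical number of float pairs (and threads per block)

// Hypothetical kernel: each thread sums the pair that belongs to its neighbour.
__global__ void pairSum(const float *in, float *out)
{
    // 8-byte alignment is assumed here so that viewing the array as float2 is valid.
    __shared__ __align__(8) float smem[2 * N];

    int j = threadIdx.x;
    smem[2 * j]     = in[2 * j];
    smem[2 * j + 1] = in[2 * j + 1];
    __syncthreads();

    // Read another thread's pair so the load cannot be folded into the stores above.
    int k = (j + 1) % N;

    // A single 8-byte access; the compiler may turn this into one vector load.
    float2 x = ((const float2 *)smem)[k];
    out[j] = x.x + x.y;
}

int main(void)
{
    float h_in[2 * N], h_out[N];
    for (int i = 0; i < 2 * N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, sizeof(h_in));
    cudaMalloc((void **)&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    pairSum<<<1, N>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    // Compare against the same sums computed on the host.
    int bad = 0;
    for (int j = 0; j < N; ++j) {
        int k = (j + 1) % N;
        if (h_out[j] != h_in[2 * k] + h_in[2 * k + 1]) ++bad;
    }
    printf("mismatches: %d\n", bad);

    cudaFree(d_in);
    cudaFree(d_out);
    return bad != 0;
}
```

Compiling with nvcc -ptx and searching the output for ld.shared should show whether a v2 load was generated for your toolchain.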