Using vector loads to copy data from global memory to shared memory

My code that loads data from global memory to shared memory is slow. Can I use a vector load like this?
extern __shared__ float sdata[];
int total_id = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x;
if (tid % 4 == 0) reinterpret_cast<float*>(sdata)[tid] = reinterpret_cast<float*>(in)[total_id];
__syncthreads(); …
But sdata is not right: some parts are 0. Why? And is it any better to use a vector load like this?

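Some parts are zero because only every fourth thread participates: the guard if (tid % 4 == 0) means the other three out of every four threads never write anything to sdata, so those locations keep whatever was there before.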
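As a rough sketch of what a vectorized copy could look like, here is a version where every thread moves one float4 and the tid % 4 guard is gone. It assumes float4 was the intended vector type, that the input pointer is 16-byte aligned (pointers from cudaMalloc are), and that the element count is a multiple of 4 * blockDim.x. The names copy_via_shared, run_copy_check and out are made up for this illustration and are not from the question. Whether it is actually faster than a plain coalesced float load depends on the kernel and the GPU, so it is worth profiling both.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread copies one float4 (four consecutive
// floats) from global to shared memory, then writes data back out so the
// host can verify that the whole shared tile was populated.
__global__ void copy_via_shared(const float* in, float* out)
{
    // Dynamic shared memory; the launch must pass blockDim.x * sizeof(float4).
    extern __shared__ __align__(16) float sdata[];

    int tid      = threadIdx.x;
    int total_id = threadIdx.x + blockIdx.x * blockDim.x;  // index in float4 units

    // Every thread participates, so there is no tid % 4 guard.
    reinterpret_cast<float4*>(sdata)[tid] =
        reinterpret_cast<const float4*>(in)[total_id];
    __syncthreads();

    // Read a slot written by a different thread to exercise the barrier.
    int other  = blockDim.x - 1 - tid;
    int out_id = other + blockIdx.x * blockDim.x;
    reinterpret_cast<float4*>(out)[out_id] =
        reinterpret_cast<const float4*>(sdata)[other];
}

// Hypothetical host helper: runs the kernel on n floats and returns true
// if the output matches the input exactly.
bool run_copy_check(int n, int threads)
{
    assert(n % (4 * threads) == 0);  // assumption stated in the lead-in

    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i + 1);

    float* d_in  = nullptr;
    float* d_out = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int    blocks = n / (4 * threads);          // one float4 per thread
    size_t shmem  = threads * sizeof(float4);   // shared bytes per block
    copy_via_shared<<<blocks, threads, shmem>>>(d_in, d_out);
    cudaError_t err = cudaDeviceSynchronize();

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) return false;
    return true;
}
```

Calling run_copy_check(1024, 256) from main launches one block of 256 threads, each moving 16 bytes. Note that the grid shrinks by a factor of four compared with a one-float-per-thread launch, because total_id now counts float4 elements rather than floats.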