Odd copying of global memory to local

Hi. For example, in a CUDA kernel I have a shared buffer of 1025 floats and 512 threads per block. What is the fastest way to fill the buffer?

__global__ void processAL1(float* samples, unsigned char* out, int samplesNum)
{
    int bidx = blockIdx.x;
    int tidx = threadIdx.x;               // 512 threads per block
    __shared__ float sampleBlock[1025];

    // need to copy 1025 floats from global memory 'samples', starting at offset bidx, into 'sampleBlock'

Well, each thread could read 2 floats from global memory, which covers 512*2 = 1024 elements, but I need 1025, so what should I do?
P.S.: OpenCL has a good function for this, async_work_group_copy; what can I do in CUDA?

The striding loop method I mentioned in your similar question should be efficient here.

[url]https://devtalk.nvidia.com/default/topic/1027320/cuda-programming-and-performance/correct-copying-to-local-shared-memory-/[/url]
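
A minimal sketch of that approach, assuming (as in the comment in your snippet) that the global start offset for this block really is bidx and that the kernel is launched with 512 threads per block; 'out' and 'samplesNum' are left unused here:

__global__ void processAL1(float* samples, unsigned char* out, int samplesNum)
{
    int bidx = blockIdx.x;
    int tidx = threadIdx.x;               // 512 threads per block
    __shared__ float sampleBlock[1025];

    // Block-stride copy: thread tidx loads elements tidx, tidx+512, tidx+1024, ...
    // With 512 threads and 1025 elements, every thread copies two floats and
    // thread 0 also picks up the last one (index 1024).
    for (int i = tidx; i < 1025; i += blockDim.x)
        sampleBlock[i] = samples[bidx + i];

    __syncthreads();                      // wait until the whole buffer is filled

    // ... process sampleBlock ...
}

If blocks near the end of 'samples' could stride past the array, you would also bound the loop by samplesNum (e.g. stop when bidx + i >= samplesNum), but that depends on how the kernel is launched.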