coalesced read to shared memory

Hi guys,
I want to find out whether my kernel reads global data in a coalesced way or not. Please help, I have struggled with this for a long time.

Here are my thread block and grid sizes:
blocksize(8, 16)
gridsize(1, 8192/16, 1)

Here is my kernel:

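__global__ void myKernel(const float* D, const float* L)
{
    int tx = threadIdx.x;   // 0..7
    int ty = threadIdx.y;   // 0..15
    int by = blockIdx.y;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;   // global row index

    // Real is presumably a typedef for float (its definition is not shown here)
    __shared__ Real Ls[16][8];
    __shared__ Real Ds[16];

    Ds[ty] = D[tidy];             // all 8 threads with the same ty read the same element of D
    Ls[ty][tx] = L[ty*8+tx];      // index equals the linear thread id within the block
    // (rest of the kernel not shown)
}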

L is a matrix of size 8192 x 8.
D is a vector of size 8192.

My thread block size is (8, 16), so each thread loads one element from global memory into shared memory. When I profiled my code, I found a large proportion of uncoalesced reads in this kernel. If anyone is familiar with coalescing, please help.
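
To make the access pattern concrete, the element each thread reads can be tabulated on the host. The following is a minimal C++ sketch (not CUDA, just a model of the indexing), assuming the strict half-warp rule of compute capability 1.0/1.1 devices, where lane k of a half-warp has to read the k-th consecutive word for the load to coalesce. Newer devices relax this rule, so treat the result as an illustration only. The names Load, elementIndex, halfWarpIndices and isSequential are hypothetical helpers made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Block shape from the post: blockDim = (8, 16).
constexpr int kBlockDimX = 8;
constexpr int kBlockDimY = 16;
// Assumption: coalescing is judged per half-warp of 16 threads (compute capability 1.x).
constexpr int kHalfWarp = 16;

// Which of the two global loads is being modeled (hypothetical helper type).
enum class Load { D, L };

// Element index read by the thread with linear id `tid` in block `by`.
// Threads are numbered tid = threadIdx.x + threadIdx.y * blockDim.x.
int elementIndex(Load which, int tid, int by) {
    assert(tid >= 0 && tid < kBlockDimX * kBlockDimY);
    int tx = tid % kBlockDimX;
    int ty = tid / kBlockDimX;
    if (which == Load::D) {
        return by * kBlockDimY + ty;   // D[tidy]
    }
    return ty * kBlockDimX + tx;       // L[ty*8+tx]
}

// Indices touched by one half-warp (16 consecutive linear thread ids).
std::vector<int> halfWarpIndices(Load which, int halfWarp, int by) {
    std::vector<int> out;
    for (int lane = 0; lane < kHalfWarp; ++lane) {
        out.push_back(elementIndex(which, halfWarp * kHalfWarp + lane, by));
    }
    return out;
}

// True when lane k reads element base + k, i.e. consecutive words in lane order.
bool isSequential(const std::vector<int>& idx) {
    for (std::size_t k = 0; k < idx.size(); ++k) {
        if (idx[k] != idx[0] + static_cast<int>(k)) return false;
    }
    return true;
}
```

Under this model, isSequential(halfWarpIndices(Load::L, 0, 0)) is true, because ty*8+tx is exactly the linear thread id. The same check for Load::D is false: the first eight lanes all read D[by*16] and the next eight all read D[by*16 + 1]. That makes the D load the likely source of the uncoalesced reads. One possible fix is to let only 16 consecutive threads of the block load the 16 elements of D. Also note that the L index as posted has no blockIdx term, so every block reads the same 128 elements of L. Whether that is intended is not clear from the code shown.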

Thank you.