Profiler says global memory loads are not coalesced

I am using Compute Visual Profiler on Windows for my Fermi card, and when I run my program, the profiler tells me under Hints that "Memory access pattern is not coalesced. " and "Access pattern of global memory load is not coalesced resulting in multiple transactions. In perfectly coalesced access 1 gld instruction for 32, 64,128 bit word size should cause 1,2 and 4 L1 cache line(128 byte) accesses respectively. "

I have no clue why. My block size is 64 threads, and I have a load from global memory into shared memory in each iteration of my for loop with no other global memory accesses. Here is a code snippet:

#define LENGTH 65

// avoid shared memory bank conflict

__shared__ double Sd_buffer0[4*LENGTH];

__shared__ double Sc_buffer0[32];

__shared__ double Sd_buffer1[4*LENGTH];

__shared__ double Sc_buffer1[32];

for (int n = 0; n < 80; n += 2)

{	

	double temp0 = Sd[32*blockIdx.y + 32*blockIdx.x + n*8000 + threadIdx.x];

	double temp1 = Sd[32*blockIdx.y + 32*blockIdx.x + (n+1)*8000 + threadIdx.x];

	Sd_buffer0[threadIdx.x] = temp0;

	Sd_buffer1[threadIdx.x] = temp1;

	Sd_buffer0[threadIdx.x + LENGTH] = temp0;

	Sd_buffer1[threadIdx.x + LENGTH] = temp1;	

	Sd_buffer0[threadIdx.x + 2*LENGTH] = temp0;

	Sd_buffer1[threadIdx.x + 2*LENGTH] = temp1;

	Sd_buffer0[threadIdx.x + 3*LENGTH] = temp0;

	Sd_buffer1[threadIdx.x + 3*LENGTH] = temp1;

	

	if (threadIdx.x < 32)

	{

		Sc_buffer0[threadIdx.x]= Sc[32*blockIdx.y + n*8000 + threadIdx.x];

		Sc_buffer1[threadIdx.x]= Sc[32*blockIdx.y + (n+1)*8000 + threadIdx.x];

	}

	

	__syncthreads();

	

	// process using shared memory

	

	// write to global memory in a coalesced fashion

	gamma1b[index + (threadIdx.x % 32)] = register1;

	...

}

So my question is is there any reason that my memory accesses are not being coalesced? Has anyone else experienced this? Thank you in advance for your help.