I am using Compute Visual Profiler on Windows for my Fermi card, and when I run my program, the profiler tells me under Hints that "Memory access pattern is not coalesced. " and "Access pattern of global memory load is not coalesced resulting in multiple transactions. In perfectly coalesced access 1 gld instruction for 32, 64,128 bit word size should cause 1,2 and 4 L1 cache line(128 byte) accesses respectively. "
I have no clue why. My block size is 64 threads, and I have a load from global memory into shared memory in each iteration of my for loop with no other global memory accesses. Here is a code snippet:
#define LENGTH 65
// avoid shared memory bank conflict
__shared__ double Sd_buffer0[4*LENGTH];
__shared__ double Sc_buffer0[32];
__shared__ double Sd_buffer1[4*LENGTH];
__shared__ double Sc_buffer1[32];
for (int n = 0; n < 80; n += 2)
{
double temp0 = Sd[32*blockIdx.y + 32*blockIdx.x + n*8000 + threadIdx.x];
double temp1 = Sd[32*blockIdx.y + 32*blockIdx.x + (n+1)*8000 + threadIdx.x];
Sd_buffer0[threadIdx.x] = temp0;
Sd_buffer1[threadIdx.x] = temp1;
Sd_buffer0[threadIdx.x + LENGTH] = temp0;
Sd_buffer1[threadIdx.x + LENGTH] = temp1;
Sd_buffer0[threadIdx.x + 2*LENGTH] = temp0;
Sd_buffer1[threadIdx.x + 2*LENGTH] = temp1;
Sd_buffer0[threadIdx.x + 3*LENGTH] = temp0;
Sd_buffer1[threadIdx.x + 3*LENGTH] = temp1;
if (threadIdx.x < 32)
{
Sc_buffer0[threadIdx.x]= Sc[32*blockIdx.y + n*8000 + threadIdx.x];
Sc_buffer1[threadIdx.x]= Sc[32*blockIdx.y + (n+1)*8000 + threadIdx.x];
}
__syncthreads();
// process using shared memory
// write to global memory in a coalesced fashion
gamma1b[index + (threadIdx.x % 32)] = register1;
...
}
So my question is is there any reason that my memory accesses are not being coalesced? Has anyone else experienced this? Thank you in advance for your help.