How does global memory coalescing work?

// a, b, c, and d are m x m matrices of floats, stored row-major
__global__ void Compute(float* a, float *b, float *c, float* d, int m) {
  int ix = blockDim.x * blockIdx.x + threadIdx.x;
  int iy = blockDim.y * blockIdx.y + threadIdx.y;
  int idx = m * iy + ix;
  d[idx] = a[idx] + b[idx] + c[idx];
}

I used two-dimensional thread blocks to compute the matrix operation d = a + b + c. When I use blockDim.x = 8, the performance is poor; when I use blockDim.x = 16 or blockDim.x = 32, the performance is good and nearly identical. As far as I know, an L2 cache memory transaction is 32 bytes, so why are blockDim.x = 16 and 32 better than blockDim.x = 8?

When m = 2048, I used nvprof to measure the memory transactions:

blockDim.x = 8:  l2_read_transactions  = 1572928 ≈ (2048 * 2048 * 4 bytes / 32 bytes) * 3
                 l2_write_transactions = 1048576 = (2048 * 2048 * 4 bytes / 32 bytes) * 2
blockDim.x = 16: l2_read_transactions  = 1572928 ≈ (2048 * 2048 * 4 bytes / 32 bytes) * 3
                 l2_write_transactions =  524301 ≈ (2048 * 2048 * 4 bytes / 32 bytes)
blockDim.x = 32: l2_read_transactions  = 1572928 ≈ (2048 * 2048 * 4 bytes / 32 bytes) * 3
                 l2_write_transactions =  524301 ≈ (2048 * 2048 * 4 bytes / 32 bytes)

Why do the L2 write transactions double when blockDim.x = 8?

I am really confused; I hope you can help. Thanks a lot.