Help with coalescing

I have a very simple kernel:

__global__

void

kernel( ushort4 * output, size_t output_pitch )

{

    unsigned int u = blockIdx.x*blockDim.x + threadIdx.x;

    unsigned int v = blockIdx.y*blockDim.y + threadIdx.y;

    ushort4 p = sample_map(u,v);

    ushort4 * out = (ushort4 *)((char *)output+v*output_pitch) + u;

    *out = p;

}

In the profiler I can see that none of the memory writes are coalesced, I am wondering why?

My block size is 8x8, and the address passed as the output pointer is obtained from cudaMallocPitch, which according to the documentation should be correct.

I’m sure I’m missing something stupid but I can’t seem to find it…