coalescing a finite difference operator

I have an FDTD code written in CUDA. I ran the code on 8800GT and GTX 295, I found the speed of 295 is about 15x faster. Using visual profiler, I found on 8800GT, all of the gld and gst operations are non-coalesced. I think the non-coalescing of the global memory access is guilty for the slow-down.

I am seeking advices to convert my kernel to coalesced access, to gain 16x fold speed-up on 8800 and likely an additional 2x fold on 295.

My kernel is quite simple, something like:

kernel void UpdateH(float4 Ez[], float2 H[], float4 coH[], uint2 img, uint2 block ) {

	int ix =blockDim.x * blockIdx.x + threadIdx.x;

	int iy =blockDim.y * blockIdx.y + threadIdx.y;

		int id = ix*img.y+iy;


	H[id]=float2(co.x*H[id].x + co.y*(eo-Ez[id+1].x-Ez[id+1].y),

				 co.z*H[id].y + co.w*(Ez[id+img.y].x+Ez[id+img.y].y-eo));


My domain is a 512x512 grid (uint2 img={512,512}). The gridDim={128,128}, the blockDim={4,4}. From my understanding to the coalesced access, it is the Ez[id+1] term that had ruined the alignment and thus, turn this into 16 individual thread read (but I still don’t understand why there is not a single coalesced read? shouldn’t H[id] terms all coalesced? )

My question is, given such a simple operation, is there a piratical way (by using shared memory, for example) to make this operator coalesced in global memory read/write?

thanks in advance

Good question, have you read this?…pu_3dfd_rev.pdf

Fastest way (beside reading Paulius article as Simon suggested) would be to use textures to read the Ez data.

Also instead of writing H[id].x and then H[id].y try to first load it to a float2 register and in the line that updates

H[id] use this register - the compiler might or might not issue two read operations using the original code.