I have an FDTD code written in CUDA. I ran it on an 8800GT and a GTX 295, and found the GTX 295 is about 15x faster. Using the visual profiler, I found that on the 8800GT all of the gld and gst operations are non-coalesced. I think this non-coalesced global memory access is responsible for the slow-down.
I am seeking advice on converting my kernel to coalesced access, hoping to gain up to a 16x speed-up on the 8800GT and likely an additional 2x on the 295.
My kernel is quite simple, something like:
__global__ void UpdateH(float4 Ez[], float2 H[], float4 coH[], uint2 img, uint2 block) {
    int ix = blockDim.x * blockIdx.x + threadIdx.x;
    int iy = blockDim.y * blockIdx.y + threadIdx.y;
    int id = ix * img.y + iy;
    ...   // co and eo are computed here (elided)
    H[id] = make_float2(co.x * H[id].x + co.y * (eo - Ez[id+1].x - Ez[id+1].y),
                        co.z * H[id].y + co.w * (Ez[id+img.y].x + Ez[id+img.y].y - eo));
}
My domain is a 512x512 grid (uint2 img={512,512}), with gridDim={128,128} and blockDim={4,4}. From my understanding of coalesced access, it is the Ez[id+1] term that ruins the alignment and thus turns the access into 16 individual per-thread reads (but I still don't understand why there isn't a single coalesced read; shouldn't the H[id] accesses all be coalesced?).
My question is: given such a simple operation, is there a practical way (by using shared memory, for example) to make the global memory reads/writes of this kernel coalesced?