To coalesce, thread 0 of a half-warp needs to read memory location [N16+0], thread 1, needs to read [N16+1], etc. You are reading memory locations SegmentLen apart between the threads.
If you read the programming manual (or check devicequery), you can launch a grid of size 65535 x 65535. So yes, you can easily launch such a grid.
You cannot have more than x blocks at the same time on 1 multiprocessor (depends on block size and amount of registers used & shared memory). Other blocks get scheduled as soon as a block is finished. To get optimal performance out of CUDA, you normally want to launch thousands of blocks (or at least as much as possible)
In my 2048x2048 case, SegmentLen is 32 = 16*2; yet the profiler tells me I have fully uncoalesced reads and writes, and indeed the relative performances are no better than the 1024x1024 case (4x worse in fact, which is linear worsening).
Anyway this evening I’ll try the new code, thanks! External Media