Coalesced access with block width shorter than 16

On page 55 of the CUDA programming guide, I read that the width of the thread block must be a multiple of half the warp size (32/2 = 16 in my case) in order to get coalesced memory access when a thread with ID (tx, ty) accesses an element of a 2D array located at BaseAddress+width*ty+tx (where width is the width of the array, not of the thread block).
My question is: wouldn’t I also get coalesced memory access if the array had, for example, dimensions (Nx, Ny)=(8, 160) with a block of size (bNx, bNy)=(8, 32) instead of (16, 16)?

No, you wouldn’t. The minimum coalesced read is 16 threads (half warp).

So the question now is: Would I get coalesced access for an array of size (Nx, Ny)=(8, 160) with a block of size (16, 16)?

Sure, if you read two rows at once per half-warp, and you make sure the even rows are aligned properly in memory.
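
To make "two rows at once per half-warp" concrete, here is a minimal sketch of one possible mapping, written as plain host-side C++ so the arithmetic can be checked without a GPU. The name elementForThread and its parameters are made up for illustration; they are not from the programming guide. The idea is to flatten (tx, ty) into a linear index within the block and then re-split that index using the array width instead of the block width. It only covers a single block; offsetting by block index is left out.

```cpp
#include <cassert>

// Hypothetical helper: which array element a thread should touch so that
// consecutive threads read consecutive words, even when the array is
// narrower than the thread block.
struct Element {
    int row;     // row in the 2D array
    int col;     // column in the 2D array
    int offset;  // element offset from BaseAddress (row * arrayWidth + col)
};

Element elementForThread(int tx, int ty, int blockWidth, int arrayWidth) {
    assert(blockWidth > 0 && arrayWidth > 0);
    assert(tx >= 0 && tx < blockWidth && ty >= 0);

    // Linear position of the thread inside its block (x varies fastest).
    int linear = ty * blockWidth + tx;

    // Re-split the linear position using the array width.
    int row = linear / arrayWidth;
    int col = linear % arrayWidth;
    return {row, col, row * arrayWidth + col};
}
```

With blockWidth = 16 and arrayWidth = 8, the 16 threads that share a given ty cover array rows 2*ty and 2*ty+1, and thread tx lands on offset 16*ty + tx. Assuming 4-byte elements, each half-warp then reads one contiguous 64-byte run, which is why the start of every even row needs to be suitably aligned.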

I still don’t understand how I can make sure that each half-warp reads two rows. Do I have control over which threads go to each half-warp?

Read the programming guide section on how threads are indexed. It spells out very clearly how threads are assigned to warps by threadIdx.x, .y, and .z.
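
As a rough sketch of that rule: the thread ID within a block is x + y*Dx + z*Dx*Dy for a block of size (Dx, Dy, Dz), and warps are formed from consecutive thread IDs. The host-side C++ below just reproduces that arithmetic. The struct and function names are made up for illustration, and the warp size of 32 is the value quoted earlier in this thread.

```cpp
#include <cassert>

// Assumed constants: warp size of 32 as quoted in this thread.
constexpr int kWarpSize = 32;
constexpr int kHalfWarpSize = kWarpSize / 2;

// Hypothetical stand-ins for threadIdx and blockDim.
struct ThreadIdx { int x, y, z; };
struct BlockDim  { int x, y, z; };

// Linear thread ID within a block: x varies fastest, then y, then z.
int linearThreadId(ThreadIdx t, BlockDim d) {
    assert(t.x >= 0 && t.x < d.x);
    assert(t.y >= 0 && t.y < d.y);
    assert(t.z >= 0 && t.z < d.z);
    return t.x + t.y * d.x + t.z * d.x * d.y;
}

// Consecutive thread IDs are grouped into warps (and half-warps).
int warpIndex(ThreadIdx t, BlockDim d) {
    return linearThreadId(t, d) / kWarpSize;
}

int halfWarpIndex(ThreadIdx t, BlockDim d) {
    return linearThreadId(t, d) / kHalfWarpSize;
}
```

So you do not pick the half-warp directly; it follows from the block shape. With a (16, 16) block each half-warp is exactly one row of threads (a fixed ty), while with an (8, 32) block each half-warp spans two consecutive values of ty.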