Warp partitioning and scheduling for a two-dimensional grid


To optimize the memory access of my kernels, I would like to understand how warps are scheduled on the SMs.

In the Programming Guide it says:

My question now is:

Suppose I invoke a two dimensional kernel execution.

What does “consecutive, increasing thread IDs” mean? Consecutive in which dimension?

I need to know this because I want to work out which thread IDs are scheduled to run concurrently, so that I can optimize the memory accesses.
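
To make the question concrete, here is a minimal sketch of the mapping I am asking about. It assumes the linearization the Programming Guide gives for a two-dimensional block of size (Dx, Dy), where the thread with index (x, y) has thread ID x + y * Dx, and it assumes a warp size of 32. The names `Dim2`, `linear_thread_id`, `warp_index`, and `print_warp_layout` are hypothetical helpers, written as host-side C++ so the arithmetic can be checked without a GPU; in a kernel the same expressions would use `threadIdx`, `blockDim`, and `warpSize`.

```cpp
#include <cstdio>

// Assumed warp size; inside a kernel the built-in warpSize would be used instead.
constexpr int kWarpSize = 32;

// Hypothetical stand-in for the x/y parts of threadIdx or blockDim.
struct Dim2 {
    int x;
    int y;
};

// Linear thread ID within a 2D block, assuming x varies fastest:
// id = x + y * Dx.
int linear_thread_id(Dim2 idx, Dim2 block_dim) {
    return idx.x + idx.y * block_dim.x;
}

// Index of the warp (within the block) that the thread would belong to
// if warps are formed from consecutive linear thread IDs.
int warp_index(Dim2 idx, Dim2 block_dim) {
    return linear_thread_id(idx, block_dim) / kWarpSize;
}

// Print the mapping for every thread of one block, row by row.
void print_warp_layout(Dim2 block_dim) {
    for (int y = 0; y < block_dim.y; ++y) {
        for (int x = 0; x < block_dim.x; ++x) {
            const Dim2 idx{x, y};
            std::printf("thread (%d,%d) -> id %d, warp %d\n",
                        x, y,
                        linear_thread_id(idx, block_dim),
                        warp_index(idx, block_dim));
        }
    }
}
```

If this is the right reading, then calling `print_warp_layout({16, 4})` for a 16x4 block would put rows y = 0 and y = 1 into warp 0 and rows y = 2 and y = 3 into warp 1, so a warp would span two full rows in x rather than a column in y.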