Execution configuration question

Hey guys, I have a question about the dimensions and size of the block, i.e. Db in <<<Dg, Db, Ns>>>.

In my program, when I allocate more threads in the x and y directions I get better performance than when I give more to z.
For example, (16,16,2) is much faster than (2,2,16).

Could somebody explain the reasoning behind this?

Thanks in advance

How are you using the thread indices in your kernel? One configuration might be looping over memory in a different order than the other.

Also, your two example cases have different numbers of threads. Do you mean (16,2,2) compared to (2,2,16)?

Yeah, sorry, that was a typo. It is (16,16,2) & (2,16,16), or a better example could be (64,2,2), which is much faster than (2,2,64). What exactly do you mean by using thread indices in the kernel? I am dealing with 3D input streams: I get their indices from the respective directions, put them in a 1D array, and then execute.

Thanks

How do you combine the x,y,z thread indices into the 1D array index? I’m wondering if there is something about the indexing which is breaking coalescing.
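To make the coalescing question concrete, here is a rough sketch. It assumes the common x-fastest linearization idx = x + nx*(y + ny*z), which may not be what your code does. The names scale3D, linearIndex and segmentsTouchedByFirstWarp are made up for illustration, and the 64x64x64 float volume and 128-byte segment size are assumptions too. The one thing that is not an assumption: threads in a block are numbered with x fastest, then y, then z, and a warp is 32 consecutive threads in that numbering. The host helper counts how many 128-byte segments the first warp of a block would touch, assuming the array starts on a 128-byte boundary.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical linearization of a 3D index: x varies fastest in memory.
__host__ __device__ inline int linearIndex(int x, int y, int z, int nx, int ny)
{
    return x + nx * (y + ny * z);
}

// Hypothetical kernel: each thread scales one element of an nx*ny*nz volume.
__global__ void scale3D(float *data, int nx, int ny, int nz, float factor)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz) {
        data[linearIndex(x, y, z, nx, ny)] *= factor;
    }
}

// Host-side estimate: how many distinct 128-byte segments do the 32 threads
// of the first warp of block (0,0,0) touch when running scale3D?
// Assumes data is aligned to 128 bytes and the block has at least 32 threads.
int segmentsTouchedByFirstWarp(dim3 block, int nx, int ny)
{
    assert(block.x * block.y * block.z >= 32);
    int seen[32];
    int count = 0;
    for (int t = 0; t < 32; ++t) {
        // Recover (threadIdx.x, threadIdx.y, threadIdx.z) from the flat
        // thread number: x fastest, then y, then z.
        int tx = t % block.x;
        int ty = (t / block.x) % block.y;
        int tz = t / (block.x * block.y);
        int segment = (int)(linearIndex(tx, ty, tz, nx, ny) * sizeof(float) / 128);
        bool found = false;
        for (int i = 0; i < count; ++i) {
            if (seen[i] == segment) { found = true; break; }
        }
        if (!found) seen[count++] = segment;
    }
    return count;
}
```

With nx = ny = 64, segmentsTouchedByFirstWarp(dim3(64, 2, 2), 64, 64) gives 1, while segmentsTouchedByFirstWarp(dim3(2, 2, 64), 64, 64) gives 16. So under this indexing, the (2,2,64) block spreads each warp's loads over many more memory segments, which would be consistent with the slowdown you are seeing. If your indexing is different, the numbers will be different.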