Hey guys, I have a question about the dimensions and size of the block, i.e. Db in <<<Dg, Db, Ns>>>.
In my program, I get better performance when I allocate more threads in the x and y directions than when I allocate more in z.
For example, (16,16,2) is much faster than (2,2,16).
Yeah, sorry, that was a typo: it is (16,16,2) vs. (2,16,16). A better example might be that (64,2,2) is much faster than (2,2,64). What exactly do you mean by using thread indices in the kernel? I am dealing with 3D input streams: I get their indices from the respective directions, put them into a 1D array, and then execute.
How do you combine the x, y, z thread indices into the 1D array index? I'm wondering if there is something about the indexing that is breaking coalescing.
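To make the question concrete, here is a rough sketch of the kind of indexing I have in mind. I don't know what your kernel actually does, so everything here is an assumption: the names `flatten`, `scale`, and `segmentsTouchedByFirstWarp` are made up, the data is assumed to be `float` with a 64x64 extent in x and y, the array is assumed to be 128-byte aligned, and x is assumed to be the fastest-varying dimension in the 1D array. The one thing that is not an assumption is how threads are numbered inside a block (x first, then y, then z), with each run of 32 consecutive threads forming a warp. The host function just counts how many 128-byte segments the first warp of block (0,0,0) would touch for a given block shape.

```cpp
#include <cassert>
#include <set>

// Lets the same file build with nvcc or with a plain host C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical flattening: x varies fastest, then y, then z.
__host__ __device__ inline int flatten(int x, int y, int z, int nx, int ny) {
    return (z * ny + y) * nx + x;
}

#ifdef __CUDACC__
// Hypothetical kernel: each thread touches one element of a 3D volume.
__global__ void scale(float* data, int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz) {
        data[flatten(x, y, z, nx, ny)] *= 2.0f;
    }
}
#endif

// Host-side model, not a measurement: counts the distinct 128-byte
// segments (32 floats each) read by the first warp of block (0,0,0).
// Threads in a block are linearized as tid = x + y*bx + z*bx*by.
int segmentsTouchedByFirstWarp(int bx, int by, int bz, int nx, int ny) {
    assert(bx > 0 && by > 0 && bz > 0);
    std::set<int> segments;
    int threads = bx * by * bz;
    for (int tid = 0; tid < 32 && tid < threads; ++tid) {
        int x = tid % bx;
        int y = (tid / bx) % by;
        int z = tid / (bx * by);
        segments.insert(flatten(x, y, z, nx, ny) / 32);
    }
    return static_cast<int>(segments.size());
}
```

Under those assumptions, with `nx = ny = 64`, a (64,2,2) block has its first warp reading 1 segment and (16,16,2) reads 2, while (2,16,16) and (2,2,64) each read 16. If your indexing looks anything like this, that alone could explain the gap you are seeing, but I would need to see your actual index calculation to say for sure.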