I’ve tested my kernels and saw that the same code ran faster with 2D blocks than with 1D blocks, when the blocks have the same number of threads.
My question is: why? (if anyone has the answer)
I have not seen any difference in any of the kernels I’ve tested.
Are your kernels compute or memory bound?
They spend most of their time computing, with some memory accesses.
They follow the matrixMul example pattern: loading data into shared memory, computing the results, and writing them back to global memory.
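For reference, here is a minimal sketch of that pattern, in the style of the SDK matrixMul sample. It assumes square matrices whose width is a multiple of the 16x16 tile size; the kernel name and parameters are illustrative, not the poster’s actual code.

```cuda
#define TILE 16

// matrixMul-style pattern: each 16x16 thread block stages one tile of A and
// one tile of B in shared memory, accumulates a partial dot product, then
// writes the result to global memory.
// Assumes square matrices of width n, with n a multiple of TILE.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperative load: one element per thread into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + (t * TILE + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Partial dot product over this pair of tiles.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * n + col] = acc;  // write the result back to global memory
}
```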
Are you reading data from texture into shared memory? Since texture memory is optimized for 2D locality, the hardware architecture might give you better cache utilization with 2D blocks. Especially if you have several stages like reading into shared memory, computing, and writing back out.
Although I’m not sure if it should improve your writes to texture memory as well…
Yeah, I use texture memory. At first I did not notice any speedup when using 1D blocks, but with 2D blocks the hardware architecture is certainly involved. I haven’t tried to write back into texture memory, though; I use a global memory pointer instead.
Good idea to test, thanks for the suggestion…
Sorry clement, I meant to say global memory instead of texture memory, which is read-only (as you may already know) :blink:
My reasoning was that the texture cache would relieve the memory bus of reads, and therefore you would get better performance when writing to global memory.
Ok, thank you, dude!
Can you tell us how much performance improvement you get using 2D blocks instead of 1D blocks?
On a run with 65536 cells, it took 75 s instead of 90 s (17% faster). I haven’t tested with a higher number of cells (because other problems appeared), but I think a larger domain would give a bigger improvement.
I wonder if using 16x16 blocks causes fewer shared-memory bank access conflicts than 256x1 blocks…
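Whether conflicts occur depends on the indexing rather than the block shape itself. A hedged illustration, assuming G80-class hardware where shared memory has 16 banks of 32-bit words served per half-warp of 16 threads; both access patterns below are hypothetical, not taken from the kernels discussed above:

```cuda
// Shared-memory bank behaviour on G80-class hardware (16 banks, 32-bit
// words, half-warp of 16 threads).
__shared__ float tile[16][16];

// 16x16 block: within a half-warp, threadIdx.x runs 0..15, so consecutive
// threads read consecutive 32-bit words, i.e. 16 different banks.
// Conflict-free:
float a = tile[threadIdx.y][threadIdx.x];

// 256x1 block: if the flattened indexing happens to stride through shared
// memory by 16 floats, every thread of a half-warp lands in bank 0
// ((x * 16) % 16 == 0 for all x) -- a 16-way conflict, serialized 16x:
float b = ((float *)tile)[threadIdx.x * 16];
```

With the same contiguous indexing in both layouts, a 256x1 block would be just as conflict-free as a 16x16 one; the conflicts appear only if the 1D version introduces a stride that maps several threads of a half-warp to the same bank.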