It seems logical going by the description, but…
the results reported here are reproducible only for the mentioned configuration, and only on the Tesla C1060…
can anyone else confirm this (with a different config/device)?
I came up against this issue when writing some code that accessed a matrix in a block-columnwise manner. Runtimes changed by over a factor of two for certain matrix sizes even though all memory operations were coalesced, and different cards had the slowdowns for different matrix sizes corresponding to the width of their memory interface. Changing the code to act in a block-rowwise manner mitigated the effect.
Thanks a lot for that, this is exactly the problem I was stuck on - optimal memory access patterns for global memory.
But I guess the basic thing that has to be done is to make sure that multiple threads do not access the same memory partition simultaneously; in that case I presume the accesses are serialized (causing the slowdown…). But the memory architecture is different for different cards, so coming up with a generic solution to avoid this is not that easy. In my kernel I used diagonally-ordered access, which makes sure that both the simultaneous reads and the simultaneous writes of the multiple threads happen across different partitions.
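The diagonal ordering is the same trick used in NVIDIA's matrix-transpose example. Here is a minimal sketch of the block-index remapping, written as plain C so the mapping can be checked on the host (in a kernel the inputs would be blockIdx.x, blockIdx.y and gridDim.x; the function name is just for illustration, and a square grid is assumed):

```c
/* Diagonal reordering of block indices for a square grid: block
 * (bx, by) is remapped so that blocks launched at about the same time
 * walk along a diagonal of the matrix, and therefore touch different
 * memory partitions instead of marching down one column of tiles. */
void diagonal_remap(int bx, int by, int grid_w,
                    int *tile_x, int *tile_y)
{
    *tile_y = bx;                 /* the diagonal this block belongs to */
    *tile_x = (bx + by) % grid_w; /* position along that diagonal      */
}
```

With a 4x4 grid, the blocks with by = 0 map to tiles (0,0), (1,1), (2,2), (3,3) - one tile per diagonal step, which is what spreads the traffic across partitions.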
But even with this I was not able to see the improvement on certain cards (Tesla C870), whereas on the C1060 this slowdown is greatly mitigated. I was hoping someone could confirm this and maybe explain why it happens. It can be seen clearly in the diagonalized transpose code (available at the link in the above post) when run on the C870 - if anyone could check it out, that would be great.
But surprisingly there is no mention of this issue in any of the NVIDIA literature, or even the programming guide. It would be great if NVIDIA could include it in the next version; that would relieve us of this guess-work… :rolleyes:
Thanks for posting. I too can confirm this. Simply changing a grid size from (64 x 1024) to (1024 x 64) increased performance for me by a factor of 2.5 for a particular kernel on the Tesla C1060. Block-columnwise vs. block-rowwise does make a difference! All the multiprocessors are trying to access the same global memory partition in the block-columnwise implementation. This “partition camping” problem is worse for GPUs with a large number of multiprocessors (e.g. Tesla).
Partition camping was almost negligible for the GT 240 for this particular kernel.
Moral of the story: arrange your thread blocks in a row-wise manner.
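Rough arithmetic behind the camping, assuming the GT200-class figures from NVIDIA's transpose writeup - global memory interleaved across partitions in 256-byte chunks, with 8 partitions on a C1060 (both numbers are device-specific assumptions, and the helper name is just for illustration):

```c
/* Which partition a global-memory byte address falls in, assuming
 * addresses are interleaved across partitions in 256-byte chunks and
 * there are num_partitions partitions (e.g. 8 on a Tesla C1060; both
 * figures vary by device and are assumptions here). */
int partition_of(long byte_addr, int num_partitions)
{
    return (int)((byte_addr / 256) % num_partitions);
}
```

For a matrix 2048 floats wide, consecutive rows start 8192 bytes apart; 8192 / 256 = 32 chunks, and 32 % 8 = 0, so every row of a tile column begins in the same partition. Blocks stacked down a grid column therefore all hammer one partition, while blocks laid out along a grid row spread across all eight - which is exactly the row-wise arrangement recommended above.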