Partition camping?

Can anybody shed some more light on this concept of partition camping…

http://cs.anu.edu.au/systems/GPUWksp/PDFs/…ngCUDA_full.pdf

Effective Bandwidth (GB/s), 2048x2048, GTX 280
Simple Copy : 96.9
Shared Memory Copy : 80.9
Naïve Transpose : 2.2
Coalesced Transpose : 16.5
Bank Conflict Free Transpose : 16.6
Diagonal : 69.5 :o

It seems logical going by the description, but…
the results reported here are reproducible only for the configuration mentioned, and only on the Tesla C1060…
Can anyone else confirm that (with a different config/device)?

thanks…

Alex Dubinsky did some interesting work optimizing memory bandwidth, and posted some results here. Bottom line: spread the accesses evenly among the channels (partitions) to maximize throughput.
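
To make "spread the accesses evenly among the partitions" a bit more concrete, here is a rough host-side C++ model (not GPU code) of which partition a tile of a row-major float matrix would start in. It assumes 8 partitions, each 256 bytes wide and assigned round-robin by address, which are the figures usually quoted for GT200-class cards such as the GTX 280; other cards differ. The names partitionOf, tileStartOffset and distinctPartitions are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed memory layout (illustrative, not queried from hardware):
// 8 partitions, each 256 bytes wide, assigned round-robin by address.
constexpr std::size_t kNumPartitions = 8;
constexpr std::size_t kPartitionWidthBytes = 256;

// Partition that serves a given byte offset into global memory.
std::size_t partitionOf(std::size_t byteOffset) {
    return (byteOffset / kPartitionWidthBytes) % kNumPartitions;
}

// Byte offset of the first element of tile (tileX, tileY) in a row-major
// float matrix with `width` columns, split into tileDim x tileDim tiles.
std::size_t tileStartOffset(std::size_t tileX, std::size_t tileY,
                            std::size_t width, std::size_t tileDim) {
    return (tileY * tileDim * width + tileX * tileDim) * sizeof(float);
}

// Number of distinct partitions hit by `count` tiles walked along a row
// (alongRow == true) or down a column (alongRow == false) from tile (0, 0).
std::size_t distinctPartitions(bool alongRow, std::size_t count,
                               std::size_t width, std::size_t tileDim) {
    assert(tileDim > 0 && width % tileDim == 0);
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < count; ++i) {
        std::size_t tx = alongRow ? i : 0;
        std::size_t ty = alongRow ? 0 : i;
        seen.insert(partitionOf(tileStartOffset(tx, ty, width, tileDim)));
    }
    return seen.size();
}
```

Under these assumptions, for a 2048-wide matrix with 32x32 tiles, distinctPartitions(true, 16, 2048, 32) gives 8 (tiles along a row cover every partition), while distinctPartitions(false, 16, 2048, 32) gives 1 (tiles down a column all start in partition 0). The second case is the access pattern that camps on one partition.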

Hi there,

I came up against this issue when writing some code that accessed a matrix in a block-columnwise manner. Runtimes changed by over a factor of two for certain matrix sizes even though all memory operations were coalesced, and different cards had the slowdowns for different matrix sizes corresponding to the width of their memory interface. Changing the code to act in a block-rowwise manner mitigated the effect.

See the discussion in this thread.

Best,
Steven.

Hi Steven,

Thanks a lot for that, this is exactly the problem I was stuck on: optimal memory access patterns for global memory.

But I guess the basic thing to do is to make sure that multiple threads do not access the same memory partition simultaneously; in that case I presume the accesses are serialized (causing the slowdown…). But the memory architecture differs between cards, so coming up with a generic solution to avoid this is not that easy. In my kernel I used the diagonal-ordered access, which makes sure that both the simultaneous reads and the simultaneous writes of the multiple threads are spread across the different partitions.
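
For anyone trying to picture the diagonal ordering, here is a small host-side C++ sketch (again not GPU code) of one diagonal remap for a square grid, and of which partitions the transpose writes would land in. Several things are assumptions: 8 partitions of 256 bytes assigned round-robin, a square float matrix, blocks dispatched in launch order with x varying fastest, and the remap itself, which may not match the linked code exactly. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed layout (illustrative): 8 partitions, 256 bytes wide, round-robin.
constexpr std::size_t kNumPartitions = 8;
constexpr std::size_t kPartitionWidthBytes = 256;

struct BlockCoord {
    std::size_t x;
    std::size_t y;
};

// Diagonal remap for a square grid of gridDim x gridDim blocks: launched
// block (x, y) works on the tile (x + y) % gridDim across and x down.
BlockCoord diagonalRemap(BlockCoord launched, std::size_t gridDim) {
    assert(launched.x < gridDim && launched.y < gridDim);
    return BlockCoord{(launched.x + launched.y) % gridDim, launched.x};
}

// Partition holding the first float of tile (tileX, tileY) in a row-major
// float matrix with `width` columns and tileDim x tileDim tiles.
std::size_t partitionOfTile(std::size_t tileX, std::size_t tileY,
                            std::size_t width, std::size_t tileDim) {
    std::size_t byteOffset =
        (tileY * tileDim * width + tileX * tileDim) * sizeof(float);
    return (byteOffset / kPartitionWidthBytes) % kNumPartitions;
}

// In a transpose, the block working on input tile (x, y) writes output tile
// (y, x). Count the distinct partitions written by the first `count` blocks
// of launch row 0, with or without the diagonal remap.
std::size_t distinctWritePartitions(bool diagonal, std::size_t count,
                                    std::size_t width, std::size_t tileDim) {
    assert(tileDim > 0 && width % tileDim == 0);
    const std::size_t gridDim = width / tileDim;
    assert(count <= gridDim);
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < count; ++i) {
        BlockCoord b{i, 0};
        if (diagonal) b = diagonalRemap(b, gridDim);
        // Output tile is (b.y, b.x): column b.y, row b.x.
        seen.insert(partitionOfTile(b.y, b.x, width, tileDim));
    }
    return seen.size();
}
```

With a 2048x2048 matrix and 32x32 tiles, this model says the first 16 blocks write into a single partition with the plain ordering and into all 8 partitions with the diagonal ordering. It says nothing about cards with a different partition count or width, which would need different constants.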

But even with this I was not able to see an improvement on certain cards (Tesla C870), whereas on the C1060 the slowdown is greatly mitigated. I was hoping someone could confirm this and maybe explain why it is happening. It can be seen clearly with the diagonalized transpose code (available at the link in the above post) when run on the C870; if anyone could check it out it would be great :thumbup:

But, surprisingly, there is no mention of this issue in any of the NVIDIA literature or even the programming guide. It would be great if NVIDIA could include this in their next version; that would relieve us of this guesswork… :rolleyes:

thanks…

Thanks for posting. I too can confirm this. Simply changing the grid size from (64 x 1024) to (1024 x 64) increased performance for me by a factor of 2.5 for a particular kernel on the Tesla C1060. Block-columnwise vs. block-rowwise does make a difference! In the block-columnwise implementation, all the multiprocessors are trying to access the same global memory partition. This “partition camping” problem is worse for GPUs with a large number of multiprocessors (e.g. Tesla).

Partition camping was almost negligible for the GT 240 for this particular kernel.

Moral of the story: arrange your thread blocks in a row-wise manner.

It's very much the same issue as with your common bank conflicts in shared memory.

TransposeNew in the SDK has some great documentation on it.