I’m seeing effects that appear to be partition camping, in that the global memory bandwidth varies significantly with the access pattern (even though all the access patterns are fully coalesced). The strange thing is that the results are the complete opposite of the documented behaviour and of what I have experienced on other hardware. As I understand it, the Tesla T10 in the S1070 has 8 partitions, each with a width of 256 bytes. If I have a block of 512 threads each accessing successive 4-byte values, then each individual block distributes its accesses perfectly over all 8 partitions. This should be completely immune to partition camping and should achieve the full memory bandwidth regardless of the ordering of different blocks, but in fact I am only getting about 44 GB/s of bandwidth. On the other hand, if I have a block of 512 threads where threads within a warp access successive 4-byte values but all warps in the block access the same partition, then this should be very sensitive to the ordering of different blocks. And yet I get 60 GB/s of bandwidth. Does anyone have any kind of explanation?
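To make the two patterns concrete, here is a rough sketch of the partition arithmetic I have in mind. It assumes the simple model where the partition is (byte address / 256) mod 8; the real hardware mapping may well differ. The function names and the exact layout used for the second pattern (rows of 2048 bytes, each block covering a 128-byte column over 16 rows) are only illustrative, not my actual kernel.

```python
from collections import Counter

PARTITIONS = 8          # partition count as I understand it for the T10
PARTITION_WIDTH = 256   # partition width in bytes (same caveat)
WARP_SIZE = 32
BLOCK_SIZE = 512
WORD = 4                # bytes read by each thread
WARPS_PER_BLOCK = BLOCK_SIZE // WARP_SIZE   # 16

def partition_of(byte_address):
    # Assumed model: partitions are interleaved every 256 bytes.
    return (byte_address // PARTITION_WIDTH) % PARTITIONS

def contiguous_pattern(block, thread):
    # Pattern 1: the block reads 512 successive 4-byte words (2048 bytes).
    return (block * BLOCK_SIZE + thread) * WORD

def same_partition_pattern(block, thread):
    # Pattern 2 (hypothetical layout): each warp reads 128 contiguous bytes,
    # and successive warps of a block are 2048 bytes apart, so every warp
    # of the block lands in the same partition.
    warp, lane = divmod(thread, WARP_SIZE)
    row_bytes = PARTITIONS * PARTITION_WIDTH          # 2048
    chunks_per_row = row_bytes // (WARP_SIZE * WORD)  # 16 column chunks
    row_group, chunk = divmod(block, chunks_per_row)
    row = row_group * WARPS_PER_BLOCK + warp
    return row * row_bytes + chunk * WARP_SIZE * WORD + lane * WORD

def partitions_touched(pattern, block):
    # Count how many threads of one block hit each partition.
    return Counter(partition_of(pattern(block, t)) for t in range(BLOCK_SIZE))

if __name__ == "__main__":
    for b in range(4):
        print("contiguous     block", b, dict(partitions_touched(contiguous_pattern, b)))
        print("same partition block", b, dict(partitions_touched(same_partition_pattern, b)))
```

Under this model every block in the first pattern touches all 8 partitions equally, while every block in the second pattern touches exactly one, which is why I expected the first to be the faster and more robust of the two.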
Edit: Okay, so it turns out I’m probably not dealing with partition camping at all. Having studied the way the bandwidth varies with the dimensions of my problem, I think I’m actually dealing with TLB misses.