Fermi and partition camping

Is partition camping an issue on Fermi cards? I’ve noticed that for the matrix transpose example in the SDK, there’s not much of a performance difference between the coarse-grained and fine-grained transposes.

No, Fermi changed the design of the memory controllers and should not be affected by partition camping. Hooray!

Does anyone have a technical explanation for this?

Because any way you put it, each memory controller has a bandwidth limit, and if ONE controller ends up serving all the memory accesses there should be a huge penalty, unless the partitioning is done dynamically and at fine granularity (which seems really improbable).

I can’t really go into details, but there is no longer a linear mapping between addresses and partitions, so typical access patterns are unlikely to all fall into the same partition.

If you are curious about how this can be done, you can look at papers from the 80s and 90s about memory interleaving schemes, such as skewing or randomized interleaving…

I know some of those schemes, but they only handle typical access patterns; they don’t guarantee that partition camping can’t occur in some particular case!

Avoiding it in the typical case is relatively easy, but claiming the hardware is immune is wrong, from my point of view, unless you dynamically MOVE memory contents and remap them based on a load analysis of each memory controller. That’s my point.

It seems that Fermi is still affected by partition camping, although to a lesser degree. Here is the performance I get in transpose:

Interesting…

Is this for 32-bit data, 1 output per thread? That would suggest that conflicts appear at a stride of 1536 B. (That matches GT200-like partitions nicely, each handling 256 B blocks…)

Did you try with other cache configurations (16K/48K) or caching policies (-Xptxas -dlcm…)? Just to be sure to rule out any cache-related effect…

It is for 32-bit data, 4 outputs per thread. But I see the same effect when computing 1 output per thread and when compiling with “-Xptxas -dlcm=cg”.

Hello guys,

these posts are from August 2010; has this issue been resolved since, i.e. does partition camping still happen on the Fermi architecture or not? :)