Why am I not seeing bank conflict effects on a GTX 285?

I’m testing the matrix transpose sample (transposeNew) on a GeForce GTX 285, and for some reason I’m not seeing the effects of bank conflicts. Is there an explanation? In fact, the no-bank-conflict code actually runs a bit slower.

Thanks

transposeNew-Outer-coalesced transpose , Throughput = 18.5979 GB/s, Time = 0.42008 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-coalesced transpose , Throughput = 20.8583 GB/s, Time = 0.37455 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-no bank conflict trans, Throughput = 18.9127 GB/s, Time = 0.41308 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transposeNew-Inner-no bank conflict trans, Throughput = 20.8786 GB/s, Time = 0.37419 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

Something else is wrong there. On a GTX 275 I get this:

transposeNew-Outer-coalesced transpose   , Throughput = 32.3944 GB/s, Time = 0.24117 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-coalesced transpose   , Throughput = 70.6671 GB/s, Time = 0.11055 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-no bank conflict trans, Throughput = 44.1011 GB/s, Time = 0.17715 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-no bank conflict trans, Throughput = 70.8977 GB/s, Time = 0.11019 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

Your GTX 285 has roughly 50% more memory bandwidth than a GTX 275. You should probably be seeing numbers above 100 GB/s for the no-conflict cases. What does the simple copy version of the transpose show on your card?

There is still the effect of partition camping in this example. I’ve reached around 120 GB/s on a simple copy (give or take a bit, depending on the version). Interestingly, I get better copy performance using textures (on the Tesla, a coalesced copy gets me 75 GB/s; with textures, 110 GB/s).
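For anyone following along, partition camping in this sample can be sketched with a toy address model. The figures are assumptions, not anything measured in this thread: 8 memory partitions of 256 bytes each (as usually described for GT200-class parts), a 1024x1024 float matrix (inferred from the 1048576-element size), 16x16 tiles, and 32 concurrently active blocks issued in blockIdx.x order. The function names are hypothetical.

```python
PARTITIONS = 8          # assumption: GT200-class global memory partitions
PARTITION_BYTES = 256   # assumption: width of one partition
TILE = 16               # assumption: 16x16 tile per thread block
WIDTH = 1024            # assumption: 1024x1024 matrix (1048576 elements)
BYTES_PER_FLOAT = 4


def partition(tile_x, tile_y):
    """Partition holding the first element of tile (tile_x, tile_y)."""
    offset = (tile_y * TILE * WIDTH + tile_x * TILE) * BYTES_PER_FLOAT
    return (offset // PARTITION_BYTES) % PARTITIONS


def output_partitions(diagonal, active_blocks=32, block_y=0):
    """Partitions written by a run of blocks with consecutive blockIdx.x."""
    grid = WIDTH // TILE
    used = set()
    for block_x in range(active_blocks):
        if diagonal:
            # diagonal reordering of block indices
            in_x, in_y = (block_x + block_y) % grid, block_x
        else:
            # plain Cartesian mapping
            in_x, in_y = block_x, block_y
        # the input tile (in_x, in_y) is written to output tile (in_y, in_x)
        used.add(partition(in_y, in_x))
    return used
```

Under these assumptions, the Cartesian mapping sends all 32 output tiles to a single partition, while the diagonal mapping spreads them across all 8. Real block scheduling is more complicated than this, so treat it only as an illustration of why the writes can pile up.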

I’ve also rewritten the code myself with the same result. I tested it on a laptop with an NVS 140M, which does show bank conflict effects, and on Linux with a Tesla S1070, which doesn’t. So as far as I can tell, it’s something in the GT200/T10 architecture and not the code. It just goes against all the documentation and claims, so I don’t understand what’s happening (unless scheduling is somehow able to hide bank conflicts).
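As a reference point for what the padded tile is supposed to fix, here is a small host-side model of shared memory banking. It assumes the compute capability 1.x layout (16 banks of 32-bit words, conflicts resolved per half-warp of 16 threads) and a 16-wide tile. The function name is hypothetical, and this only counts conflicts; it says nothing about how much time they cost.

```python
from collections import Counter

NUM_BANKS = 16   # assumption: compute capability 1.x shared memory
HALF_WARP = 16   # assumption: conflicts are evaluated per half-warp


def conflict_degree(row_pitch, column=0):
    """Largest number of half-warp threads hitting one bank when
    thread t reads tile[t][column] from a tile with the given row pitch
    (in 32-bit words)."""
    banks = Counter(
        (t * row_pitch + column) % NUM_BANKS for t in range(HALF_WARP)
    )
    return max(banks.values())
```

With a pitch of 16 words (tile[16][16]) the column read is a 16-way conflict; with a pitch of 17 (tile[16][17]) every thread lands in a different bank, which is what the "no bank conflict" variant relies on.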

The full output is:

transposeNew-Outer-simple copy , Throughput = 77.7585 GB/s, Time = 0.10047 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-simple copy , Throughput = 113.9901 GB/s, Time = 0.06854 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-shared memory copy , Throughput = 51.1425 GB/s, Time = 0.15276 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-shared memory copy , Throughput = 96.5273 GB/s, Time = 0.08094 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-naive transpose , Throughput = 2.5903 GB/s, Time = 3.01609 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-naive transpose , Throughput = 2.6728 GB/s, Time = 2.92296 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-coalesced transpose , Throughput = 18.5979 GB/s, Time = 0.42008 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-coalesced transpose , Throughput = 20.8583 GB/s, Time = 0.37455 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-no bank conflict trans, Throughput = 18.9127 GB/s, Time = 0.41308 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-no bank conflict trans, Throughput = 20.8786 GB/s, Time = 0.37419 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-coarse-grained , Throughput = 18.9515 GB/s, Time = 0.41224 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-coarse-grained , Throughput = 20.8824 GB/s, Time = 0.37412 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-fine-grained , Throughput = 81.3469 GB/s, Time = 0.09604 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-fine-grained , Throughput = 98.6893 GB/s, Time = 0.07916 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Outer-diagonal transpose , Throughput = 29.7126 GB/s, Time = 0.26294 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

transposeNew-Inner-diagonal transpose , Throughput = 109.0551 GB/s, Time = 0.07164 s, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
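A note on reading these lines: the figures are consistent with the throughput being computed as one read plus one write of the whole matrix per pass, reported in units of 2^30 bytes, with the Time column behaving like milliseconds per pass despite the "s" label. That is inferred from the numbers above rather than from the sample's source, and the helper name below is hypothetical.

```python
BYTES_PER_FLOAT = 4
GIB = 2 ** 30


def throughput_gib_per_s(elements, time_ms):
    """Effective bandwidth assuming each element is read once and written once."""
    bytes_moved = 2 * elements * BYTES_PER_FLOAT
    return bytes_moved / GIB / (time_ms / 1000.0)
```

For example, 1048576 elements at 0.42008 gives about 18.60, matching the Outer coalesced transpose line to within rounding.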

Solved the problem: the bank conflicts were hiding behind partition camping. I introduced bank conflicts into the version with no partition camping, and then I saw their effect.