Bizzare S1070 performance

I’ve been having some strange performance issues with kernels on Tesla C1060 cards (part of S1070 units in a cluster - the Lincoln supercomputer at NCSA). The same kernels perform nice and consistently on a different server on both a Tesla C870 and a GTX260. I don’t really see anything in my code that explains why it performs fine on these cards, but oddly on the C1060. I realize there are many many variables here and it’s difficult to compare the systems - but I’m stumped and cannot explain the results that I am observing. Both are 64-bit linux with gcc 4.2.4 and 2.3 of the CUDA SDK.

The trend should be quadratic as the clusters increase. The C870 and GTX260 running the same kernel on the same input data haven nice smooth curves, whereas the C1060 is erratic, with particularly large spikes at 128 and 256

External Media

I realize the S1070 is a NUMA device and I’ve read about bandwidth issues and processor affinity issues, so I could possibly understand if I was seeing erratic performance in D->H or H->D memcpying. Can that possibly effect global memory access times within a kernel on a single device? I had assumed that each C1060 GPU had its own private 4GB of global memory so I don’t really see how that would be the case. There are 2 devices from each S1070 connected to each compute node and I tried both of the devices, but they exhibit the same behavior.

The kernel giving me problems is very simple. It is computing fuzzy memberships in a clustering algorithm based on distances in a large contiguous float array and saving them to global memory. It is heavily dependent on global memory access. It’s launched with a dim3(NUM_CLUSTERS,NUM_EVENTS/NUM_THREADS) grid.


global void ComputeMembershipMatrix2(float* distances, float* memberships) {

float membershipValue;

int i = blockIdx.y * blockDim.x + threadIdx.x;

if(i < NUM_EVENTS) {

    membershipValue = MembershipValueGPU(blockIdx.x, i, distances);

    memberships[blockIdx.x*NUM_EVENTS+i] = membershipValue*membershipValue;



device float MembershipValueGPU(int clusterIndex, int eventIndex, float* distanceMatrix){

float myClustDist = 0.0f;

myClustDist = distanceMatrix[clusterIndex*NUM_EVENTS+eventIndex];

float sum = 0.0f;

float otherClustDist;

for(int j = 0; j< NUM_CLUSTERS; j++){

    otherClustDist = distanceMatrix[j*NUM_EVENTS+eventIndex];

    sum += pow((myClustDist/otherClustDist),(2/(FUZZINESS-1)));


return 1.0f/sum;



On C1060:

With NUM_CLUSTERS = 124:

method=[ Z24ComputeMembershipMatrix2PfS ] gputime=[ 703230.750 ] cputime=[ 703232.000 ] occupancy=[ 1.000 ] gld_32b=[ 0 ] gld_64b=[ 50790000 ] gst_32b=[ 0 ] gst_64b=[ 406320 ]

With NUM_CLUSTERS = 128:

method=[ Z24ComputeMembershipMatrix2PfS ] gputime=[ 1321734.250 ] cputime=[ 1321726.125 ] occupancy=[ 1.000 ] gld_32b=[ 0 ] gld_64b=[ 54105696 ] gst_32b=[ 0 ] gst_64b=[ 419424 ]

On GTX260:

With NUM_CLUSTERS = 124:

method=[ Z24ComputeMembershipMatrix2PfS ] gputime=[ 677820.812 ] cputime=[ 677801.000 ] occupancy=[ 1.000 ] gld_32b=[ 0 ] gld_64b=[ 63488000 ] gst_32b=[ 0 ] gst_64b=[ 507904 ]

With NUM_CLUSTERS = 128:

method=[ Z24ComputeMembershipMatrix2PfS ] gputime=[ 722165.250 ] cputime=[ 722145.000 ] occupancy=[ 1.000 ] gld_32b=[ 0 ] gld_64b=[ 67633152 ] gst_32b=[ 0 ] gst_64b=[ 524288 ]

Both show comparable changes in the number of memory accesses.

I also checked warp_serialize and divergent_branch counters - they’re all 0

Yet the C1060 has a huge jump in execution time…

Normally with a roughly factor of 2 increase in execution time like that I’d suspect an alignment issue with global memory access - but wouldn’t I see double the number of gld_64b’s (or some gld_32b) if that was the case?

The number of events is a nice even number ( 2^19 in this case), so there should not be any alignment issues for coalesced memory access as far as I know. The number of threads is 256.

I’m beginning to figure its not code related and something outside my control but I really don’t have any idea how to explain it. Just looking for some thoughts/opinions/ideas of what else to check.

Sorry for the wall of text :">


This smells like partition camping.


Ah, THANK YOU very much - that did seem to be the problem.

I tried the diagonalize solution in that paper but it didn’t cause any improvement. I think that might be because each block actually accesses an entire 256 columns of the matrix, rather than just a 256x256 block.

The way I had my grid setup, the blocks were traversing chunks of the matrix in a column-major fashion, with each block accessing the same 256 columns (1024 bytes so 4 partitions I guess). Simply switching the two dimensions of my grid seemed to alleviate the problem.