Global memory coalescing: poor write to global memory

Hi guys, I have a pretty simple kernel that calculates the Gaussian probability of points with respect to a cluster, but I am getting very poor performance when it comes to writing to global memory.


__global__ void calculateGauss(float *points, float *clusterMean, float *ownershipCluster)
{
    // evaluate distance between point and cluster mean
    // load mean and variance etc. to shared memory
    // evalGauss is a device function that calculates the prob

    unsigned int tid = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

    float gauss = evalGauss(distance, clusterVariance);

    // ownershipCluster is an array of floats that I have declared in global memory
    ownershipCluster[tid] = gauss;
}


I am running this over 30720 points (60 blocks of 512 threads), which means I am writing 30720 * 4 bytes = 122,880 bytes (~120 KB) to global memory. However, this kernel takes over 0.1 seconds to execute on a GTX285, with nearly all of the time being in the “ownershipCluster[tid] = gauss;” line where I am writing to global memory.

Conversely, if I run the same kernel but change that line to ‘ownershipCluster[tid] = 3.3f’ or any other hard-coded float, the time drops to 0.005 seconds. The programming guide states that to improve performance we need to use 32-, 64- or 128-byte memory transactions. Does this mean I should be copying 8, 16 or 32 floats at once? If so, why is writing a hard-coded float such as 3.3f so fast? Thanks a lot.

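This has nothing to do with memory access. Rather, the compiler is clever enough to optimize away the whole call to evalGauss() if the result is not used anywhere. With a hard-coded value the kernel no longer computes the Gaussian at all, so the 0.005 seconds is roughly the cost of the launch and the write, and the rest of the 0.1 seconds is the computation itself.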
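If you want to see the effect for yourself, here is a minimal sketch that times both variants with CUDA events. Note the assumptions: evalGaussSketch() is a hypothetical stand-in, since the real evalGauss() is not shown in the post, and the kernel names, the scalar mean/variance parameters and the all-zero input are made up for illustration. Also, the write pattern out[tid] with consecutive tid is already the coalesced case (consecutive threads write consecutive 4-byte floats), so there is no need to group 8, 16 or 32 floats by hand.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the poster's evalGauss(); the real one is not shown.
__device__ float evalGaussSketch(float distance, float variance)
{
    return expf(-0.5f * distance * distance / variance);
}

// Stores the computed value, so the compiler has to keep the computation.
__global__ void storeComputed(const float *points, float mean, float variance,
                              float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float gauss = evalGaussSketch(points[tid] - mean, variance);
        out[tid] = gauss;
    }
}

// Stores a constant: gauss is never used, so the compiler is free to drop
// the call and the load that feeds it.
__global__ void storeConstant(const float *points, float mean, float variance,
                              float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        float gauss = evalGaussSketch(points[tid] - mean, variance);
        (void)gauss;
        out[tid] = 3.3f;
    }
}

// Launches one of the two kernels and returns the elapsed time in milliseconds.
static float timeLaunch(bool computed, const float *dPoints, float *dOut, int n)
{
    const int threads = 512;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (computed)
        storeComputed<<<blocks, threads>>>(dPoints, 0.0f, 1.0f, dOut, n);
    else
        storeConstant<<<blocks, threads>>>(dPoints, 0.0f, 1.0f, dOut, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 30720;  // same point count as in the post
    float *dPoints = NULL, *dOut = NULL;
    cudaMalloc(&dPoints, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemset(dPoints, 0, n * sizeof(float));  // all-zero input is enough for timing

    timeLaunch(true, dPoints, dOut, n);  // warm-up launch, result discarded
    printf("store computed value: %.4f ms\n", timeLaunch(true, dPoints, dOut, n));
    printf("store constant 3.3f:  %.4f ms\n", timeLaunch(false, dPoints, dOut, n));

    cudaFree(dPoints);
    cudaFree(dOut);
    return 0;
}
```

The stand-in here is much cheaper than a real kernel that also loads means and variances into shared memory, so the gap you measure with it may be small. To check what the compiler actually kept, you can also compile with nvcc -ptx and compare the generated code for the two kernels.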