Ladies & Gentlemen around here,

I came across very peculiar behaviour which I cannot really understand. I need to implement a reduction of a large data set (1000x1000 or more). There is an excellent example of the reduction in CUDA SDK which I used as a starting point.

However, in my case, I need to calculate a maximal value, instead of the sum, of a function which depends on the 4 variables. These 4 variables are stored in float4 format in the data-set.

In other words, I need to find

value = max(f(xi,yi,zi,wi)), where i-ranges for all elements in data-set.

I used reduce6 kernel from reduction example from CUDA_SDK (cuda-sdk/projects/reudction/reduction_kernel.cu)

The changes are simple, and everywhere I replaced â€śx += yâ€ť to â€śx = fmaxf(x,y)â€ť. Another change was in the following loop:

// we reduce multiple elements per thread. The number is determined by the

// number of active thread blocks (via gridSize). More blocks will result

// in a larger gridSize and therefore fewer elements per thread

while (i < n)

{

sdata[tid] += g_idata[i] + g_idata[i+blockSize];

i += gridSize;

}

__syncthreads();

I replaced it with

while (i < n) {

float4 el = f4_idata[i];

float v = f(el.x, el.y, el.z, el.w);

el = f4_idata[i+blockSize];

v = fmax(v, f(el.x, el.y, el.z, el.w);

sdata[tid] = fmaxf(sdata[tid], v);

i += gridSize;

}

__syncthreads();

It works as expected. I was expecting that coalescing would be preserved by construction, but after cuda-profiling the code, it turns out that in the latter case all the global loads are non-coalesced (non-coherent)

here is extract from cuda_profile.log:

method=[ _Z14dev_compute_csILi256EEviP6float4Pf ] gputime=[ 35762.594 ] cputime=[ 35776.000 ] occupancy=[ 0.667 ] gld_incoherent=[8716288] gld_coherent=[ 0 ] gst_incoherent=[ 62 ] gst_coherent=[ 4 ]

As a test, I split float4 into 4 independent arrays and run a kernel with the following loop:

while (i < n) {

float v = f(x[i], y[i], z[i], w[i]);

int j = i + blockSize;

v = fmax(v, f(x[j], y[j], z[j], w[j]));

sdata[tid] = fmaxf(sdata[tid], v);

i += gridSize;

}

__syncthreads();

Here, x, y, z & w are all float. As expected all the reads are coalesced:

method=[ *Z14dev_compute_csILi64EEviPfS0_S0_S0_S0* ] gputime=[ 17121.695 ] cputime=[ 17133.000 ] occupancy=[ 0.500 ] gld_incoherent=[ 0 ] gld_coherent=[ 817280 ] gst_incoherent=[ 10 ] gst_coherent=[ 4 ]

and it runs 2x faster!

the data set size is 7216x2416 which results in 16GB/s in the float case & 8GB/s in float4 case (there are about 20 flops + 1 fmaxf + 2x squareroot in f-function).

Iâ€™d like to use float4 because other parts of the code are more efficient with float4-array rather than with 4-float arrays. Moreover, in other kernels of the same code, the float4-reads are coalesced.

Unfortunately, I failed to find the cause what is going on, where is the bug in the algorithm and how to fix it.

If some of you have ideas & suggestions, feel free to suggest! Iâ€™m very curious what I am missing here.

Cheers,

Evghenii

Added: The runs were carried out on both 8800Ultra & 8800GTS(512)

PS: The float4-kernel was launched with grid of 64, 128 & 256 blocks & with blockSize=64, 128 & 256. The result are all the same, reads are not coalesced. The extract from cuda profile was from the kernel launch where grid had 256 blocks & blockSize was 256. The 4-float array with coalesced reads used blockSize=128 threads= (128,1,1) & grid = (64,1,1,)