Coalesced vs. non-coalesced reads in the reduction example: why is float4 not coalesced?

Ladies & Gentlemen around here,

I came across a very peculiar behaviour which I cannot really understand. I need to implement a reduction over a large data set (1000x1000 or more). There is an excellent example of a reduction in the CUDA SDK which I used as a starting point.

However, in my case I need to calculate the maximal value, instead of the sum, of a function which depends on 4 variables. These 4 variables are stored in float4 format in the data set.

In other words, I need to find

[font=“Courier”]value = max(f(xi,yi,zi,wi)),[/font] where i ranges over all elements of the data set.
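The exact f is not important for this post; as a stand-in, think of something of this shape (hypothetical, not my real function, but with a similar cost: a few flops, two square roots and one fmaxf):

[font=“Courier”]
__device__ float f(float x, float y, float z, float w)
{
    // hypothetical stand-in with a cost profile similar to my real f
    return fmaxf(sqrtf(x*x + y*y), sqrtf(z*z + w*w));
}
[/font]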

I used the reduce6 kernel from the reduction example in the CUDA SDK (cuda-sdk/projects/reduction/reduction_kernel.cu).

The changes are simple: everywhere, I replaced “x += y” with “x = fmaxf(x, y)”. Another change was in the following loop:

[font=“Courier”]
// we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridSize). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
    sdata[tid] += g_idata[i] + g_idata[i+blockSize];
    i += gridSize;
}
__syncthreads();
[/font]

I replaced it with

[font=“Courier”]
while (i < n) {
    float4 el = f4_idata[i];
    float  v  = f(el.x, el.y, el.z, el.w);

    el = f4_idata[i+blockSize];
    v  = fmaxf(v, f(el.x, el.y, el.z, el.w));
    sdata[tid] = fmaxf(sdata[tid], v);

    i += gridSize;
}
__syncthreads();
[/font]
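For completeness, the rest of the kernel is the usual reduce6 shared-memory tree, again with “+=” replaced by fmaxf (a sketch, for blockSize <= 256 as in my runs):

[font=“Courier”]
// shared-memory tree reduction, max instead of sum
if (blockSize >= 256) { if (tid < 128) sdata[tid] = fmaxf(sdata[tid], sdata[tid + 128]); __syncthreads(); }
if (blockSize >= 128) { if (tid <  64) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  64]); __syncthreads(); }
if (tid < 32) {
    // last warp runs without further syncs, as in the SDK example
    if (blockSize >= 64) sdata[tid] = fmaxf(sdata[tid], sdata[tid + 32]);
    if (blockSize >= 32) sdata[tid] = fmaxf(sdata[tid], sdata[tid + 16]);
    if (blockSize >= 16) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  8]);
    if (blockSize >=  8) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  4]);
    if (blockSize >=  4) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  2]);
    if (blockSize >=  2) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  1]);
}
if (tid == 0) g_odata[blockIdx.x] = sdata[0];   // one partial max per block
[/font]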

It works as expected. I was expecting that coalescing would be preserved by construction, but after profiling the code with the CUDA profiler it turns out that, in the latter case, all the global loads are non-coalesced (non-coherent).

Here is an extract from cuda_profile.log:
method=[ _Z14dev_compute_csILi256EEviP6float4Pf ] gputime=[ 35762.594 ] cputime=[ 35776.000 ] occupancy=[ 0.667 ] gld_incoherent=[8716288] gld_coherent=[ 0 ] gst_incoherent=[ 62 ] gst_coherent=[ 4 ]

As a test, I split the float4 into 4 independent arrays and ran a kernel with the following loop:
[font=“Courier”]
while (i < n) {
    float v = f(x[i], y[i], z[i], w[i]);

    int j = i + blockSize;
    v = fmaxf(v, f(x[j], y[j], z[j], w[j]));
    sdata[tid] = fmaxf(sdata[tid], v);

    i += gridSize;
}
__syncthreads();
[/font]
Here, x, y, z & w are all float arrays. As expected, all the reads are coalesced:
method=[ _Z14dev_compute_csILi64EEviPfS0_S0_S0_S0_ ] gputime=[ 17121.695 ] cputime=[ 17133.000 ] occupancy=[ 0.500 ] gld_incoherent=[ 0 ] gld_coherent=[ 817280 ] gst_incoherent=[ 10 ] gst_coherent=[ 4 ]

and it runs 2x faster!

The data set size is 7216x2416, which results in 16 GB/s in the float case & 8 GB/s in the float4 case (7216×2416 elements × 16 bytes ≈ 0.28 GB per pass, divided by the gputimes above; the f-function itself costs about 20 flops + 1 fmaxf + 2 square roots).

I’d like to use float4 because other parts of the code are more efficient with a float4 array than with 4 separate float arrays. Moreover, in other kernels of the same code, the float4 reads are coalesced.

Unfortunately, I failed to figure out what is going on, where the bug in the algorithm is, and how to fix it.
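The only workaround I can think of is to read the float4 array as plain floats through shared memory, so that each thread issues 4-byte loads, and then reassemble. An untested sketch for a single tile of blockSize elements (ignoring the grid-stride loop, and accepting the shared-memory bank conflicts of the float4 read-back):

[font=“Courier”]
__shared__ float4 buf[blockSize];             // blockSize is the template parameter
float       *bp   = (float *)buf;             // view the tile as plain floats
const float *fp   = (const float *)f4_idata;  // view the input the same way
int          base = 4 * blockSize * blockIdx.x;

for (int k = tid; k < 4 * blockSize; k += blockSize)
    bp[k] = fp[base + k];                     // consecutive threads read consecutive floats
__syncthreads();

float4 el = buf[tid];                         // reassemble (bank conflicts, but on-chip)
float  v  = f(el.x, el.y, el.z, el.w);
[/font]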

If some of you have ideas & suggestions, feel free to suggest! I’m very curious what I am missing here.

Cheers,
Evghenii

Added: The runs were carried out on both an 8800 Ultra & an 8800 GTS (512).

PS: The float4 kernel was launched with grids of 64, 128 & 256 blocks & with blockSize = 64, 128 & 256. The results are all the same: the reads are not coalesced. The extract from the CUDA profile above is from the launch where the grid had 256 blocks & blockSize was 256. The 4-float-array kernel with coalesced reads used blockSize = 128, threads = (128,1,1) & grid = (64,1,1).
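For concreteness, here is what one of the float4 launches looked like, reconstructed from the profile signature dev_compute_cs<256>(int, float4 *, float *) (the variable names are made up):

[font=“Courier”]
int    n    = 7216 * 2416;              // number of float4 elements
dim3   grid (256, 1, 1);
dim3   block(256, 1, 1);
size_t smem = block.x * sizeof(float);  // sdata[] for the reduction

dev_compute_cs<256><<<grid, block, smem>>>(n, d_f4_idata, d_odata);
[/font]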

Which version of CUDA are you using?

What does the code for function f() look like? Is it possible that one of the vector components doesn’t get used? One possibility is that the compiler somehow splits up the loading of the 4 components into several reads, thus breaking coalescing.
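You can check this by compiling with nvcc -ptx and looking at the generated loads. Roughly, you want to see a single vectorized load per float4 element rather than four scalar ones (illustrative PTX, register numbers made up):

[font=“Courier”]
// good: one 16-byte vector load per float4 element
ld.global.v4.f32  {%f1, %f2, %f3, %f4}, [%rd5+0];

// bad: the 4 components fetched separately, breaking coalescing
ld.global.f32  %f1, [%rd5+0];
ld.global.f32  %f2, [%rd5+4];
ld.global.f32  %f3, [%rd5+8];
ld.global.f32  %f4, [%rd5+12];
[/font]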

Can you minimize the code, still keeping the coalescing behavior, and post a copy here? (Hint: the smaller the repro case, the greater the chance it will get expedient attention.)

Paulius