Ladies & Gentlemen around here,
I came across a very peculiar behaviour which I cannot really understand. I need to implement a reduction of a large data set (1000x1000 elements or more). There is an excellent reduction example in the CUDA SDK which I used as a starting point.
However, in my case I need to calculate the maximal value, instead of the sum, of a function which depends on 4 variables. These 4 variables are stored in float4 format in the data set.
In other words, I need to find
[font=“Courier”]value = max(f(xi,yi,zi,wi)),[/font] where i ranges over all elements in the data set.
I used the reduce6 kernel from the reduction example in the CUDA SDK (cuda-sdk/projects/reduction/reduction_kernel.cu).
The changes are simple: everywhere I replaced “x += y” with “x = fmaxf(x,y)”. Another change was in the following loop:
[font=“Courier”] // we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridSize). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
[/font]
I replaced it with
[font=“Courier”]
while (i < n) {
float4 el = f4_idata[i];
float v = f(el.x, el.y, el.z, el.w);
el = f4_idata[i+blockSize];
v = fmaxf(v, f(el.x, el.y, el.z, el.w));
sdata[tid] = fmaxf(sdata[tid], v);
i += gridSize;
}
__syncthreads();
[/font]
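For completeness, the rest of the kernel was changed the same way; the reduce6 shared-memory tree with every “+=” turned into fmaxf looks roughly like this (a sketch, not verbatim from my code):

```cuda
// reduce6 shared-memory tree with "+=" replaced by fmaxf, as described above
if (blockSize >= 512) { if (tid < 256) { sdata[tid] = fmaxf(sdata[tid], sdata[tid + 256]); } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] = fmaxf(sdata[tid], sdata[tid + 128]); } __syncthreads(); }
if (blockSize >= 128) { if (tid <  64) { sdata[tid] = fmaxf(sdata[tid], sdata[tid +  64]); } __syncthreads(); }
if (tid < 32) {
    // final warp: no __syncthreads() needed, as in the SDK example
    if (blockSize >= 64) sdata[tid] = fmaxf(sdata[tid], sdata[tid + 32]);
    if (blockSize >= 32) sdata[tid] = fmaxf(sdata[tid], sdata[tid + 16]);
    if (blockSize >= 16) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  8]);
    if (blockSize >=  8) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  4]);
    if (blockSize >=  4) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  2]);
    if (blockSize >=  2) sdata[tid] = fmaxf(sdata[tid], sdata[tid +  1]);
}
// thread 0 writes the block maximum to global memory
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
```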
It works as expected. I was expecting that coalescing would be preserved by construction, but after profiling the code with the CUDA profiler it turns out that in this case all the global loads are non-coalesced (incoherent).
Here is an extract from cuda_profile.log:
method=[ _Z14dev_compute_csILi256EEviP6float4Pf ] gputime=[ 35762.594 ] cputime=[ 35776.000 ] occupancy=[ 0.667 ] gld_incoherent=[8716288] gld_coherent=[ 0 ] gst_incoherent=[ 62 ] gst_coherent=[ 4 ]
As a test, I split the float4 array into 4 independent float arrays and ran a kernel with the following loop:
[font=“Courier”]
while (i < n) {
float v = f(x[i], y[i], z[i], w[i]);
int j = i + blockSize;
v = fmaxf(v, f(x[j], y[j], z[j], w[j]));
sdata[tid] = fmaxf(sdata[tid], v);
i += gridSize;
}
__syncthreads();
[/font]
Here, x, y, z & w are all float arrays. As expected, all the reads are coalesced:
method=[ _Z14dev_compute_csILi64EEviPfS0_S0_S0_S0_ ] gputime=[ 17121.695 ] cputime=[ 17133.000 ] occupancy=[ 0.500 ] gld_incoherent=[ 0 ] gld_coherent=[ 817280 ] gst_incoherent=[ 10 ] gst_coherent=[ 4 ]
and it runs 2x faster!
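(The split itself was a one-off host-side preprocessing step, something along these lines; names are hypothetical:)

```c
#include <stddef.h>

/* Host-side sketch: split a float4 array (AoS) into four float arrays (SoA). */
typedef struct { float x, y, z, w; } float4_h;  /* host stand-in for CUDA's float4 */

void split_float4(const float4_h *in, size_t n,
                  float *x, float *y, float *z, float *w)
{
    for (size_t i = 0; i < n; ++i) {
        x[i] = in[i].x;
        y[i] = in[i].y;
        z[i] = in[i].z;
        w[i] = in[i].w;
    }
}
```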
The data set size is 7216x2416, which results in 16 GB/s in the 4-float case & 8 GB/s in the float4 case (the f-function contains about 20 flops, 1 fmaxf & 2 square roots).
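The bandwidth figures follow directly from the profiler timings: each pass reads 16 bytes per element (one float4, or 4 floats), so

```c
/* Back-of-the-envelope check of the quoted bandwidth numbers:
   bytes read per pass divided by the profiler's gputime (microseconds). */
double bandwidth_gbps(double n_elements, double gputime_us)
{
    double bytes = n_elements * 16.0;            /* 16 bytes per element */
    return bytes / (gputime_us * 1e-6) / 1e9;    /* GB/s */
}
/* bandwidth_gbps(7216.0 * 2416.0, 35762.594) -> ~7.8  GB/s (float4 kernel)
   bandwidth_gbps(7216.0 * 2416.0, 17121.695) -> ~16.3 GB/s (4-float kernel) */
```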
I’d like to use float4 because other parts of the code are more efficient with a float4 array than with 4 float arrays. Moreover, in other kernels of the same code the float4 reads are coalesced.
Unfortunately, I failed to figure out what is going on, where the bug in the algorithm is & how to fix it.
If some of you have ideas & suggestions, feel free to share! I’m very curious what I am missing here.
Cheers,
Evghenii
Added: the runs were carried out on both an 8800 Ultra & an 8800 GTS (512).
PS: The float4 kernel was launched with grids of 64, 128 & 256 blocks & with blockSize = 64, 128 & 256. The results are all the same: the reads are not coalesced. The extract from the CUDA profile above is from the launch with a 256-block grid & blockSize = 256. The 4-float-array kernel with coalesced reads used blockSize = 128, threads = (128,1,1) & grid = (64,1,1).
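For concreteness, the profiled float4 launch would look roughly like this, reconstructed from the mangled name (_Z14dev_compute_csILi256EEviP6float4Pf demangles to void dev_compute_cs<256>(int, float4*, float*)) & the 256-block / 256-thread configuration; variable names are hypothetical:

```cuda
// Reconstruction of the profiled launch; n, d_idata & d_odata are hypothetical names.
dim3 block(256, 1, 1);
dim3 grid(256, 1, 1);
size_t smem = block.x * sizeof(float);  // one float of shared memory per thread
dev_compute_cs<256><<<grid, block, smem>>>(n, d_idata, d_odata);
```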