Coalesced vs non-coalesced in reduction example: why are float4 reads not coalesced?

Ladies & Gentlemen around here,

I came across some very peculiar behaviour that I cannot really understand. I need to implement a reduction of a large data set (1000x1000 or more). There is an excellent example of reduction in the CUDA SDK, which I used as a starting point.

However, in my case I need to calculate the maximum, instead of the sum, of a function which depends on 4 variables. These 4 variables are stored in float4 format in the data set.

In other words, I need to find

value = max(f(xi,yi,zi,wi)), where i ranges over all elements in the data set.

I used the reduce6 kernel from the reduction example in the CUDA SDK (cuda-sdk/projects/reduction/reduction_kernel.cu).

The changes are simple: everywhere I replaced “x += y” with “x = fmaxf(x,y)”. Another change was in the following loop:

// we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridSize). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();

I replaced it with

while (i < n) {
float4 el = f4_idata[i];
float v = f(el.x, el.y, el.z, el.w);

el = f4_idata[i+blockSize];
v = fmax(v, f(el.x, el.y, el.z, el.w));
sdata[tid] = fmaxf(sdata[tid], v);

i += gridSize;
}
__syncthreads();

It works as expected. I was expecting that coalescing would be preserved by construction, but after profiling the code with the CUDA profiler, it turns out that in the latter case all the global loads are non-coalesced (non-coherent).

Here is an extract from cuda_profile.log:
method=[ _Z14dev_compute_csILi256EEviP6float4Pf ] gputime=[ 35762.594 ] cputime=[ 35776.000 ] occupancy=[ 0.667 ] gld_incoherent=[8716288] gld_coherent=[ 0 ] gst_incoherent=[ 62 ] gst_coherent=[ 4 ]

As a test, I split the float4 array into 4 independent arrays and ran a kernel with the following loop:

while (i < n) {
float v = f(x[i], y[i], z[i], w[i]);

int j = i + blockSize;
v = fmax(v, f(x[j], y[j], z[j], w[j]));
sdata[tid] = fmaxf(sdata[tid], v);

i += gridSize;
}
__syncthreads();

Here, x, y, z & w are all float. As expected all the reads are coalesced:
method=[ _Z14dev_compute_csILi64EEviPfS0_S0_S0_S0_ ] gputime=[ 17121.695 ] cputime=[ 17133.000 ] occupancy=[ 0.500 ] gld_incoherent=[ 0 ] gld_coherent=[ 817280 ] gst_incoherent=[ 10 ] gst_coherent=[ 4 ]

and it runs 2x faster!

The data set size is 7216x2416, which gives 16GB/s in the float case & 8GB/s in the float4 case (there are about 20 flops + 1 fmaxf + 2 square roots in the f-function).
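The bandwidth figures above can be reproduced from the profiler output. The following is only a rough sketch in plain host-side C++: it assumes that gputime is reported in microseconds and that every element (16 bytes, whether one float4 or 4 separate floats) is read from global memory exactly once. The function names are hypothetical and are not part of the SDK.

```cpp
#include <cstdio>

// Effective read bandwidth in GB/s (1 GB = 1e9 bytes).
// Assumptions: gputime_us is in microseconds, and each element is read
// from global memory exactly once.
double effective_bandwidth_gbs(long long elements, int bytes_per_element,
                               double gputime_us) {
    double bytes = static_cast<double>(elements) * bytes_per_element;
    double seconds = gputime_us * 1e-6;
    return bytes / seconds / 1e9;
}

// Prints the estimate for the two profiler runs quoted in the post.
void print_bandwidth_estimates() {
    const long long elements = 7216LL * 2416LL;  // data set size from the post
    const int bytes_per_element = 16;            // one float4, or 4 floats
    std::printf("4x float arrays: %.1f GB/s\n",
                effective_bandwidth_gbs(elements, bytes_per_element, 17121.695));
    std::printf("float4 array:    %.1f GB/s\n",
                effective_bandwidth_gbs(elements, bytes_per_element, 35762.594));
}
```

With the two gputime values quoted above, this gives roughly 16 GB/s and 8 GB/s respectively, in line with the figures in the post.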

I’d like to use float4 because other parts of the code are more efficient with float4-array rather than with 4-float arrays. Moreover, in other kernels of the same code, the float4-reads are coalesced.

Unfortunately, I have failed to work out what is going on, where the bug in the algorithm is, and how to fix it.

If some of you have ideas & suggestions, feel free to share them! I’m very curious what I am missing here.

Cheers,
Evghenii

Added: The runs were carried out on both 8800Ultra & 8800GTS(512)

PS: The float4 kernel was launched with grids of 64, 128 & 256 blocks and with blockSize = 64, 128 & 256. The results are all the same: the reads are not coalesced. The extract from the CUDA profile above is from the kernel launch where the grid had 256 blocks and blockSize was 256. The 4-float-array kernel with coalesced reads used blockSize = 128, i.e. threads = (128,1,1) and grid = (64,1,1).

If float4 el = f4_idata[i]; was coalesced before with floats, then there isn’t any reason why it shouldn’t be coalesced now with float4 reads, assuming that the compiler generates a ld.global.v4.f32 instruction. I see that you are using a function on the components: f(el.x, el.y, el.z, el.w). Depending on how those parameters are used in that function, nvcc can sometimes generate multiple delayed loads using ld.global.v2.f32, or even single float loads, which breaks the coalescing.

To check for this, compile your kernel with the -keep or -ptx option and identify the load instruction in the code.

One hack to trick the compiler into making a single float4 load is to declare the float4 volatile:
volatile float4 el = f4_idata[i];
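To illustrate why single float loads out of a float4 array break coalescing, here is a simplified host-side C++ model of the coalescing rule that the CUDA Programming Guide describes for compute capability 1.0/1.1 hardware (the generation of the 8800 cards mentioned above). It is only a sketch: it assumes all 16 threads of a half-warp are active, it ignores everything except the address pattern, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified compute 1.0/1.1 rule for one half-warp (16 threads):
// words must be 4, 8 or 16 bytes wide, thread k must access the k-th word
// in sequence, and the first word must sit at the start of a segment of
// 16 * word_bytes bytes. Assumes all 16 threads are active.
bool half_warp_coalesced(const std::uint64_t addr[16], std::size_t word_bytes) {
    if (word_bytes != 4 && word_bytes != 8 && word_bytes != 16) return false;
    if (addr[0] % (16 * word_bytes) != 0) return false;  // segment alignment
    for (int k = 0; k < 16; ++k) {
        if (addr[k] != addr[0] + k * word_bytes) return false;  // k-th thread, k-th word
    }
    return true;
}

// Byte addresses read by threads 0..15 when thread k reads element k of an
// array whose elements are elem_bytes apart, at byte offset `offset` inside
// each element (e.g. offset 4 would select .y of a float4).
void fill_addresses(std::uint64_t addr[16], std::uint64_t base,
                    std::size_t elem_bytes, std::size_t offset) {
    for (int k = 0; k < 16; ++k) addr[k] = base + k * elem_bytes + offset;
}

// x[i] from a plain float array: 4-byte words, 4 bytes apart.
bool float_array_load_coalesced() {
    std::uint64_t a[16];
    fill_addresses(a, 0, 4, 0);
    return half_warp_coalesced(a, 4);
}

// One vector load per thread from a float4 array: 16-byte words, 16 bytes apart.
bool float4_vector_load_coalesced() {
    std::uint64_t a[16];
    fill_addresses(a, 0, 16, 0);
    return half_warp_coalesced(a, 16);
}

// Scalar load of the .x component from a float4 array: 4-byte words, but
// 16 bytes apart, so thread k no longer reads the k-th word.
bool float4_component_load_coalesced() {
    std::uint64_t a[16];
    fill_addresses(a, 0, 16, 0);
    return half_warp_coalesced(a, 4);
}
```

In this model the plain float array and the float4 vector load both satisfy the rule, while a scalar load of one component from the float4 array does not, because consecutive threads are 16 bytes apart while each reads only a 4-byte word.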

I inspected the .ptx and found 8x ld.global.f32. So you are completely right: the compiler reads 8x floats instead of 2x float4s.

I changed the code in the following way:

volatile float4 el = f4_idata[i];

and

volatile float4 el1 = f4_idata[i+blockSize];

and this made all the reads coalesced. There are now 2x ld.volatile.v4 instructions.

Here is the output of the profiler:

method=[ _Z14dev_compute_csILi512EEviP6float4Pf ] gputime=[ 8328.192 ] cputime=[ 8341.000 ] occupancy=[ 0.667 ] gld_incoherent=[ 0 ] gld_coherent=[ 272384 ] gst_incoherent=[ 16 ] gst_coherent=[ 0 ]

The speed-up is 3x and the bandwidth is ~32GB/s (80GFLOP/s)!

Thanks a lot for the help!

On the side: why does the compiler load 4x floats instead of 1x float4? In other parts of the code, where I also split a float4 into registers, the compiler actually loads a float4 (I’ve inspected the .ptx). Interestingly, the result is the same with both the CUDA v1.1 & CUDA v2.0 toolkits.

Cheers,

Evghenii

I have no idea myself. I’ve written dozens of kernels and only one has this problem. I would write it off as a compiler bug, but I don’t see a point in submitting it to NVIDIA: none of the other bugs (even simple ones) have been solved.

Well, some have been solved (well, actually ptxas bugs) but that one is a really difficult one. Splitting the reads can save registers and increase speed when you would not get coalescing anyways, but it will hurt if you could get coalescing otherwise. The compiler can not really know which is better.

That kind of thing is the normal thing you get when using a general-purpose language (here C) for special-purpose hardware without adding loads of language-extensions.

See Intel Itanium for how well “we will fix it in the compiler” works out…

It appears, however, that it is better to have coalesced float4 reads instead of 4x coalesced float reads: 1 memory transaction (400 cycles latency) vs 4 memory transactions (4x400 cycles latency). I guess in the former case it is easier to hide the latency. Indeed, my results show that it is 2x faster to have 1 coalesced float4 read (including flops on the read data) instead of 4x coalesced float reads (see my first post above).

This actually brings me to the following question.

We know that memory access latency can be hidden if there are enough active warps, and I see this in my results. However, I was wondering whether it is possible to hide bandwidth limitations?

For example, I read+write 8 floats in total. Assuming that latency is hidden, a bandwidth of 64GB/s and a peak performance of 256GFLOP/s, this translates to 8 floats * 4 bytes / 64GB/s * 256GFLOP/s = 128 flops needed to make sure that compute time is even with the available bandwidth.
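The break-even estimate above can be written as a small helper. This is just a sketch of the arithmetic in plain C++, with a hypothetical function name; it uses the fact that 1 GB/s is 1 byte per nanosecond and 1 GFLOP/s is 1 flop per nanosecond.

```cpp
// Number of flops a kernel must perform on `bytes_moved` bytes of global
// memory traffic so that compute time equals transfer time.
// bandwidth_gbs is in GB/s (bytes per ns), peak_gflops in GFLOP/s (flops per ns).
double break_even_flops(double bytes_moved, double bandwidth_gbs,
                        double peak_gflops) {
    double transfer_ns = bytes_moved / bandwidth_gbs;  // time to move the data
    return transfer_ns * peak_gflops;                  // flops that fit in that time
}
```

For the numbers in the post, break_even_flops(8 * 4, 64.0, 256.0) evaluates to 128.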

Let’s say I do have a function of 128 flops which operates on these 8 floats that I read from and write to global memory.

But what happens in this case? Will I reach half of the peak performance (128GFLOP/s instead of 256GFLOP/s) because it takes the same amount of wall-clock time to read data as to process it? Or perhaps the scheduler will hide this bandwidth and I will get the full 256GFLOP/s?

In my tests it appears that the former is true.
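The two outcomes in the question correspond to two simple timing models: memory and compute time add up (no overlap), or the longer of the two dominates (perfect overlap). The sketch below, in plain C++ with hypothetical names, evaluates both; it is an idealised model and says nothing about which one a given GPU will actually follow.

```cpp
#include <algorithm>

struct Throughput {
    double serial_gflops;      // memory and compute times add up
    double overlapped_gflops;  // memory and compute overlap perfectly
};

// flops and bytes are per work item; bandwidth_gbs is in GB/s (bytes per ns),
// peak_gflops is in GFLOP/s (flops per ns).
Throughput model_throughput(double flops, double bytes,
                            double bandwidth_gbs, double peak_gflops) {
    double t_mem = bytes / bandwidth_gbs;    // ns spent moving data
    double t_compute = flops / peak_gflops;  // ns spent on arithmetic
    Throughput t;
    t.serial_gflops = flops / (t_mem + t_compute);
    t.overlapped_gflops = flops / std::max(t_mem, t_compute);
    return t;
}
```

For 128 flops on 32 bytes at 64GB/s and 256GFLOP/s, the serial model gives 128GFLOP/s and the overlapped model gives 256GFLOP/s, which are the two cases asked about above.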

This is quite an important question for local problems where there is very little data re-use. I am currently writing a few astrophysical shock hydrodynamics methods on the GPU. These are local methods with a stencil size of 4 to 6 cells, i.e. one grid cell needs data from about 4 to 6 neighbouring cells, and there are few flops per grid cell (from a few hundred to a few thousand at most). If I just read/write data, I reach a bandwidth of ~32GB/s. However, once I add arithmetic, the bandwidth becomes 15GB/s and the performance about 90GFLOP/s. That is still more than 50x faster than the equivalent code on 1 core at 3.0GHz. Nevertheless, I use only 25% of the GPU’s resources.

Clearly, it is nice to get a factor of 2 or 3 more, but is it possible?

Is it something you would expect? Or perhaps I am missing something…

Evghenii

Oh, I know that. nvcc is actually pretty decent about issuing the vector loads, as long as you actually use all the values from the vector right away. If there are even a few math operations between using different elements of the vector the compiler tries to get smart and save a register by splitting the load.

This volatile workaround works now, but will it work in the future? And code with the volatile won’t compile in emulation mode since the CPU compilers barf on it, requiring a #ifdef and making more of a mess. I wouldn’t ask for the compiler to read minds, just a simple command line optimization option not to break vector loads… or even a pragma.

That’s not correct. float4’s coalesce poorly and have half the bandwidth of coalesced float2’s (which themselves are a tad slower than coalesced float’s). Make sure your 4x coalesced float is actually coalescing.

Yes, computation can definitely overlap bandwidth. But it’s complicated. Having 2 blocks per multiprocessor can help (or warps that do different things and don’t synchronize). Having a card with a 2:1 multiprocessor-to-memory controller ratio can help (I mean the GTX280). But even then, you have to make sure you’re hitting the memory controllers evenly.

Maybe. At least in my case the 4x coalesced floats are actually coalesced; at least this is what cuda_profile.log reports. So, from my personal experience it appears that a coalesced float4 is systematically faster than 4x coalesced floats when there is some degree of arithmetic intensity. One way to explain this, I guess, is that in the case of 1x coalesced float4 read there is 1 memory transaction, so only 400 cycles of latency. In the case of 4x coalesced floats there are 4 memory transactions, so the latency is 4x higher, 1600 cycles; it is harder to hide, I suppose.

Thanks for the explanations. I’ll check it with GTX280 and see if I get full speed. As for now in my code, on G80/G92, it appears that total_time = time_mem_access + time_compute.

Thanks.

Evghenii

This is only true on compute 1.0 hardware. compute 1.1 and newer have full bandwidth float4 coalesced reads.

Oh, I see. Still, they shouldn’t be faster.

Hmm, maybe. I don’t know. Were your 4x float reads bunched up together, or did you spread them out differently? I think if they’re bunched up, they’ll end up looking the same as 1x float4 on the DDR end. But if there’s some calcs between them (maybe a register gets reused), then it would incur quadrupled latency. It’s an interesting point. Anyone else observe float4s to be faster?