parallel maximum detection bad performance


I’m trying to get a “maximum cut” of an array of floats. Let’s say we have a M x N array where N (e.g. 1023) is much smaller than M (e.g. 32768).

If I calculate the cut in line direction (M) I can generate 32768 threads with a max of 512 per block this is 128 blocks, which is pretty good on my Tesla C870 with 16MPs. Each thread compares the 511 Values in “its” row and returns the maximum. I know that this might not be the most effective way but the performance of this calculation is no bottleneck for now, so I won’t touch it.

The Problem is the other direction. Here as a first approach I generated the maximum detection likewise and of course got a really bad timing (taking about 0.35s) because the 1023 threads only occupy 2 of my 16MPs and each thread has to compare 32768 values.

I now tried to split the detection into two stages. The first stage divides the 32768 values in M-direction into 128 chunks of values (to have at least 128 MP blocks), thus having each thread compare just 1/128th of the data compared in the version before. The second stage compares the resulting 128 maxima per line.

Now the second stage is quite fast but I almost got no improvements in the timing of the first stage.

Does anyone have an idea how to handle this? Are there any preimplemented functions for 2-dimensional maximum calculation?


I would suggest you look at modifying the reduction example to find the maximum value. That way you will probably be able to do much better. You can do N reductions parallel (using e.g. the y-dimension of your grid) It will take some modification of the reduction code, but for finding, sum, mean, max & min, nothing beats the reduction example.

Thanks for the quick reply, I will try to understand that example.

Another Idea: Is it possible, that I get problems wit uncoalesced or bad memory accesses?

I’m reading directly from global memory and each thread of one block has a stride of 32768 of the accessed adress. Could this be a problem?

I have another procedure running on that data and it also shows bad performance in the first case of structure and no improvement after switching to the second structure. (Although I increased the number of blocks from 2 to 256, so you would expect a big performance boost).

Maybe I misunderstood you, but I think you may have misunderstood the requirements for coalesced memory access: It does not matter what a single thread does and what kind of stride it uses, what is important is that the threads in the same half-warp access adresses which are immediate neighbours (which should cause much more problems with the case that seems to work for you).

I would guess that when you split each maximum calculation into 128 parts, you did that in a way that broke memory coalescing, thus killing any possible speedup.

This would be the case if you split it like that:

Thread 1: column 1, part 1

Thread 2: column 1, part 2

Thread n: column 2, part 1

Because then the threads in one half-warp (e.g. 1-16) use adresses that are not close together at all.

You must split it like this:

Thread 1: column 1, part 1

Thread 2: column 2, part 1

Thread n: column 1, part 2

Alright, that was exactly what I wanted to know.

I now have implemented a max calculation from the reduction calculation, seen a big performance boost but the data was incorrect. Now I’m at debugging, but if the performance stays the same this would be really great.

But the second procedure does not work that way. So still the problem with the memory access. For I have a matrix here I do need to work from both dimensions. Therefore one side will show a good performance because of the data neighbours and the other side will be really poor to calculate? Is there a way to avoid this without transposing the data (which would hit the same problem)?

I’ve been trying to explain coalesced access to some colleagues, and I think I’ve found a simple way of seeing and explaining it. May I present it?

Very, very roughly, the formula is like this:

threadID % 16 == globalArrayIndex % 16

This will be quick:

tmp = globalArray[threadIdx.x];

This will be slow:

tmp = globalArray[threadIdx.x + 1];

This will be quick again:

tmp = globalArray[threadIdx.x + 16];

Then, there are a million exceptions, but this is the rule as I see it and as I follow it.

Huh? For global memory I have not seen anything saying something about “unaligned” access, so

tmp = globalArray[threadIdx.x + 1];

should work just as well as

tmp = globalArray[threadIdx.x];

for global (and local) memory. Shared memory is a different thing though.

But here I think the question is

tmp = globalArray[threadIdx.x];

tmp2 = globalArray[threadIdx.x + 32768];


tmp = globalArray[32768*threadIdx.x];

tmp2 = globalArray[32768*threadIdx.x + 1];

and to my understanding the second one will be much slower.

Unfortunately without seeing the code it is hard to know which of the many possible things are going wrong here.

Read the programming guide, page 53 and further. The base address has some requirements too for coalescing.

Thanks for the hints and explanation so far!

I managed to shrink down the timing for the “horizontal” max detection from 350ms to 8ms with looking at the reduction problem. Other thing is that in the other dimension this version of max calculation is really bad (from 4ms with my old algorithm to about 450ms). So I now use my algorithm in one direction and the reduction algorithm in the other one.

What I think to realize from this example:

Each thread of a warp (or half warp or something) should be using neighbouring adresses for the MP is then able to pick up the data from memory in just one fetch and not for each thread individually.

Next stage is to implement a good filter for downscaling this array (don’t know if I should open a new thread). I’ve tried the box filter example which is showing the same bad behaviour for the filtering in x direction (y direction works quite fine). Other problem is that a box filter function is not really good at anti aliasing the values.
For I’m working with graphic cards here I would like to know if there is any predefined function or library for effective downscaling of arrays, images or something like that.

I suspect you have uncoalesced reads on the column pass.

Couldn’t you do the reduction in the horizontal direction, transpose the array, and then do the horiz. reduction again? Look at the transpose example in the SDK.

I agree with this, and …

the transpose thing works, I’m using it for a similar task.

However, I think a better solution is to ape the ConvolutionSeparable example, if it applies here. It has a nice, optimised kernel for the column pass. I went the transpose way because I didn’t understand the convolutionSeparable process.

The next optimisation in my code is actually to remove the transpose and use a separate vertical kernel.

If you are just reading, tex2D reads from a cudaArray are just as fast along rows and down columns.