I’m trying to get a “maximum cut” of an array of floats. Let’s say we have an M x N array where N (e.g. 1023) is much smaller than M (e.g. 32768).
If I calculate the cut in line direction (M) I can generate 32768 threads, which (with at most 512 threads per block) is 128 blocks — pretty good on my Tesla C870 with 16 MPs. Each thread compares the 511 values in “its” row and returns the maximum. I know this might not be the most efficient way, but the performance of this calculation is no bottleneck for now, so I won’t touch it.
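A minimal sketch of what that fast direction could look like, assuming a hypothetical layout in which the N lines are stored one after another (`data[line * m + pos]`) so that neighbouring threads read neighbouring addresses; all names and the launch configuration are illustrative, not taken from the post:

```cuda
#include <float.h>

// One thread per position along M: thread `pos` takes the maximum over all
// N lines at that position. At every loop iteration neighbouring threads
// read neighbouring addresses, so the loads coalesce.
// Hypothetical layout: data[line * m + pos], with m = 32768 and n = 1023.
__global__ void maxOverLines(const float *data, float *out, int m, int n)
{
    int pos = blockIdx.x * blockDim.x + threadIdx.x;
    if (pos >= m) return;

    float best = -FLT_MAX;
    for (int line = 0; line < n; ++line)
        best = fmaxf(best, data[line * m + pos]);
    out[pos] = best;
}

// Illustrative launch: 32768 threads in blocks of 256, i.e. 128 blocks.
// maxOverLines<<<m / 256, 256>>>(d_data, d_out, m, n);
```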
The problem is the other direction. Here, as a first approach, I implemented the maximum detection the same way and of course got really bad timing (about 0.35 s), because the 1023 threads occupy only 2 of my 16 MPs and each thread has to compare 32768 values.
I have now tried to split the detection into two stages. The first stage divides the 32768 values in the M direction into 128 chunks (to get at least 128 blocks), so each thread compares just 1/128th of the data it compared before. The second stage compares the resulting 128 maxima per line.
The second stage is now quite fast, but I got almost no improvement in the timing of the first stage.
Does anyone have an idea how to handle this? Are there any pre-implemented functions for two-dimensional maximum calculation?
I would suggest you look at modifying the reduction example to find the maximum value; that way you will probably be able to do much better. You can do N reductions in parallel (using e.g. the y-dimension of your grid). It will take some modification of the reduction code, but for finding sum, mean, max & min, nothing beats the reduction example.
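A minimal sketch of that suggestion, with one reduction per line selected via blockIdx.y; it follows the shape of the SDK reduction sample with the addition replaced by fmaxf. The layout (`data[line * width + pos]`) and all names are assumptions for illustration:

```cuda
#include <float.h>

// Tree reduction in shared memory; blockIdx.y selects the line, blockIdx.x
// the chunk of that line this block reduces. A second, much smaller pass
// over the per-chunk maxima finishes the job.
__global__ void lineMaxReduce(const float *data, float *partial, int width)
{
    extern __shared__ float sdata[];
    int line = blockIdx.y;
    int i    = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced load: consecutive threads read consecutive addresses.
    sdata[threadIdx.x] = (i < width) ? data[line * width + i] : -FLT_MAX;
    __syncthreads();

    // Halve the number of active threads each step, keeping the larger value.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] = fmaxf(sdata[threadIdx.x],
                                       sdata[threadIdx.x + s]);
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[line * gridDim.x + blockIdx.x] = sdata[0];
}
```

Launched as, say, `lineMaxReduce<<<dim3(width / 256, nLines), 256, 256 * sizeof(float)>>>(...)`, this leaves `width / 256` candidates per line for a small second pass.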
Thanks for the quick reply, I will try to understand that example.
Another idea: is it possible that I’m running into problems with uncoalesced or otherwise bad memory accesses?
I’m reading directly from global memory, and each thread of one block accesses addresses with a stride of 32768. Could this be a problem?
I have another procedure running on that data, and it also shows bad performance with the first structure and no improvement after switching to the second one (although I increased the number of blocks from 2 to 256, so you would expect a big performance boost).
Maybe I misunderstood you, but I think you may have misunderstood the requirements for coalesced memory access: it does not matter what a single thread does or what kind of stride it uses; what matters is that the threads of the same half-warp access addresses that are immediate neighbours (which, on the face of it, should cause much more trouble in the case that seems to work for you).
I would guess that when you split each maximum calculation into 128 parts, you did that in a way that broke memory coalescing, thus killing any possible speedup.
This would be the case if you split it like that:
Thread 1: column 1, part 1
Thread 2: column 1, part 2
Thread n: column 2, part 1
Because then the threads in one half-warp (e.g. 1–16) use addresses that are not close together at all.
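To make this concrete, here is a hedged sketch of the two mappings, assuming a hypothetical layout `data[line * width + pos]`; all names are illustrative:

```cuda
#include <float.h>

// Uncoalesced split (the one described above): threads of one half-warp
// scan different chunks of the SAME line, so at every step their addresses
// are a whole chunk apart.
__global__ void chunkMaxBad(const float *data, float *partial, int width)
{
    int line  = blockIdx.x;
    int chunk = threadIdx.x;          // e.g. 128 chunks per line
    int len   = width / blockDim.x;
    float best = -FLT_MAX;
    for (int j = 0; j < len; ++j)     // neighbouring threads stay `len` floats apart
        best = fmaxf(best, data[line * width + chunk * len + j]);
    partial[line * blockDim.x + chunk] = best;
}

// Coalesced split: consecutive threads take consecutive positions and stride
// by the total thread count, so each half-warp always touches 16 adjacent
// floats in a single transaction.
__global__ void chunkMaxGood(const float *data, float *partial, int width)
{
    int line     = blockIdx.y;
    int tid      = blockIdx.x * blockDim.x + threadIdx.x;
    int nThreads = gridDim.x * blockDim.x;
    float best   = -FLT_MAX;
    for (int i = tid; i < width; i += nThreads)
        best = fmaxf(best, data[line * width + i]);
    partial[line * nThreads + tid] = best;
}
```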
I have now implemented a max calculation based on the reduction example and saw a big performance boost, but the data was incorrect. I’m debugging now; if the performance stays the same, this would be really great.
But the second procedure does not work that way, so I still have the problem with the memory accesses. Since I have a matrix here, I need to work along both dimensions. Does that mean one direction will show good performance because the data are neighbours, while the other will be really slow to calculate? Is there a way to avoid this without transposing the data (which would just hit the same problem)?
By following the reduction example I managed to shrink the timing for the “horizontal” max detection from 350 ms to 8 ms. The catch is that in the other dimension this version of the max calculation is really bad (from 4 ms with my old algorithm to about 450 ms). So I now use my algorithm in one direction and the reduction algorithm in the other.
What I think I have learned from this example: the threads of a warp (or half-warp) should access neighbouring addresses, so that the MP can fetch the data from memory in a single transaction rather than one fetch per thread.
The next stage is to implement a good filter for downscaling this array (I don’t know if I should open a new thread). I’ve tried the box filter example, which shows the same bad behaviour for filtering in the x direction (the y direction works quite fine). The other problem is that a box filter is not really good at anti-aliasing the values.
Since I’m working with graphics cards here, I would like to know whether there is any predefined function or library for efficient downscaling of arrays, images or the like.
The transpose approach works; I’m using it for a similar task.
However, I think a better solution is to adapt the convolutionSeparable example, if it applies here. It has a nice, optimised kernel for the column pass. I went the transpose way because I didn’t understand the convolutionSeparable approach.
The next optimisation in my code is actually to remove the transpose and use a separate vertical kernel.
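For reference, the transpose used in the interim solution can be sketched along the lines of the SDK transpose sample: a tile is staged in shared memory so that both the read and the write are coalesced. Tile size, names and the layout are illustrative assumptions:

```cuda
#define TILE_DIM 16

// out becomes the transpose of in (in is width x height, row-major).
__global__ void transposeTiled(const float *in, float *out,
                               int width, int height)
{
    // The +1 padding avoids shared-memory bank conflicts on the column read.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < width && y < height)                 // coalesced read
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Block (bx, by) writes the transposed tile at (by, bx).
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < height && y < width)                 // coalesced write
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```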