I’m trying to get a “maximum cut” of an array of floats. Let’s say we have a M x N array where N (e.g. 1023) is much smaller than M (e.g. 32768).
If I calculate the cut in line direction (M) I can generate 32768 threads with a max of 512 per block this is 128 blocks, which is pretty good on my Tesla C870 with 16MPs. Each thread compares the 511 Values in “its” row and returns the maximum. I know that this might not be the most effective way but the performance of this calculation is no bottleneck for now, so I won’t touch it.
The Problem is the other direction. Here as a first approach I generated the maximum detection likewise and of course got a really bad timing (taking about 0.35s) because the 1023 threads only occupy 2 of my 16MPs and each thread has to compare 32768 values.
I now tried to split the detection into two stages. The first stage divides the 32768 values in M-direction into 128 chunks of values (to have at least 128 MP blocks), thus having each thread compare just 1/128th of the data compared in the version before. The second stage compares the resulting 128 maxima per line.
Now the second stage is quite fast but I almost got no improvements in the timing of the first stage.
Does anyone have an idea how to handle this? Are there any preimplemented functions for 2-dimensional maximum calculation?