I am doing reductions with these requirements:
- I need to do it inside a kernel as a block-wide reduction
- I don’t want to make any assumptions about the size of the array
- I care about performance
I have worked with CUB, but with it I need to assume at least a maximum array size.
I would prefer not to implement it myself, because I want to achieve the best performance.
Are there any alternatives?
What is the problem with your current approach?
You can simply iterate over fixed-size chunks of your input data, reduce each chunk, and accumulate the per-chunk results.
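For illustration, here is a minimal sketch of that chunked approach with `cub::BlockReduce`, assuming a float sum, a single thread block, and a compile-time block size. The names `chunked_block_sum`, `chunked_block_sum_host`, and `BLOCK_THREADS` are made up for this example, and error checking is omitted for brevity.

```cuda
#include <cub/block/block_reduce.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_THREADS = 256;  // assumed block size, fixed at compile time

// One block walks the array in chunks of BLOCK_THREADS elements and
// reduces each chunk with CUB. The array length n is a runtime value.
__global__ void chunked_block_sum(const float* in, int n, float* out)
{
    using BlockReduce = cub::BlockReduce<float, BLOCK_THREADS>;
    __shared__ typename BlockReduce::TempStorage temp;

    float total = 0.0f;  // running sum, only meaningful in thread 0
    // The loop bound is the same for every thread, so the whole block
    // takes part in each collective call.
    for (int base = 0; base < n; base += BLOCK_THREADS) {
        int i = base + threadIdx.x;
        float v = (i < n) ? in[i] : 0.0f;  // pad the last chunk with zeros
        float chunk_sum = BlockReduce(temp).Sum(v);  // valid in thread 0 only
        if (threadIdx.x == 0) total += chunk_sum;
        __syncthreads();  // temp storage is reused by the next iteration
    }
    if (threadIdx.x == 0) *out = total;
}

// Hypothetical host helper: copy in, launch one block, copy the result back.
float chunked_block_sum_host(const std::vector<float>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    float *d_in = nullptr, *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    chunked_block_sum<<<1, BLOCK_THREADS>>>(d_in, n, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Only the block size has to be known at compile time here; the array length is an ordinary kernel argument. The cost is one block-wide reduction and one barrier per chunk.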
Yes, this is the approach I am using right now (previously I had only tested CUB with fixed-size input).
I wondered whether there is something more efficient than a for loop over fixed-size thread blocks, since I guess this is a common problem.
You can write a block-wide reduction using a block-stride loop. It will be very efficient, and it makes no assumptions about the block size (other than that it is a power of 2) or the data set size.
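A rough sketch of what such a block-stride reduction could look like, assuming a float sum, a single thread block, and a power-of-2 block size. The names `block_stride_sum` and `block_stride_sum_host` are made up for this example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread first sums the elements threadIdx.x, threadIdx.x + blockDim.x, ...
// into a register (the block-stride loop), then the block combines the
// per-thread partial sums with one shared-memory tree reduction.
// Assumes blockDim.x is a power of 2; n can be any non-negative value.
__global__ void block_stride_sum(const float* in, int n, float* out)
{
    extern __shared__ float sdata[];  // blockDim.x floats, sized at launch

    float acc = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        acc += in[i];

    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) *out = sdata[0];
}

// Hypothetical host helper: launch one block of `threads` threads
// (must be a power of 2) and return the sum.
float block_stride_sum_host(const std::vector<float>& h_in, int threads)
{
    const int n = static_cast<int>(h_in.size());
    float *d_in = nullptr, *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    block_stride_sum<<<1, threads, threads * sizeof(float)>>>(d_in, n, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Because the per-thread accumulation happens in registers, the block synchronizes only during the final tree reduction, regardless of how large the array is.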