I need to do it inside a kernel as a block-wide reduction
I don’t want to make any assumptions about the size of the array
I care about performance
I have worked with CUB, but with it I need to assume at least a maximum size.
I would prefer not to implement it myself, because I want to achieve the best performance.
You can write a block-wide reduction using a block-stride loop. It is very efficient and makes no assumptions about the block size (other than that it is a power of 2) or the data set size.
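
As a minimal sketch of that idea (not a tuned implementation): each thread first accumulates a private partial sum with a block-stride loop, then the partials are combined with a shared-memory tree reduction. The names `blockReduceSum`, `blockSumKernel` and `runBlockSum` are hypothetical, and the sketch assumes an `int` sum, a single block, and a power-of-2 `blockDim.x`. Error checking is omitted for brevity.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: every thread of the block must call this.
// `sdata` must hold at least blockDim.x ints of shared memory.
// Assumes blockDim.x is a power of 2; `n` may be any size, including 0.
__device__ int blockReduceSum(const int *data, size_t n, int *sdata)
{
    // Block-stride loop: thread t sums elements t, t + blockDim.x, ...
    int acc = 0;
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        acc += data[i];
    sdata[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    int total = sdata[0];   // every thread reads the final sum
    __syncthreads();        // make sdata safe to reuse by the caller
    return total;
}

// Example kernel: one block reduces the whole array.
__global__ void blockSumKernel(const int *data, size_t n, int *result)
{
    extern __shared__ int sdata[];  // sized at launch: blockDim.x ints
    int total = blockReduceSum(data, n, sdata);
    if (threadIdx.x == 0)
        *result = total;
}

// Host-side driver for illustration only (no error checking).
int runBlockSum(const std::vector<int> &h, int threads)
{
    int *d_data = nullptr, *d_out = nullptr, out = 0;
    cudaMalloc(&d_data, h.size() * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_data, h.data(), h.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    // Third launch parameter: dynamic shared memory, one int per thread.
    blockSumKernel<<<1, threads, threads * sizeof(int)>>>(d_data, h.size(),
                                                          d_out);
    cudaMemcpy(&out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_out);
    return out;
}
```

Because the shared array is sized at launch time, nothing about the block size or the array length is fixed at compile time. Replacing the last few steps of the tree with warp shuffles would likely be faster, but that is left out here to keep the sketch short.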