Maybe I am a bit dense today, but after staring at the example for several minutes I still haven’t figured out what the underlying algorithmic specification is.

In a single thread, an inclusive prefix sum is usually computed in order, starting from in[0]: out[i] = out[i-1] + in[i].
I would like to add a threshold to this: if out[i-1] + in[i] exceeds the threshold, the sum restarts and in[i] is output as is, without accumulation.
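
To pin down the intended behaviour, here is a minimal single-threaded sketch of that rule in plain C++. The function name threshold_scan and the sample input are made up for illustration; the threshold of 20 is the value mentioned later in the thread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference: out[i] = out[i-1] + in[i], except that the running
// sum restarts at in[i] whenever adding in[i] would push it above `thres`.
template <typename T>
std::vector<T> threshold_scan(const std::vector<T>& in, T thres) {
    std::vector<T> out(in.size());
    T running{};  // value-initialised to zero
    for (std::size_t i = 0; i < in.size(); ++i) {
        T tmp = running + in[i];
        running = (tmp > thres) ? in[i] : tmp;
        out[i] = running;
    }
    return out;
}

// Worked example (hypothetical input, threshold 20).
inline void threshold_scan_example() {
    const std::vector<int> in  = {5, 7, 6, 4, 9, 3, 12, 2};
    const std::vector<int> out = threshold_scan(in, 20);
    // 5, 12, 18, (22 > 20 -> 4), 13, 16, (28 > 20 -> 12), 14
    assert((out == std::vector<int>{5, 12, 18, 4, 13, 16, 12, 14}));
}
```

Every output depends on the previous output, which is what makes this easy in one thread and awkward to split across many.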

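template<typename T>
struct RoundSum {
    T thres;  // threshold above which the running sum restarts

    __host__ __device__ __forceinline__ T
    operator()(const T& a, const T& b) const {
        T tmp = a + b;
        // If the sum exceeds the threshold, drop the accumulated value and keep b.
        return (tmp > thres) ? b : tmp;
    }
};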
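
One thing worth checking before handing this operator to a parallel scan: a parallel scan combines partial results in a different grouping than the left-to-right loop, so it generally relies on the operator being associative. The sketch below uses a host-only copy of the operator (named RoundSumHost here, with the CUDA qualifiers dropped so it builds with an ordinary C++ compiler) and hand-picked values to show that the two groupings can disagree.

```cpp
#include <cassert>

// Host-only copy of the RoundSum operator from the post (CUDA qualifiers dropped).
template <typename T>
struct RoundSumHost {
    T thres;
    T operator()(const T& a, const T& b) const {
        T tmp = a + b;
        return (tmp > thres) ? b : tmp;
    }
};

// Hypothetical values chosen so that the two groupings differ.
inline void associativity_example() {
    const RoundSumHost<int> op{20};
    const int a = 10, b = 8, c = 5;
    const int left  = op(op(a, b), c);  // op(18, 5):  23 > 20 -> 5
    const int right = op(a, op(b, c));  // op(10, 13): 23 > 20 -> 13
    assert(left == 5);
    assert(right == 13);
    assert(left != right);  // the operator is not associative
}
```

So even though the operator reproduces the sequential rule when applied strictly left to right, a scan that regroups the operands may return different values.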

I too wish I could compute the prefix sum of the previous segment, but I end up in a circular dependency: finding a segment boundary requires already knowing the segment boundaries before it.

Can you find the segment boundaries by dividing every element of the prefix sum by 20 (integer division) and checking where the quotient increases?
0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2
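
As a rough host-side sketch of that idea, assuming non-negative integer inputs: the helper names prefix_buckets and bucket_boundaries and the sample input are invented for illustration, and the bucket width of 20 is the value from the question above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// bucket[i] = (in[0] + ... + in[i]) / width, using integer division.
// Assumes non-negative inputs and width > 0.
inline std::vector<int> prefix_buckets(const std::vector<int>& in, int width) {
    std::vector<int> bucket(in.size());
    std::partial_sum(in.begin(), in.end(), bucket.begin());  // inclusive prefix sum
    for (int& v : bucket) v /= width;
    return bucket;
}

// starts[i] is true where the bucket index increases relative to element i-1.
inline std::vector<bool> bucket_boundaries(const std::vector<int>& bucket) {
    std::vector<bool> starts(bucket.size(), false);
    for (std::size_t i = 1; i < bucket.size(); ++i)
        starts[i] = bucket[i] > bucket[i - 1];
    return starts;
}

// Worked example (hypothetical input, width 20).
inline void bucket_example() {
    const std::vector<int> in = {5, 7, 6, 4, 9, 3, 12, 2};
    // prefix sums: 5 12 18 22 31 34 46 48
    const std::vector<int> bucket = prefix_buckets(in, 20);
    assert((bucket == std::vector<int>{0, 0, 0, 1, 1, 1, 2, 2}));
    const std::vector<bool> starts = bucket_boundaries(bucket);
    assert((starts == std::vector<bool>{false, false, false, true,
                                        false, false, true, false}));
}
```

Both steps (a plain prefix sum, then an elementwise divide and neighbour compare) parallelise easily. Note, though, that the bucket boundaries are not guaranteed to match the restart points of the sequential rule: for the input 12 12 12 12 with width 20 the quotients are 0 1 1 2, whereas the sequential loop restarts at every element after the first.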