Reduction routines with thread and block coarsening

I was going through the slides on implementing reduction routines that compute the sum of the elements of a vector (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).
While reading around the topic, I came across the concepts of thread coarsening and block coarsening.
I already have a CUDA program that implements all of the reductions described in those slides.
Could anyone please point me to sample code for a reduction routine implemented with thread and block coarsening?
Any help in this regard would be highly appreciated.

Thanks in advance.

In case anyone is wondering what thread and block coarsening mean, I believe they are defined here:

https://dl.acm.org/citation.cfm?id=3194242

https://stackoverflow.com/questions/59052132/block-coarsening-with-cuda-reduction-kernel-produce-incorrect-results
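
Not a link, but here is a minimal sketch of how I understand the two ideas, in case it helps. It is not taken from the slides or from the references above. Thread coarsening here means each thread accumulates several elements in a register before the shared-memory tree reduction; block coarsening means each block walks over several tiles of the input instead of exactly one. The names coarsened_sum_kernel, coarsened_sum, BLOCK_SIZE and COARSEN are my own placeholders, the block size is assumed to be a power of two, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK_SIZE = 256;  // threads per block (assumed to be a power of two)
constexpr int COARSEN    = 4;    // elements each thread loads per tile (thread coarsening)

// Hypothetical kernel: every block processes several tiles (block coarsening),
// and every thread sums COARSEN elements per tile in a register (thread coarsening).
__global__ void coarsened_sum_kernel(const float* in, float* out, size_t n)
{
    __shared__ float smem[BLOCK_SIZE];
    const size_t tile = (size_t)BLOCK_SIZE * COARSEN;  // elements covered by one block per step
    float acc = 0.0f;

    // Block coarsening: stride over the input one "grid of tiles" at a time.
    for (size_t base = blockIdx.x * tile; base < n; base += gridDim.x * tile) {
        // Thread coarsening: consecutive threads read consecutive addresses (coalesced).
        #pragma unroll
        for (int k = 0; k < COARSEN; ++k) {
            size_t i = base + k * BLOCK_SIZE + threadIdx.x;
            if (i < n) acc += in[i];  // bounds check covers the ragged last tile
        }
    }

    // Standard shared-memory tree reduction of the per-thread partial sums.
    smem[threadIdx.x] = acc;
    __syncthreads();
    for (int s = BLOCK_SIZE / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }

    // One atomic per block combines the block results into the final sum.
    if (threadIdx.x == 0) atomicAdd(out, smem[0]);
}

// Hypothetical host wrapper; num_blocks is deliberately much smaller than
// n / (BLOCK_SIZE * COARSEN) so that each block handles several tiles.
float coarsened_sum(const std::vector<float>& h, int num_blocks = 64)
{
    if (h.empty()) return 0.0f;
    float *d_in = nullptr, *d_out = nullptr, result = 0.0f;
    cudaMalloc(&d_in, h.size() * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));  // all-zero bytes is 0.0f
    coarsened_sum_kernel<<<num_blocks, BLOCK_SIZE>>>(d_in, d_out, h.size());
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling coarsened_sum(std::vector<float>(1000003, 1.0f)) exercises the ragged last tile, since the length is not a multiple of BLOCK_SIZE * COARSEN. With general floating-point data the result can differ slightly from a sequential sum, because the order of the additions (and of the atomicAdd calls) is not fixed.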