I was going through the slides on implementing reduction routines to compute the sum of the elements of a vector (https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf).
I also came across the concepts of thread coarsening and block coarsening.
I have a CUDA program implementing all of the reductions described in those slides.
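To make the question concrete, here is a minimal sketch of what I understand by thread coarsening; it is not a tuned or reference implementation. Each thread first adds up `COARSE` elements serially, and only then does the block run the usual shared-memory tree reduction. The names `coarsenedSum`, `sumOnGpu`, `BLOCK_DIM`, and `COARSE` are my own placeholders and are not taken from the slides. Combining the per-block results with `atomicAdd` is just one simple choice, and block coarsening is not shown here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_DIM = 256; // threads per block (assumed value)
constexpr int COARSE = 4;      // elements summed serially per thread (assumed value)

// Each block covers COARSE * BLOCK_DIM consecutive input elements.
__global__ void coarsenedSum(const float* in, float* out, int n) {
    __shared__ float partial[BLOCK_DIM];
    int t = threadIdx.x;
    int start = blockIdx.x * COARSE * BLOCK_DIM + t;

    // Thread coarsening: every thread accumulates COARSE elements serially.
    // Stepping by BLOCK_DIM keeps neighbouring threads on neighbouring addresses.
    float sum = 0.0f;
    for (int c = 0; c < COARSE; ++c) {
        int i = start + c * BLOCK_DIM;
        if (i < n) sum += in[i];
    }
    partial[t] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction over the per-thread sums.
    for (int s = BLOCK_DIM / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }

    // One atomic per block folds the block's partial sum into the result.
    if (t == 0) atomicAdd(out, partial[0]);
}

// Hypothetical host wrapper; error checking is omitted to keep the sketch short.
float sumOnGpu(const std::vector<float>& h) {
    int n = static_cast<int>(h.size());
    if (n == 0) return 0.0f;

    float* dIn = nullptr;
    float* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, sizeof(float));

    // Coarsening shrinks the grid by a factor of COARSE.
    int perBlock = COARSE * BLOCK_DIM;
    int blocks = (n + perBlock - 1) / perBlock;
    coarsenedSum<<<blocks, BLOCK_DIM>>>(dIn, dOut, n);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}

int main() {
    std::vector<float> data(1 << 20, 1.0f); // all ones, so the sum should equal the length
    std::printf("gpu sum = %.1f, expected = %zu\n", sumOnGpu(data), data.size());
    return 0;
}
```

With `COARSE = 1` this collapses to one element per thread followed by the plain tree reduction, so the factor is what I would expect to tune.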
Could anyone please point me to sample code for a reduction routine implemented with thread and block coarsening?
Any help in this regard would be highly appreciated.
Thanks in advance.