Thrust reduce_by_key performance question

Hi,
I’ve compiled the sample code sum_rows.cu from [url]https://github.com/thrust/thrust/tree/master/examples[/url] and ran it through nvprof.
I see the same thing there as in my own project: the kernels and copies account for only about half of the overall run time. Is there anything I can do to reduce the time this operation takes?
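
For reference, here is a minimal sketch of the kind of measurement involved, loosely modelled on sum_rows.cu. The matrix size, the helper name timed_row_sum and the host-side timing are illustrative assumptions, not the exact code from the example or from my project. It times the same reduce_by_key call twice from the host, which may help show whether the gap is one-time setup or per-call overhead.

```cuda
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

// Maps a linear index into a row-major R x C matrix to its row index.
struct linear_index_to_row_index
{
    typedef int result_type;
    int C; // number of columns

    __host__ __device__ int operator()(int i) const { return i / C; }
};

// Hypothetical helper: runs one row-sum and returns the host wall-clock
// time in milliseconds, including the wait for the GPU to finish.
static double timed_row_sum(const thrust::device_vector<int>& data,
                            thrust::device_vector<int>& row_sums,
                            int R, int C)
{
    // Keys are generated on the fly: element i belongs to row i / C.
    auto row_keys = thrust::make_transform_iterator(
        thrust::counting_iterator<int>(0), linear_index_to_row_index{C});

    auto t0 = std::chrono::steady_clock::now();
    thrust::reduce_by_key(row_keys, row_keys + R * C,      // input keys
                          data.begin(),                    // input values
                          thrust::make_discard_iterator(), // output keys (unused)
                          row_sums.begin(),                // output values
                          thrust::equal_to<int>(),
                          thrust::plus<int>());
    cudaDeviceSynchronize(); // make sure all GPU work is included
    auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
    const int R = 1000; // rows (illustrative size)
    const int C = 1000; // columns (illustrative size)

    thrust::device_vector<int> data(R * C, 1);
    thrust::device_vector<int> row_sums(R);

    // Same call twice: the first may include one-time setup costs.
    double first_ms  = timed_row_sum(data, row_sums, R, C);
    double second_ms = timed_row_sum(data, row_sums, R, C);

    std::printf("first call:  %.3f ms\n", first_ms);
    std::printf("second call: %.3f ms\n", second_ms);
    return 0;
}
```

This should build with something like `nvcc -O2 row_sum_timing.cu -o row_sum_timing` (the file name is just a placeholder). The host-side times can then be compared against the kernel and copy times that nvprof reports.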

I’m running on Linux with a V100 card.

(I have a screenshot but couldn’t find how to upload it :) )

thanks