Thrust reduce_by_key performance question

I’ve compiled some sample code (from here) and ran it through nvprof.
The profile shows the same thing I see in my own project: the time actually spent in the kernels and memory copies is only about half of the overall runtime. Is there anything I can do to reduce the time the reduce_by_key operation takes?

I’m running on Linux with a V100 card.

(I have a screenshot but couldn’t figure out how to upload it :) )