Why is this uint64_t subtraction and bit shift generating a bottleneck?

It seems to me your major bottleneck is this:

I am not sitting in front of the profiler, but I believe “Mio” stands for memory I/O. If so, the cores are stalled waiting to insert new memory operations into the I/O queue, which would suggest your code is bound by memory throughput rather than by computation. Since I cannot tell what your code does overall, I would simply suggest double-checking its memory access behavior.
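One rough way to double-check the memory aspect is to time a loop that does nothing but the subtraction and shift, and see how much data per second it moves. The sketch below is only an illustration under assumptions: `sub_shift`, `measure_sub_shift`, the shift amount of 3 and the input pattern are all made up here, since I have not seen your actual code, and it runs on the host CPU rather than on whatever device your profiler was attached to.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the operation in question: one subtraction and
// one shift per element, so nearly all the cost is loading and storing data.
inline std::uint64_t sub_shift(std::uint64_t a, std::uint64_t b) {
    return (a - b) >> 3;
}

struct BandwidthResult {
    double seconds;          // wall-clock time of the timed loop
    double gib_per_second;   // (2 loads + 1 store per element) / time
    std::uint64_t checksum;  // sum of outputs, keeps the loop from being discarded
};

// Times the sub/shift loop over n elements and reports the effective bandwidth.
BandwidthResult measure_sub_shift(std::size_t n) {
    assert(n > 0);
    std::vector<std::uint64_t> a(n), b(n), out(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 16 * i + 8;  // chosen so that sub_shift(a[i], b[i]) == i + 1
        b[i] = 8 * i;
    }

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = sub_shift(a[i], b[i]);
    }
    const auto stop = std::chrono::steady_clock::now();

    std::uint64_t checksum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        checksum += out[i];
    }

    const double seconds = std::chrono::duration<double>(stop - start).count();
    const double bytes = 3.0 * static_cast<double>(n) * sizeof(std::uint64_t);
    const double gib = seconds > 0.0 ? bytes / seconds / (1024.0 * 1024.0 * 1024.0) : 0.0;
    return BandwidthResult{seconds, gib, checksum};
}
```

Call `measure_sub_shift` with an array much larger than the last-level cache and compare `gib_per_second` with the peak memory bandwidth of your hardware. If the number is already close to that peak, the arithmetic is not the limit and making the subtraction or shift cheaper will not help much.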