64-bit integer operations are emulated on all NVIDIA GPUs. Their exact performance differs by GPU architecture and may even depend on the surrounding code context. It is therefore best to measure the performance in the actual context, rather than relying on estimates.

To estimate the performance, consider that 64-bit integer addition, subtraction, negation, and logical operations are each emulated with two 32-bit arithmetic or logic instructions. A 64-bit multiply is an inlined instruction sequence roughly equivalent to four 32-bit integer multiplies for the low-order 64 bits of the result, and about twice that for the high-order 64 bits of the (128-bit) full product, i.e. __umul64hi(). 64-bit integer division and modulo are calls to subroutines roughly equivalent to forty 32-bit multiplies plus forty 32-bit arithmetic / logic instructions.
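To make the instruction counts above concrete, here is a host-side C sketch of the kind of 32-bit sequences involved (function names are mine, and this models the hardware's carry chain and 32-bit mul.lo / mul.hi instructions rather than reproducing the compiler's actual emulation code):

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit addition from two 32-bit additions, with the carry
   propagated manually (as a hardware ADD.CC / ADDC pair would). */
static uint64_t add64_emul(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint32_t lo = a_lo + b_lo;          /* first 32-bit add */
    uint32_t carry = lo < a_lo;         /* carry out of low half */
    uint32_t hi = a_hi + b_hi + carry;  /* second 32-bit add */
    return ((uint64_t)hi << 32) | lo;
}

/* Low 64 bits of a 64x64-bit product from four 32-bit multiplies:
   the low and high halves of a_lo*b_lo, plus the low halves of the
   two cross products. The a_hi*b_hi term only affects the upper
   64 bits and is not needed here. */
static uint64_t mul64_lo_emul(uint64_t a, uint64_t b)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint32_t p_lo = a_lo * b_lo;                               /* mul.lo */
    uint32_t p_hi = (uint32_t)(((uint64_t)a_lo * b_lo) >> 32); /* mul.hi */
    p_hi += a_lo * b_hi;                                       /* mul.lo */
    p_hi += a_hi * b_lo;                                       /* mul.lo */
    return ((uint64_t)p_hi << 32) | p_lo;
}
```

Counting the multiplies in mul64_lo_emul() gives the "about four" figure quoted above; producing the high-order 64 bits as well requires all four 32x32-bit partial products in full, hence roughly double the work.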

[Later:]

I noticed I left out shift operations. On recent GPUs, thanks to the presence of a funnel shifter, a 64-bit shift is equivalent to about four 32-bit integer adds in the general case. On older GPUs it was roughly equivalent to ten 32-bit integer arithmetic / logic instructions in the general case.
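As a rough illustration of how a funnel shifter helps, here is a C model of a general 64-bit left shift built on a 32-bit funnel shift (the hardware exposes this via the CUDA __funnelshift_l() intrinsic; the C functions below are my own names, and this is only a sketch of the idea, not the actual generated code):

```c
#include <assert.h>
#include <stdint.h>

/* Model of a 32-bit funnel shift: concatenate hi:lo into a 64-bit
   value, shift left by s (0..31), return the upper 32 bits. */
static uint32_t funnelshift_l_model(uint32_t lo, uint32_t hi, unsigned s)
{
    s &= 31;
    return s ? (hi << s) | (lo >> (32 - s)) : hi;
}

/* General 64-bit left shift (s in 0..63) from a funnel shift, a
   plain 32-bit shift, and selects for counts of 32 or more. */
static uint64_t shl64_emul(uint64_t a, unsigned s)
{
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t lo = (s & 32) ? 0 : (a_lo << (s & 31));
    uint32_t hi = (s & 32) ? (a_lo << (s & 31))
                           : funnelshift_l_model(a_lo, a_hi, s & 31);
    return ((uint64_t)hi << 32) | lo;
}
```

Without a funnel shifter, the upper half for shift counts below 32 must instead be assembled from two separate 32-bit shifts and an OR, which is where the higher instruction count on older GPUs comes from.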