What's the fastest way to do long multiplications?

What would be the fastest way to do (exact) multiplications of two integers in CUDA: 32-bit (64-bit result), 48-bit (96-bit result), and 64-bit (128-bit result)?
It’s to be used during matrix multiplication so performance is critically important.

The fastest way is to give each multiplication its own thread, so that different threads handle different multiplications.
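As a rough illustration of the one-thread-per-multiplication idea, here is a minimal sketch. It assumes the operands are stored as `uint64_t` (48-bit values simply have their top 16 bits zero), and the names `U128`, `mul32x32`, `mul64x64`, `mulKernel` and `runSelfCheck` are made up for this example. The only CUDA-specific call is the `__umul64hi` intrinsic, which returns the upper 64 bits of a 64x64-bit product. I have not benchmarked this, so treat it as a starting point rather than a tuned solution.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper type: a 128-bit product stored as two 64-bit halves.
struct U128 { uint64_t lo, hi; };

// 32 x 32 -> 64 bits: widen one operand so the multiply is done in 64 bits.
__device__ inline uint64_t mul32x32(uint32_t a, uint32_t b) {
    return static_cast<uint64_t>(a) * b;
}

// 64 x 64 -> 128 bits: the ordinary multiply gives the low half,
// the __umul64hi intrinsic gives the high half.
// Also covers 48 x 48 -> 96 bits when the top 16 bits of a and b are zero.
__device__ inline U128 mul64x64(uint64_t a, uint64_t b) {
    U128 r;
    r.lo = a * b;
    r.hi = __umul64hi(a, b);
    return r;
}

// Each thread computes exactly one product.
__global__ void mulKernel(const uint64_t* a, const uint64_t* b, U128* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = mul64x64(a[i], b[i]);
}

// Runs three products with results worked out by hand and compares them.
bool runSelfCheck() {
    const int n = 3;
    uint64_t *a = nullptr, *b = nullptr;
    U128* out = nullptr;
    if (cudaMallocManaged(&a, n * sizeof(uint64_t)) != cudaSuccess) return false;
    if (cudaMallocManaged(&b, n * sizeof(uint64_t)) != cudaSuccess) return false;
    if (cudaMallocManaged(&out, n * sizeof(U128)) != cudaSuccess) return false;

    a[0] = b[0] = 1ull << 32;          // 2^32 * 2^32 = 2^64
    a[1] = b[1] = (1ull << 48) - 1;    // largest 48-bit value, squared
    a[2] = b[2] = ~0ull;               // largest 64-bit value, squared

    mulKernel<<<1, 32>>>(a, b, out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    ok = ok && out[0].lo == 0ull                   && out[0].hi == 1ull;
    ok = ok && out[1].lo == 0xFFFE000000000001ull  && out[1].hi == 0xFFFFFFFFull;
    ok = ok && out[2].lo == 1ull                   && out[2].hi == 0xFFFFFFFFFFFFFFFEull;

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return ok;
}

int main() {
    bool ok = runSelfCheck();
    std::printf("%s\n", ok ? "all products match" : "mismatch or CUDA error");
    return ok ? 0 : 1;
}
```

For the 48-bit case the high half always fits in 32 bits, so you can store only the low 32 bits of `hi` if you want a compact 96-bit result. In a real matrix multiplication each thread would typically accumulate many such products, which needs carry handling that this sketch does not show.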