Is full int64 multiplication supported on Fermi and GT200? And what about 64-bit multiplication speed, e.g. int64 x int32? Thanks.

CUDA supports (unsigned) long long and operations on it on all supported GPUs. On sm_20 devices the performance is quite good, due to improved HW support. If I recall correctly, sm_20 does a 64x64->64 bit multiply in about 4x the time of a 32x32->32 bit multiply. CUDA also supports __{u}mul64hi() device functions for cases that require the upper half of a 64-bit multiply; these take about 10x the time of a 32x32->32 bit multiply on sm_20. As I don't know what algorithm you have in mind, I would suggest simply giving it a try. In my experience, timing the actual code usually beats trying to reason about performance from the paper specification.

Thanks! I am choosing between multiplication using doubles and int64. I wanted to make sure int64 works; I could not check it on a computer right now.

Depending on what your algorithm is doing, double may still be a good alternative, since FMA gives efficient access to the full double-wide product. So you may want to code the kernel up both ways, find out which kernel works better on a given GPU through a calibration step at startup, then use the "optimal" kernel for the duration of the application.

I need full precision. With doubles I am limited to smaller numbers; that was the point of the question. With full 64-bit support I do not need to worry about operand size. If int64 were unsupported on older GPUs, or very slow, I would have to use doubles.
