int64 support and speed?

Is full int64 multiplication supported on Fermi and GT200? And what about 64-bit multiplication speed, e.g. int64 × int32? Thanks.

CUDA supports (unsigned) long long and operations on it, on all supported GPUs. On sm_20 devices (due to improved HW support), the performance is quite good. If I recall correctly, sm_20 does a 64x64->64 bit multiply in about 4x the time of a 32x32->32 bit multiply. CUDA also supports __{u}mul64hi() device functions for cases that require the upper half of a 64-bit multiply; these take about 10x the time of a 32x32->32 bit multiply on sm_20. As I don't know what algorithm you have in mind, I would suggest simply giving it a try. In my experience, timing actual code usually beats trying to reason about performance from paper specifications.
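For example, here is a minimal sketch of both forms (the kernel name and launch setup are mine, just for illustration): each thread forms the full 128-bit product of two uint64 operands, using the plain 64x64->64 multiply for the low half and __umul64hi() for the high half.

```cpp
// Minimal sketch (names are illustrative): compute the full 128-bit
// product of two unsigned 64-bit operands per thread.
__global__ void mul_full128(const unsigned long long *a,
                            const unsigned long long *b,
                            unsigned long long *lo,
                            unsigned long long *hi,
                            int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        lo[i] = a[i] * b[i];             // low 64 bits of a*b
        hi[i] = __umul64hi(a[i], b[i]);  // high 64 bits of a*b
    }
}
```

Timing this against a 32-bit variant with CUDA events on your actual GPU will answer the speed question more reliably than any spec sheet.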

Thanks! I am choosing between multiplication using doubles and int64. I wanted to make sure int64 works; I cannot check it on a computer right now.

Depending on what your algorithm is doing, double may still be a good alternative since FMA allows efficient access to the full double-wide product. So you may want to code the kernel up both ways, find out which kernel works better on a given GPU through a calibration step at startup, then use the "optimal" kernel for the duration of the application.
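As a sketch of what the FMA trick looks like (the helper name two_prod is mine, not a CUDA built-in; this is the classic TwoProd construction):

```cpp
// Sketch of the FMA-based exact product (often called TwoProd):
// hi + lo == a*b exactly; hi is the rounded product and lo recovers
// the rounding error, because fma() rounds only once.
__device__ void two_prod(double a, double b, double &hi, double &lo)
{
    hi = a * b;
    lo = fma(a, b, -hi);  // exact residual a*b - hi (barring overflow/underflow)
}
```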

I need full precision. With doubles I am limited to smaller numbers; that was the point of the question. With full 64-bit support I do not need to worry about operand size. If it were not supported at all on older GPUs, or were very slow, I would have to use doubles.
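To make that limitation concrete (a host-side illustration with made-up values, not from the thread): a double's 53-bit significand holds integer products exactly only up to 2^53, so a larger product gets rounded while the int64 result stays exact.

```cpp
#include <cstdio>
#include <cstdint>

// Host-side illustration: an integer product needing more than
// 53 significant bits is rounded when computed in double.
int main()
{
    uint64_t a = (1ULL << 30) + 1;           // 2^30 + 1
    uint64_t exact = a * a;                  // 2^60 + 2^31 + 1, fits in 64 bits
    double approx = (double)a * (double)a;   // needs 61 bits -> rounded
    printf("int64  : %llu\n", (unsigned long long)exact);
    printf("double : %.0f\n", approx);       // differs in the last digit
    return 0;
}
```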