32-bit multiplication and 64-bit registers

I see in the PTX ISA Version 1.2 document that it’s possible to multiply two 32-bit numbers and get a 64-bit result. For example:

[indent]mul.wide.u32 u, a, b; // u = a*b[/indent]
Is this 64-bit result stored in a single physical 64-bit hardware register?

Do all hardware devices support this operation? The documentation only mentions that sm_13 is required for .f64 but doesn’t mention anything about .wide.

In general, are the 64-bit operand types supported as physical 64-bit registers on all devices?
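For reference, here is the sort of CUDA C I would expect to compile down to mul.wide.u32 (just a sketch; the kernel and parameter names are made up for illustration):

[codebox]// Sketch: a widening 32x32 -> 64-bit multiply in CUDA C. Casting one
// operand to 64 bits before multiplying is the usual way to request the
// full-width product (hypothetical kernel, not taken from the ISA document).
__global__ void wideMul(const unsigned int *a, const unsigned int *b,
                        unsigned long long *u)
{
    *u = (unsigned long long)(*a) * (*b);   // u = a*b, full 64-bit result
}[/codebox]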

Yes, the hardware has 64-bit registers, but they are just two ordinary 32-bit registers paired together.

To get the details, compile some code and look at it in decuda (since PTX isn't real machine code). There is no single-cycle multiplication of even 32-bit ints in the hardware, so it would be interesting to see how it's actually implemented.

Here's what I found. As you said, there doesn't seem to be any hardware support for 32×32-bit integer multiplication, or if there is, PTXAS doesn't know about it.

Input code fed into PTXAS:

[codebox].version 1.2
.target sm_13

.global .u32 ga;
.global .u32 gb;
.global .u64 gu;

// 16x16 -> 32-bit widening multiply
.entry MULT16X16WIDE
{
    .reg .u16 a;
    .reg .u16 b;
    .reg .u32 u;
    ld.global.u16  a, [ga];
    ld.global.u16  b, [gb];
    mul.wide.u16   u, a, b;
    st.global.u32  [gu], u;
}

// 32x32 -> 64-bit widening multiply
.entry MULT32X32WIDE
{
    .reg .u32 a;
    .reg .u32 b;
    .reg .u64 u;
    ld.global.u32  a, [ga];
    ld.global.u32  b, [gb];
    mul.wide.u32   u, a, b;
    st.global.u64  [gu], u;
}

// 32x32 multiply with low and high 32-bit halves computed separately
.entry MULT32X32HILO
{
    .reg .u32 a;
    .reg .u32 b;
    .reg .u32 u;
    .reg .u32 v;
    ld.global.u32  a, [ga];
    ld.global.u32  b, [gb];
    mul.lo.u32     u, a, b;
    mul.hi.u32     v, a, b;
    st.global.u32  [gu], u;
    st.global.u32  [gu+4], v;
}
[/codebox]

Here is what DECUDA produced from the cubin file that PTXAS generated from that:

[codebox]// Disassembling MULT16X16WIDE (0)
000000: 10000001 2780c780 mov.b32 $r0, c14[0x0000]
000008: 10000209 2780c780 mov.b32 $r2, c14[0x0004]
000010: d00e0005 80400780 mov.u16 $r1, g[$r0]
000018: d00e0401 80400780 mov.u16 $r0, g[$r2]
000020: 10000409 2780c780 mov.b32 $r2, c14[0x0008]
000028: 40000401 00000780 mul24.lo.u32.u16.u16 $r0, $r1.lo, $r0.lo
000030: d00e0401 a0c00781 mov.end.u32 g[$r2], $r0

// Disassembling MULT32X32WIDE (1)
000000: 10000001 2780c780 mov.b32 $r0, c14[0x0000]
000008: 10000205 2780c780 mov.b32 $r1, c14[0x0004]
000010: d00e0001 80c00780 mov.u32 $r0, g[$r0]
000018: d00e0205 80c00780 mov.u32 $r1, g[$r1]
000020: 40030009 00000780 mul24.lo.u32.u16.u16 $r2, $r0.lo, $r1.hi
000028: 6002020d 000087c0 mad24.lo.u32.u16.u16.u32 $p0|$r3, $r0.hi, $r1.lo, $r2
000030: 30100611 c4100780 shl.u32 $r4, $r3, 0x00000010
000038: 60020209 00008780 mad24.lo.u32.u16.u16.u32 $r2, $r0.hi, $r1.lo, $r2
000040: 3010060d e4100780 shr.u32 $r3, $r3, 0x00000010
000048: 600201fd 000107d8 mad24.lo.u32.u16.u16.u32 $p1|$o127, $r0.lo, $r1.lo, $r4
000050: 30100409 c4100780 shl.u32 $r2, $r2, 0x00000010
000058: 2100060d 04400880 @$p0.cf add.u32 $r3, $r3, c1[0x0000]
000060: 60020009 00008780 mad24.lo.u32.u16.u16.u32 $r2, $r0.lo, $r1.lo, $r2
000068: 6003020d 0c00d780 mad24.lo.u32.u16.u16.u32 $r3, -$r0.hi, $r1.hi, -$r3
000070: 10000401 2780c780 mov.b32 $r0, c14[0x0008]
000078: d00e0009 a0800781 mov.end.b64 g[$r0], $r2
// segment: const (1:0000)
0000: 00010000

// Disassembling MULT32X32HILO (2)
000000: 10000001 2780c780 mov.b32 $r0, c14[0x0000]
000008: 10000205 2780c780 mov.b32 $r1, c14[0x0004]
000010: d00e0001 80c00780 mov.u32 $r0, g[$r0]
000018: d00e0205 80c00780 mov.u32 $r1, g[$r1]
000020: 40030009 00000780 mul24.lo.u32.u16.u16 $r2, $r0.lo, $r1.hi
000028: 6002020d 00008780 mad24.lo.u32.u16.u16.u32 $r3, $r0.hi, $r1.lo, $r2
000030: 60020209 000087c0 mad24.lo.u32.u16.u16.u32 $p0|$r2, $r0.hi, $r1.lo, $r2
000038: 3010060d c4100780 shl.u32 $r3, $r3, 0x00000010
000040: 30100415 c4100780 shl.u32 $r5, $r2, 0x00000010
000048: 30100409 e4100780 shr.u32 $r2, $r2, 0x00000010
000050: 60020011 0000c780 mad24.lo.u32.u16.u16.u32 $r4, $r0.lo, $r1.lo, $r3
000058: 600201fd 000147d8 mad24.lo.u32.u16.u16.u32 $p1|$o127, $r0.lo, $r1.lo, $r5
000060: 1000040d 2780c780 mov.b32 $r3, c14[0x0008]
000068: 21000409 04400880 @$p0.cf add.u32 $r2, $r2, c1[0x0000]
000070: 10048015 00000003 mov.b32 $r5, 0x00000004
000078: d00e0611 a0c00780 mov.u32 g[$r3], $r4
000080: 60030205 0c009780 mad24.lo.u32.u16.u16.u32 $r1, -$r0.hi, $r1.hi, -$r2
000088: 21000a01 07808780 add.u32 $r0, $r5, c2[0x0008]// (unk1 03000000)
000090: d00e0005 a0c00781 mov.end.u32 g[$r0], $r1
// segment: const (1:0000)
0000: 00010000
[/codebox]

So it seems clear that any attempt to multiply two 32-bit numbers actually expands into several smaller multiplications. Maybe I can hand-code something a little more efficient than this…
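For what it's worth, here is roughly the same decomposition written out in CUDA C, splitting the operands into 16-bit halves the way the generated code does (just an illustration under my own naming, not claiming it beats what ptxas emits; __umulhi is the intrinsic that maps to mul.hi.u32):

[codebox]// Sketch: 32x32 -> 64-bit multiply built from 16-bit halves, mirroring
// the mul24/mad24 sequence in the disassembly. Function name is hypothetical.
__device__ unsigned long long mul32x32wide(unsigned int a, unsigned int b)
{
    unsigned int a_lo = a & 0xFFFFu, a_hi = a >> 16;
    unsigned int b_lo = b & 0xFFFFu, b_hi = b >> 16;

    unsigned long long lo    = (unsigned long long)a_lo * b_lo;  // low partial product
    unsigned long long cross = (unsigned long long)a_lo * b_hi
                             + (unsigned long long)a_hi * b_lo;  // two cross products
    unsigned long long hi    = (unsigned long long)a_hi * b_hi;  // high partial product

    return lo + (cross << 16) + (hi << 32);
}

// For the hi/lo variant, both halves are reachable directly from CUDA C:
//   unsigned int lo32 = a * b;             // mul.lo.u32
//   unsigned int hi32 = __umulhi(a, b);    // mul.hi.u32
[/codebox]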

And what happens during 64-bit addition?

Also, I noticed you tried to dereference a 64-bit pointer, but ptxas just ignored it. Are you on a 64-bit OS?

I've been working on bignum arithmetic on the GPU, and unfortunately, yes, integer multiplication is really slow…

I also tend to believe that the sequence ptx generates is close to optimal. It's not easy to improve on, and one reason is that PTX does not give explicit access to the hardware's carry/status flags (unlike x86 assembler). For instance, 'mad24' instructions can generate optional carry flags which can then be used in subsequent add instructions. From your example:

mad24.lo.u32.u16.u16.u32 $p0|$r2, $r0.hi, $r1.lo, $r2

@$p0.cf add.u32 $r2, $r2, c1[0x0000] ← add if carry flag ‘cf’ is set
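Since PTX hides that carry flag, multi-word addition has to reconstruct the carries in software. A minimal sketch of what that looks like in CUDA C (the function name and limb layout are my own assumptions, purely for illustration):

[codebox]// Sketch: schoolbook multi-word addition with manual carry propagation,
// i.e. redoing in software what '@$p0.cf add.u32' gets from the carry flag.
__device__ void bignum_add(unsigned int *r, const unsigned int *a,
                           const unsigned int *b, int nLimbs)
{
    unsigned int carry = 0;
    for (int i = 0; i < nLimbs; ++i) {
        unsigned int s = a[i] + carry;
        carry  = (s < carry) ? 1u : 0u;   // carry out of a[i] + carry
        r[i]   = s + b[i];
        carry += (r[i] < s) ? 1u : 0u;    // carry out of s + b[i]
    }
    // 'carry' now holds the carry out of the most significant limb
}[/codebox]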

Although, on sm_13 architectures (which have double precision) one might think of an old trick, where the 52 most-significant bits of the 32-bit-wide multiplication are computed in floating point and the remaining 12 least-significant bits come from an ordinary mad24 multiplication…
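On that trick: the easy case is when the operands are small enough that the double-precision product is exact, i.e. fits in the 53-bit mantissa. A minimal sketch of just that property (the 26-bit operand limit and the helper name are my own simplification, not the 52+12-bit split described above, which additionally needs a carry-reconciliation step):

[codebox]// Sketch: for operands below 2^26 the double-precision product is exact,
// since it fits in the 53-bit mantissa. Helper name is hypothetical.
__device__ unsigned long long mul_exact_in_double(unsigned int a, unsigned int b)
{
    // assumes a, b < (1u << 26); otherwise the result may be rounded
    double p = (double)a * (double)b;    // exact, since p < 2^52
    return (unsigned long long)p;
}[/codebox]

With full 32-bit operands the double product is only good to 53 bits, which is why the low bits have to come from mad24 in the scheme above.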

However, I've heard that double precision on sm_13 is much slower than single precision. Is this true?

There is 1 DP unit, as opposed to 8 SP units on a multiprocessor, so it is 8 times slower.

Wow, that looks really slow… It means that one double-precision multiplication costs about as much as two 32-bit integer multiplications, if we assume an integer multiply takes 16 clock cycles.