Hello. I have a procedure that compares two hashes, each 32 bytes long. The result should be 1 if the first hash (%rA0-%rA3) is less than the second hash (%rA26-%rA29), and 0 if it is equal to or greater than the second.
mov.u64 %rA10,0x00; // equal
mov.u64 %rA11,0x01; // less
mov.u64 %rA12,0x02; // greater
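// Words are processed from 0 to 3. A later word that differs overrides the
// result, so word 3 (%rA3 / %rA29) is treated as the most significant.
setp.hi.u64 p,%rA0,%rA26;
selp.u64 %rA11,%rA12,0x01,p; // 2 if this word is greater, otherwise 1 (immediate, so a 2 from an earlier word is not carried over)
setp.eq.u64 p,%rA0,%rA26;
selp.u64 %rA10,%rA10,%rA11,p; // keep the previous result when the words are equal
setp.hi.u64 p,%rA1,%rA27;
selp.u64 %rA11,%rA12,0x01,p;
setp.eq.u64 p,%rA1,%rA27;
selp.u64 %rA10,%rA10,%rA11,p;
setp.hi.u64 p,%rA2,%rA28;
selp.u64 %rA11,%rA12,0x01,p;
setp.eq.u64 p,%rA2,%rA28;
selp.u64 %rA10,%rA10,%rA11,p;
setp.hi.u64 p,%rA3,%rA29;
selp.u64 %rA11,%rA12,0x01,p;
setp.eq.u64 p,%rA3,%rA29;
selp.u64 %rA10,%rA10,%rA11,p; // final result: 0 = equal, 1 = less, 2 = greater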
But the speed is not impressive. Does anyone have any ideas for optimizing it?
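
One direction that might be worth trying is a borrow chain: "first hash is less than second hash" is the same as "first minus second produces a borrow out of the top word". Below is a minimal host-side C model of that idea, not PTX. The function name hash_less is hypothetical, and the model assumes that word 3 is the most significant word, which matches the listing above, where later words override earlier ones. In PTX the same chain could possibly be written with sub.cc.u64 followed by subc.cc.u64, assuming the target PTX version supports 64-bit extended-precision subtraction; that mapping is an assumption and is not tested here.

```c
#include <stdint.h>

/* Hypothetical reference model (host-side C, not PTX).
   Assumption: index 3 is the most significant 64-bit word.
   Returns 1 if a < b as a 256-bit unsigned number, otherwise 0. */
static int hash_less(const uint64_t a[4], const uint64_t b[4])
{
    unsigned borrow = 0;
    for (int i = 0; i < 4; ++i) {
        /* Borrow out of a[i] - b[i] - borrow_in: set when a[i] is smaller,
           or when the words are equal and a borrow came in from below. */
        borrow = (a[i] < b[i]) || (a[i] == b[i] && borrow);
    }
    return (int)borrow;
}
```

This yields only the 1/0 answer described at the top of the post, so the separate "equal" and "greater" codes are not needed. Whether it is actually faster on a given GPU would have to be measured.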
I also have another procedure that works on a 64-bit register (%rM). This register holds two 32-bit numbers, and each of them must be multiplied by 0x01000193. I wrote this function:
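mul.lo.u64 %rt0,%rM,0x01000193; // low 64 bits of %rM * 0x01000193
and.b64 %rt0,%rt0,0xffffffff; // keep the low 32 bits: product of the lower number
shr.b64 %rt1,%rM, 32; // move the upper number down
mul.lo.u64 %rt1,%rt1,0x01000193; // multiply the upper number
shl.b64 %rt1,%rt1, 32; // move its low 32 bits back into the upper half
xor.b64 %rM,%rt0,%rt1; // combine the two halves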
Its performance also disappoints me…
Any ideas for optimizing it?
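
A possible simplification, sketched as a host-side C model rather than PTX: since each 32-bit number only needs the low 32 bits of its own product, two independent 32-bit multiplies give the same result as the 64-bit sequence above. The names MULT, mul_lanes_64 and mul_lanes_32 are hypothetical. In PTX this might correspond to unpacking with mov.b64 {%lo,%hi}, %rM, two mul.lo.u32 instructions, and repacking with mov.b64 %rM, {%lo,%hi}, where %lo and %hi are assumed 32-bit registers; that mapping is an untested assumption.

```c
#include <stdint.h>

#define MULT 0x01000193u

/* Model of the six-instruction 64-bit sequence from the post. */
static uint64_t mul_lanes_64(uint64_t m)
{
    uint64_t t0 = (m * MULT) & 0xffffffffull;   /* lower number * MULT, low 32 bits */
    uint64_t t1 = ((m >> 32) * MULT) << 32;     /* upper number * MULT, moved back up */
    return t0 ^ t1;
}

/* Same result using two independent 32-bit multiplies. */
static uint64_t mul_lanes_32(uint64_t m)
{
    uint32_t lo = (uint32_t)m * MULT;           /* wraps modulo 2^32 */
    uint32_t hi = (uint32_t)(m >> 32) * MULT;   /* wraps modulo 2^32 */
    return ((uint64_t)hi << 32) | lo;
}
```

64-bit integer multiplication is typically assembled from 32-bit operations on NVIDIA GPUs, so the 32-bit form may need fewer instructions, but any gain would have to be confirmed by measurement.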