Compare two hashes

Hello. I have a procedure that compares two hashes, each 32 bytes long. The result is 1 if the first hash (%rd0-%rd3) is less than the second hash (%rd26-%rd29), and 0 if it is equal to or greater than the second.

	mov.u64    	%rA10,0x00;				// equal
	mov.u64    	%rA11,0x01;				// less
	mov.u64    	%rA12,0x02;				// greater
	setp.hi.u64 p,%rA0,%rA26;
	selp.u64 	%rA11,%rA12,%rA11,p;
	setp.eq.u64 p,%rA0,%rA26;
	selp.u64 	%rA10,%rA10,%rA11,p;
	setp.hi.u64 p,%rA1,%rA27;
	selp.u64 	%rA11,%rA12,%rA11,p;
	setp.eq.u64 p,%rA1,%rA27;
	selp.u64 	%rA10,%rA10,%rA11,p;
	setp.hi.u64 p,%rA2,%rA28;
	selp.u64 	%rA11,%rA12,%rA11,p;
	setp.eq.u64 p,%rA2,%rA28;
	selp.u64 	%rA10,%rA10,%rA11,p;
	setp.hi.u64 p,%rA3,%rA29;
	selp.u64 	%rA11,%rA12,%rA11,p;
	setp.eq.u64 p,%rA3,%rA29;
	selp.u64 	%rA10,%rA10,%rA11,p;

But the speed is not impressive. Can anyone suggest any optimizations?
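For reference, the result described in this post (1 if the first hash is less, 0 otherwise) can be sketched in C++ as below. This is only an illustration of the intended behaviour, not a translation of the PTX. The name `hash_less` is made up for the example, and word 0 is assumed to hold the most significant 64 bits of each hash.

```cpp
#include <cstdint>

// Returns 1 if a < b and 0 if a >= b, treating each hash as a 256-bit
// unsigned integer stored in four 64-bit words.
// Assumption: word 0 is the most significant word.
inline int hash_less(const uint64_t a[4], const uint64_t b[4]) {
    for (int i = 0; i < 4; ++i) {
        if (a[i] != b[i]) {
            // The first differing word decides the whole comparison.
            return a[i] < b[i] ? 1 : 0;
        }
    }
    return 0;  // all four words are equal
}
```

A reference like this is handy for checking a hand-written device version against known inputs before trying to make it faster.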

Another procedure multiplies a 64-bit register (%rM). The register holds two 32-bit numbers, and each of them must be multiplied by 0x01000193. I wrote this function:

	mul.lo.u64 	%rt0,%rM,0x01000193;
	and.b64		%rt0,%rt0,0xffffffff;
	shr.b64  	%rt1,%rM, 32;
	mul.lo.u64 	%rt1,%rt1,0x01000193;
	shl.b64  	%rt1,%rt1, 32;
	xor.b64		%rM,%rt0,%rt1;

Its performance also disappoints me.
Does anyone have ideas for optimization?
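To make the intent of the packed multiplication easier to check, here is a small C++ sketch that follows the same steps as the PTX sequence and compares them with two independent 32-bit multiplications. The function names `mul_packed` and `mul_packed_ref` are made up for the example.

```cpp
#include <cstdint>

constexpr uint64_t FNV_PRIME = 0x01000193;

// Multiplies both 32-bit halves of m by FNV_PRIME, each modulo 2^32,
// using the same steps as the PTX sequence.
inline uint64_t mul_packed(uint64_t m) {
    // Low half: the upper 32 bits of m cannot influence bits 0-31 of the product.
    uint64_t lo = (m * FNV_PRIME) & 0xffffffffULL;
    // High half: the left shift discards product bits above 2^32.
    uint64_t hi = ((m >> 32) * FNV_PRIME) << 32;
    // The halves do not overlap, so OR gives the same result as the XOR in the PTX.
    return lo | hi;
}

// Reference version: two separate 32-bit multiplications.
inline uint64_t mul_packed_ref(uint64_t m) {
    uint32_t lo = static_cast<uint32_t>(m) * static_cast<uint32_t>(FNV_PRIME);
    uint32_t hi = static_cast<uint32_t>(m >> 32) * static_cast<uint32_t>(FNV_PRIME);
    return (static_cast<uint64_t>(hi) << 32) | lo;
}
```

Both functions should agree for every input. The sketch only checks the arithmetic and says nothing about which form runs faster on the GPU.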

Can you show the original C/C++ code, please? Your verbal descriptions seem to deviate from the code snippets you are showing. What architecture are you compiling for? Have you inspected the SASS produced by PTXAS? Remember that PTXAS is an optimizing compiler, not an assembler.

Note that select-type instructions (or ternary operator ?:) are not always the most efficient way to make selections. Sometimes it is better to provide a default result and conditionally override it with an if-statement, letting the PTXAS optimizer perform if-conversion by using predicated instructions. This frequently saves one instruction. E.g. instead of

d = (x < y) ? (a + b) : (a + c)  // two adds, one comparison, one select

you would use

d = (a + c)            // one add
if (x < y) d = a + b   // one comparison, one conditional add
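As a compilable illustration of the two styles, here is a minimal C++ sketch. The function names are made up, and whether the second form really saves an instruction depends on the compiler and target, so the generated code should be inspected rather than assumed.

```cpp
#include <cstdint>

// Style 1: ternary select. As written: two adds, one comparison, one select.
inline uint32_t pick_select(uint32_t x, uint32_t y,
                            uint32_t a, uint32_t b, uint32_t c) {
    return (x < y) ? (a + b) : (a + c);
}

// Style 2: compute a default, then conditionally override it.
// The default must be the value for the case where the condition is false.
inline uint32_t pick_override(uint32_t x, uint32_t y,
                              uint32_t a, uint32_t b, uint32_t c) {
    uint32_t d = a + c;        // default result (x >= y)
    if (x < y) d = a + b;      // override, a candidate for a predicated add
    return d;
}
```

The two functions return the same value for all inputs; only the shape offered to the optimizer differs.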

By the way, why are your variables in the hash computation 64-bit quantities? The FNV hashes I am looking at use 0x01000193 as the prime multiplier for a 32-bit hash. In other words:

uint32_t fnv1a (uint8_t byte, uint32_t hash = SEED) { return (byte ^ hash) * PRIME; }

where PRIME = 0x01000193, and SEED = [value of your choice], for example.
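For completeness, a self-contained C++ sketch of that one-liner applied to a byte buffer might look as follows. The helper `fnv1a_buffer` is hypothetical, and SEED is assumed here to be the standard 32-bit FNV offset basis 0x811C9DC5; any other seed can be substituted.

```cpp
#include <cstddef>
#include <cstdint>

constexpr uint32_t PRIME = 0x01000193;
constexpr uint32_t SEED  = 0x811C9DC5;  // assumption: standard 32-bit FNV offset basis

// One FNV-1a step: XOR the byte into the hash, then multiply by the prime.
inline uint32_t fnv1a(uint8_t byte, uint32_t hash = SEED) {
    return (byte ^ hash) * PRIME;
}

// Hypothetical helper: folds a whole buffer through fnv1a, byte by byte.
inline uint32_t fnv1a_buffer(const uint8_t* data, std::size_t len) {
    uint32_t hash = SEED;
    for (std::size_t i = 0; i < len; ++i) {
        hash = fnv1a(data[i], hash);
    }
    return hash;
}
```

With this seed, hashing the single byte 'a' gives 0xE40C292C. Everything stays in 32-bit arithmetic, which is the point of the question above.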

I do not use C/C++ code. I write the host code in PureBasic and the device code in PTX only. I target the sm_30 architecture. I have not inspected the SASS code…
I use 64-bit arithmetic and registers because I tried different options, including 32-bit, and in the end 64-bit arithmetic gave the highest rate.
That is, I have three functions: the first is SHA3 (Keccak) 512, the second is FNV, and the third is SHA3 (Keccak) 256. SHA3-512 produces a hash. Calculations are then performed on that hash using the FNV function, and then the data is compressed and hashed again with SHA3-256.
I need to compare the resulting 32-byte hash with the sample.
All functions are tested and working. I recently wrote a similar program for the CPU, and its result was 30% higher than that of the standard program. I want to achieve the same result on CUDA.

But so far my result does not impress me. Perhaps I need to use some caching instructions, "prefetch" for example, or something else. CUDA is new to me; until now I have worked only with the CPU.

Very strange behavior. I wrote a function that multiplies a 32-bit number by the 32-bit constant 0x01000193. I decided not to use mul.u32, because the execution speed of that instruction depends on the number of warps.
So I used mul24.u32 instead, and the total speed went down. How can this be?

mov.u32 	number,0xa4b13027; //any 32 bit unsigned integer
	// multiply bits 24-31
	shr.b32		temp,number,24;
	mul24.lo.u32 temp2,temp,8388809;	
	mul24.lo.u32 temp,temp,8388810;
	add.u32		temp2,temp,temp2;	
	shl.b32		temp2,temp2,24;
	// multiply bits 0-23
	mul24.lo.u32 temp,number,8388809;	
	add.u32		temp2,temp,temp2;
	mul24.lo.u32 temp,number,8388810;	
	add.u32		number,temp,temp2;

Each instruction needs 4 cycles per warp. Total: 9 * 4 = 36 cycles/warp.
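The arithmetic of this decomposition can be checked on the host with a short C++ sketch. The helper `mul24_lo` is an emulation written for this example, based on the documented behaviour of mul24.lo.u32 (multiply the low 24 bits of each operand, keep the low 32 bits of the 48-bit product); `mul_prime_mul24` is a made-up name for the sequence above.

```cpp
#include <cstdint>

// Host-side emulation of PTX mul24.lo.u32: multiplies the low 24 bits of
// each operand and returns the low 32 bits of the 48-bit product.
inline uint32_t mul24_lo(uint32_t a, uint32_t b) {
    uint64_t p = static_cast<uint64_t>(a & 0xffffffu) * (b & 0xffffffu);
    return static_cast<uint32_t>(p);
}

// Mirrors the PTX sequence: the constant 0x01000193 is split into
// 8388809 + 8388810 so that each factor fits into 24 bits.
inline uint32_t mul_prime_mul24(uint32_t number) {
    const uint32_t c1 = 8388809;
    const uint32_t c2 = 8388810;  // c1 + c2 == 0x01000193
    // Bits 24-31 of the input, multiplied and shifted back into place.
    uint32_t hi  = number >> 24;
    uint32_t acc = (mul24_lo(hi, c1) + mul24_lo(hi, c2)) << 24;
    // Bits 0-23 of the input; mul24_lo ignores the upper byte by itself.
    acc += mul24_lo(number, c1);
    acc += mul24_lo(number, c2);
    return acc;
}
```

For any input the result should equal `number * 0x01000193u` in plain 32-bit arithmetic, so the sequence is correct. The sketch does not explain the slowdown; the SASS generated for the mul24 instructions on the target architecture would have to be inspected for that.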