Need a fast FNV function

Hello everyone. I need help with the FNV function; I am tired of fighting with it. In its simplest form the FNV function looks like this:

fnv(x, y)
    return (x * 0x01000193) ^ y;
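For reference, the same mix written as compilable C (a sketch; `FNV_PRIME` and the function name are my choices, the arithmetic is exactly the line above):

```c
#include <stdint.h>

#define FNV_PRIME 0x01000193u  /* 16777619, the 32-bit FNV prime */

/* 32-bit FNV-style mix: multiply by the prime, then XOR in the next word */
static uint32_t fnv(uint32_t x, uint32_t y) {
    return (x * FNV_PRIME) ^ y;
}
```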

I am writing it in PTX:

	mov.u32         round, 0x00;
$LLBfnv1:
	ld.global.u64   %rM, [mixzero];
	ld.global.u64   %rA, [mixzero+128];
	mul.hi.u64      %rt0, %rM, 0x01000193;
	shl.b64         %rt1, %rM, 32;
	mul.hi.u64      %rt1, %rt1, 0x01000193;
	shl.b64         %rt0, %rt0, 32;
	xor.b64         %rt0, %rt0, %rt1;
	xor.b64         %rM, %rt0, %rA;

	add.u32         round, round, 1;
	setp.lt.u32     p, round, 64;
	@p  bra.uni     $LLBfnv1;

I need to process 128 bytes in 64 rounds. If I split the work across 16 threads in parallel, 2 bytes each, then the result after each round has to be stored, because %rM changes depending on the results of the previous round.

If the intermediate results are stored in shared memory, I can run only 49152/128 = 384 threads simultaneously, which is far too few. At the moment I get 6,800,000 function executions per second on a GTX 660 if I do not parallelize the function itself but instead compute the 128 bytes sequentially in each thread. That way I can avoid storing intermediate state, since the thread itself sees the results for all 128 bytes.
Here is a PureBasic example to show why the results must be visible after each round:

For i = 0 To 63
    p = fnv(i ! ValueL(*s), ValueL(*mix + i % w)) % (n / mixhashes) * mixhashes
Next i
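The same index calculation in C (a sketch; `s0`, `mix`, `w`, `n`, and `mixhashes` are hypothetical stand-ins for the poster's data, and PureBasic's `!` is XOR):

```c
#include <stdint.h>

#define FNV_PRIME 0x01000193u

static uint32_t fnv32(uint32_t x, uint32_t y) {
    return (x * FNV_PRIME) ^ y;
}

/* One round's lookup index: it depends on the current mix contents,
   which is why each round's results must be available to the next. */
static uint32_t page_index(uint32_t i, uint32_t s0, const uint32_t *mix,
                           uint32_t w, uint32_t n, uint32_t mixhashes) {
    return fnv32(i ^ s0, mix[i % w]) % (n / mixhashes) * mixhashes;
}
```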

This is the variant I am using. All the data are in the registers %rA0-%rA15. The only data that must be loaded for the XOR operation are in global memory:

.func _fnv() {
	mov.u64         %rB0, %rd0;    // pointer to the array in global memory
	mov.u32         round, 0x00;
$LLBfnv1:
	ld.global.u64   t, [%rB0];
	mul.hi.u64      %rt0, %rA0, 0x01000193;
	shl.b64         %rt1, %rA0, 32;
	mul.hi.u64      %rt1, %rt1, 0x01000193;
	shl.b64         %rt0, %rt0, 32;
	xor.b64         %rt0, %rt0, %rt1;
	xor.b64         %rA0, %rt0, t;

	ld.global.u64   t, [%rB0+8];
	mul.hi.u64      %rt0, %rA1, 0x01000193;
	shl.b64         %rt1, %rA1, 32;
	mul.hi.u64      %rt1, %rt1, 0x01000193;
	shl.b64         %rt0, %rt0, 32;
	xor.b64         %rt0, %rt0, %rt1;
	xor.b64         %rA1, %rt0, t;

	// and so on for %rA2..%rA14 ...

	ld.global.u64   t, [%rB0+112];
	mul.hi.u64      %rt0, %rA15, 0x01000193;
	shl.b64         %rt1, %rA15, 32;
	mul.hi.u64      %rt1, %rt1, 0x01000193;
	shl.b64         %rt0, %rt0, 32;
	xor.b64         %rt0, %rt0, %rt1;
	xor.b64         %rA15, %rt0, t;

	add.u32         round, round, 1;
	setp.lt.u32     p, round, 64;
	@p  bra.uni     $LLBfnv1;
}

But 6 million function executions per second is still not enough for my set of video cards.

The most interesting part: if the function processes only the 1st 64-bit word, I get 30M function calls per second; for 2 words, 29M; for 3, 25M; and for 4 words it drops immediately to 10M.

[1] I would be surprised if this code benefits from coding at the PTX level, rather than at high-level C++ level. 64-bit data types and __umul64hi() are available at HLL level.

[2] The “simplest form” code does not seem to match what is shown in the PTX code, but in part it is hard to tell because the “simplest code” does not reflect data types.
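One way to see the mismatch in [2]: the scalar form keeps the LOW 32 bits of x * prime, while mul.hi.u64 extracts high bits of a 64-bit product. A small C check (function names are mine, for illustration):

```c
#include <stdint.h>

/* Low half of the 32-bit product, as the "simplest form" implies */
static uint32_t fnv_lo(uint32_t x, uint32_t y) {
    return (uint32_t)(x * 0x01000193u) ^ y;
}

/* High half of the full 64-bit product, as mul.hi would give */
static uint32_t fnv_hi(uint32_t x, uint32_t y) {
    return (uint32_t)(((uint64_t)x * 0x01000193u) >> 32) ^ y;
}
```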

[3] The PTX code itself appears to be memory bound, rather than compute bound; you can use the CUDA profiler to confirm.

[4] I don’t understand the “purebasic” reference.

Yes, it does not particularly matter what language I write in; PTX is simply closer to me. And if you look at what __umul64hi() compiles to in PTX, you would not be surprised either.

C code:

extern "C" __global__ void imul(long long int *input, long long int *output)
{
    long long int a = 0x01000193;
    output[0] = __umul64hi(a, input[0]);
}

Ptx compiled code:

.visible .entry imul(
	.param .u32 imul_param_0,
	.param .u32 imul_param_1
)
{
	.reg .b32 	%r<5>;
	.reg .b64 	%rd<4>;

	ld.param.u32         %r1, [imul_param_0];
	ld.param.u32         %r2, [imul_param_1];
	cvta.to.global.u32   %r3, %r2;
	cvta.to.global.u32   %r4, %r1;
	ld.global.u64        %rd1, [%r4];
	mov.u64              %rd2, 16777619;
	mul.hi.u64           %rd3, %rd2, %rd1;
	st.global.u64        [%r3], %rd3;
	ret;
}

It just adds garbage. That is why it is easier for me to write in PTX directly; at least I have some idea of what will come out (though no complete certainty, since PTX is not the actual assembler).

It is not just that the multiplication takes a long time. I used the SHA3 (Keccak) function: its code is 20 times longer, yet it achieves 38 million hashes per second. And here a simple function eats up a lot of resources. I just thought someone might know an easier way to do it.

In general, the C compiler never compiles the multiply into the following code, no matter what optimization level you set:

mov.b64         %rA0,0xff;     //same number....
shl.b64  	%rt0,%rA0, 24;
shl.b64  	%rt1,%rA0, 8;
add.u64		%rt0,%rt0,%rt1;	
shl.b64  	%rt1,%rA0, 7;
add.u64		%rt0,%rt0,%rt1;
shl.b64  	%rt1,%rA0, 4;
add.u64		%rt0,%rt0,%rt1;
shl.b64  	%rt1,%rA0, 1;
add.u64		%rt0,%rt0,%rt1;	
add.u64		%rA0,%rt0,%rA0;

And yet this code is faster than the "mul".
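The shift/add chain above follows from the bit decomposition of the prime: 0x01000193 = 2^24 + 2^8 + 2^7 + 2^4 + 2^1 + 2^0. A quick C check that it equals an ordinary multiply (a sketch; `mul_fnv_prime` is my name for it):

```c
#include <stdint.h>

/* Multiply by 0x01000193 using only shifts and adds,
   mirroring the PTX sequence above (this is the LOW 64 bits,
   i.e. the equivalent of mul.lo.u64, as noted in the reply). */
static uint64_t mul_fnv_prime(uint64_t x) {
    uint64_t t = (x << 24) + (x << 8);  /* 2^24 + 2^8 */
    t += (x << 7);                      /* + 2^7 */
    t += (x << 4);                      /* + 2^4 */
    t += (x << 1);                      /* + 2^1 */
    return t + x;                       /* + 2^0 */
}
```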

I have a hard time determining what you are trying to say. I note that you are pairing ‘long long int’ with __umul64hi(), that is a potential bug as umul64hi() assumes ‘unsigned long long int’, and the upper 64-bits of the full product differ between signed and unsigned multiplication.
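The signed/unsigned difference in the upper half is easy to demonstrate in plain C (a sketch assuming a compiler with the `__int128` extension, e.g. GCC or Clang; the helper names are mine):

```c
#include <stdint.h>

/* Upper 64 bits of the 128-bit UNSIGNED product, like __umul64hi() */
static uint64_t umulhi64(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

/* Upper 64 bits of the 128-bit SIGNED product, like __mul64hi() */
static int64_t smulhi64(int64_t a, int64_t b) {
    return (int64_t)(((__int128)a * b) >> 64);
}
```

For a = 0xFFFFFFFFFFFFFFFF the unsigned high half of a*2 is 1, while interpreted as signed (-1)*2 = -2 has a high half of all ones, so pairing `long long int` with __umul64hi() silently picks the wrong one.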

Your code seems to perform one 64-bit multiply, one 64-bit shift, and one 64-bit exclusive-OR for every 64-bit item loaded. That would appear to make the code memory bound, and I would therefore expect attempts to speed up this code by reducing the instruction count to fail. Is that what you are observing? What does the CUDA profiler say about bottlenecks in the code (when executed in context)?

Your point about possible optimizations for multiplies with constant multipliers that have only few 1-bits is valid, given that 64-bit multiplies are emulated on the GPU. There could be material for an enhancement request there, but it is too early to tell. It seems that your example above shows replacement code for mul.lo.u64, not mul.hi.u64. Note that comparisons of code efficiency should always be at the machine code (SASS) level, since PTXAS is an optimizing compiler, not an assembler, and PTX is only an intermediate code representation.

Everything is fine; I have now achieved the desired result: one SHA3, then FNV, then SHA3 again, for a total of 18 million operations per second. For now I am satisfied. Thanks for the answers.