Need a fast FNV function

Hello everyone. I need help with the FNV function; I'm tired of fighting with it.
In its simplest form the fnv function looks like this:

fnv4( x, y)
return x * 0x01000193 ^ y;
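
For anyone who wants to try it on the CPU first, here is a minimal C sketch of the same step. It assumes x and y are unsigned 32-bit words and that the multiplication wraps modulo 2^32; the pseudocode above does not state the types, so that part is my assumption. The macro name FNV_PRIME is made up for illustration.

```c
#include <stdint.h>

/* 32-bit FNV prime, the same constant as in the pseudocode above. */
#define FNV_PRIME 0x01000193u

/* One FNV step: multiply by the prime, then XOR with y.
   Assumption: both inputs are 32-bit and the product wraps modulo 2^32. */
static inline uint32_t fnv(uint32_t x, uint32_t y)
{
    return (x * FNV_PRIME) ^ y;
}
```

For example, fnv(1, 0) gives 0x01000193 and fnv(0x100, 0) gives 0x00019300, because the top byte of the product is discarded.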

I am writing it in PTX:

        mov.u32     round,0x00;
$LLBfnv1:   %rM,[mixzero];  %rA,[mixzero+128];
        mul.hi.u64  %rt0,%rM,0x01000193;
        shl.b64     %rt1,%rM,32;
        mul.hi.u64  %rt1,%rt1,0x01000193;
        shl.b64     %rt0,%rt0,32;
        xor.b64     %rt0,%rt0,%rt1;
        xor.b64     %rM,%rt0,%rA;

        add.u32     round,round,1; p,round,64;
        @p bra.uni  $LLBfnv1;
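
For comparison, here is a minimal C sketch of what I understand the two-lane step is meant to compute. It assumes that %rM holds two independent 32-bit words packed into one 64-bit register and that each word is multiplied modulo 2^32; that is my reading of the PTX above, not something it states. The name fnv_x2 is made up for illustration. Note that mul.hi.u64 returns the upper 64 bits of the 128-bit product, so the sketch uses plain 32-bit multiplies on each half instead; in PTX that would roughly correspond to mul.lo.u32 on the unpacked halves.

```c
#include <stdint.h>

/* Hypothetical helper: apply the 32-bit FNV step to both halves of a
   packed 64-bit value, then XOR with the packed second operand. */
static inline uint64_t fnv_x2(uint64_t m, uint64_t a)
{
    /* Each product is truncated to 32 bits, so the lanes cannot
       carry into each other. */
    uint32_t lo = (uint32_t)m * 0x01000193u;
    uint32_t hi = (uint32_t)(m >> 32) * 0x01000193u;

    /* Repack the two lanes and mix in the second operand. */
    return (((uint64_t)hi << 32) | (uint64_t)lo) ^ a;
}
```

Keeping the lanes separate like this costs two 32-bit multiplies per 64-bit register, which may or may not be faster than other packings on a given GPU.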

I need a way to process 128 bytes in 64 rounds. If I compute this with 16 threads in parallel, each handling 2 bytes, then the result of every round has to be kept, because %rM changes depending on the result of the previous round.

If I store it in shared memory, it turns out that I can run only 49152 / 128 = 384 threads simultaneously, which is very few.
At the moment I get 6,800,000 function executions on a GTX660. The alternative is not to parallelize the function itself and to compute the 128 bytes sequentially in each thread.
Then there is no need to store anything, because the thread itself already sees the results for all 128 bytes.
Here is an example in PureBasic to show why the results must be visible after each round:

For i = 0 To 63
  p = fnv(i ! ValueL(*s), ValueL(*mix + i % w)) % (n / mixhashes) * mixhashes
Next i
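
The same dependency can be sketched in C. The function round_index mirrors the PureBasic line above; everything in run_rounds beyond that line (the table array, the way mix is updated from it, the p_out buffer) is a hypothetical stand-in I made up to show that round i cannot start before round i - 1 has finished updating mix. It also assumes that mix is indexed in 32-bit words.

```c
#include <stdint.h>

static uint32_t fnv(uint32_t x, uint32_t y)
{
    return (x * 0x01000193u) ^ y;
}

/* Index for round i; mirrors the PureBasic expression.
   s0 stands for ValueL(*s), mix[i % w] for ValueL(*mix + i % w). */
static uint32_t round_index(uint32_t i, uint32_t s0, const uint32_t *mix,
                            uint32_t w, uint32_t n, uint32_t mixhashes)
{
    return fnv(i ^ s0, mix[i % w]) % (n / mixhashes) * mixhashes;
}

/* Hypothetical driver: 64 rounds over a mix of w words.
   Assumption: after each round every mix word is combined with a
   word from a lookup table of n entries selected by p. */
static void run_rounds(uint32_t s0, uint32_t *mix, uint32_t w,
                       const uint32_t *table, uint32_t n,
                       uint32_t mixhashes, uint32_t *p_out)
{
    for (uint32_t i = 0; i < 64; i++) {
        /* p depends on the current mix, i.e. on all previous rounds. */
        uint32_t p = round_index(i, s0, mix, w, n, mixhashes);
        p_out[i] = p;

        for (uint32_t k = 0; k < w; k++)
            mix[k] = fnv(mix[k], table[(p + k) % n]);
    }
}
```

Because p in each round is derived from mix, threads that split the mix between them have to exchange their words after every round, which is exactly the storage problem described above.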