Hi
The attached code is a heavily stripped-down version of an application that needs an iterated Whirlpool hash for a large number of inputs.
In the code, each thread hashes its own private input data; more precisely, it only runs the hash compression function, without finalization.
There are no dependencies on the results of other threads.
The hash code itself was ported from a 32-bit implementation of the algorithm.
Each thread hashes its input by calling the compression function many times (typically a few thousand).
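For context, the per-thread work is the usual iterated compression: the compression function is applied once per 64-byte block, with no padding or finalization step. Here is a minimal sketch of that loop; the mixing inside `whirlpool_compress` is a placeholder, not my real code:

```c
#include <stdint.h>
#include <stddef.h>

/* Placeholder for the real compression function: XOR-fold the 64-byte
   block into the 8-word state.  Only the call pattern matters here. */
static void whirlpool_compress(uint64_t state[8], const uint8_t block[64]) {
    for (int i = 0; i < 8; i++) {
        uint64_t w = 0;
        for (int j = 0; j < 8; j++)
            w = (w << 8) | block[i * 8 + j];   /* big-endian load */
        state[i] ^= w;                         /* placeholder mixing */
    }
}

/* Each thread does this with its own private buffer: one compression
   call per block, typically a few thousand calls, no finalization. */
static void hash_iterated(uint64_t state[8], const uint8_t *data,
                          size_t nblocks) {
    for (size_t b = 0; b < nblocks; b++)
        whirlpool_compress(state, data + 64 * b);
}
```
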
The code measures the wall-clock time used for processing the task on the GPU and on the CPU.
A typical input of mine translates to the command line
./whirlpool 3000 48 4000
On my GTS250 I get about 4.8 seconds on the GPU and 23.2 seconds on the CPU, a speedup of less than 5.
On my Tesla C1060 I get a speedup of about 6.
Similar code for other hash algorithms (the SHA* family) gives speedups of at least 35 on the GTS250 and 50 on the Tesla.
Speeding up the function whirlpool_trafo seems especially important to me.
I have tried everything I could think of, but nothing improved the performance of this code.
I tried preloading data into registers, reordering the byte accesses in different schemes, using shared memory for L, accumulating the result in steps instead of one big expression, consuming the input data K word-wise, and other things.
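For reference, whirlpool_trafo follows the standard table-driven round structure of the reference implementation: each output word gathers one byte from each of the 8 input words, and each byte indexes a different 256-entry table of 64-bit constants. A simplified CPU-side sketch of that access pattern (with dummy table contents, not the real Whirlpool constants) looks like this:

```c
#include <stdint.h>

static uint64_t C[8][256];   /* in the real code: 8 precomputed round tables */

/* Dummy fill so the sketch is runnable; the real tables hold the
   Whirlpool S-box/MDS constants. */
static void init_dummy_tables(void) {
    for (int t = 0; t < 8; t++)
        for (int x = 0; x < 256; x++)
            C[t][x] = (uint64_t)(t + 1) * 0x0101010101010101ULL
                    ^ (uint64_t)x;
}

/* One round transform: 64 data-dependent 64-bit table loads.  These
   scattered lookups are the memory access pattern I am trying to make
   fast on the GPU. */
static void round_trafo(uint64_t L[8], const uint64_t K[8]) {
    for (int i = 0; i < 8; i++) {
        uint64_t acc = 0;
        for (int t = 0; t < 8; t++)
            acc ^= C[t][(uint8_t)(K[(i - t) & 7] >> (56 - 8 * t))];
        L[i] = acc;
    }
}
```
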
Since I am just at the beginning of my CUDA career, I want to learn from this.
So please tell me how you would go about improving the code, which tools you would use, and why you would change which parts!
Other parts of my application use shared memory for caching; that memory is already reserved at kernel launch.
In case it matters: I work on Linux.
Thanks for any help on this in advance!