Why is the Whirlpool hash so slow on CUDA?

Hi

Background: I am trying to find pre-images of hash values using CUDA.

I implemented some hash algorithms (SHA variants and RIPEMD-160) as CUDA code and finally (after some streamlining…) got an impressive speedup with my GTS250, GTX285 and Tesla 1060 cards.

Next I tried Whirlpool, starting from the authors’ reference implementation from their web site as a first attempt.
But unlike all the other hashes, this gave a slowdown relative to any of my CPU implementations.

At first I blamed the use of long long ints and replaced them with a 32-bit implementation, but that gave no speedup either.

Perhaps this has to do with the implementation’s extensive usage of lookup tables.
Thus I tried to put them into constant memory but that only helped a little bit.

Do you know of any CUDA Whirlpool implementation with source code that is actually faster than a CPU version?

Once I understand the underlying mathematical structure, I will try to replace the lookup tables with (a lot of) calculations, hoping for better performance.

Thanks for any help in advance!

This is probably the issue. Lookup tables in global memory will be very slow, and constant memory will only be significantly faster if every thread in the warp reads the same table entry.

How big are the lookup tables? Would they fit in shared memory?

Together they are about 8 KB.

I will try your idea immediately.

I need some shared memory for other things, too, but I will check whether it all fits together.

Thanks

Martin

Textures might be even better than shared memory, since you won’t have any control over how you read from different banks of shared memory (I’m assuming the lookup locations are essentially random). If it is only 8k, that should fit in the L1 texture cache.

Over 100,000 people involved in GPU mining are waiting for the “right” cudaMiner: a software program highly optimized to yield better performance on Nvidia cards.

NVidia GPUs are much worse for crypto currency mining than AMD GPUs. If you check mining hardware performance, you’ll find that an AMD Radeon 270x, for example, can mine about 375 kha-XMR (cost 160 euros), compared with NVidia’s 750 ti at 215 kh/s (cost 155 euros).

That is the only reason why people in the mining world prefer AMD cards. (I bought 80+ AMD cards last year, so do the math…)

Just one quote from our forum

If there was Nvidia miner for XMR,BBR and every other new algo out there they would probably sell thousands of cards for mining every week

I hope good things come to those who wait - please check this thread https://bitcointalk.org/index.php?topic=167229.0

Can you even make money these days with AMD gpus after considering power costs? I thought the only people doing well are those using ASIC setups.

Also, the GTX 780ti is a good step up from the old GPUs everyone complains about. I can get over 1,100 Giops (integer operations per second) out of the GTX 780ti (and over 4.1 teraflops for 32-bit float matrix multiplication).

Keep in mind that most people using CUDA are working on image processing or scientific research. Most of my work has been related to computational medicine, and I have yet to see anyone using AMD GPUs.

Also, for games Nvidia kills AMD (based on user-submitted tests):

http://www.videocardbenchmark.net/high_end_gpus.html

for PassMark - G3D Mark:

Nvidia GTX 780ti -> 8,978 score

AMD 290x -> 6,771 score

This is a zombie thread, but I will still reply to the original posting.

Whirlpool is an AES derivative.

Most AES implementations on the CPU use t-tables to do fast lookups of expensive transformations in GF(2^8). The lookups cover either just the AES S-Box or, sometimes, further transformations folded into the same tables.

If you implement it on a GPU, either place the t-tables in shared memory or in texture memory for best performance. Constant memory is not well suited for your application because it works best if all threads within a warp access the same element (broadcast read). With AES this is typically NOT the case.
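As a rough illustration of the shared-memory option, here is a minimal sketch that stages one 256-entry table into shared memory at the start of a kernel. The names `lookup_kernel`, `g_table`, `in` and `out` are made up for this example, and the single lookup per thread only stands in for the several table lookups a real Whirlpool round would perform.

```cuda
#include <cstdint>

#define TABLE_ENTRIES 256  // one 8-bit-indexed table of 64-bit words (2 KB)

// Hypothetical kernel: copies a lookup table from global to shared memory,
// then performs one data-dependent lookup per thread.
__global__ void lookup_kernel(const uint64_t *g_table,  // table in global memory
                              const uint8_t  *in,       // one index byte per thread
                              uint64_t       *out,
                              int             n)
{
    __shared__ uint64_t s_table[TABLE_ENTRIES];

    // Cooperative copy: each thread loads a strided subset of the entries,
    // so this works for any block size.
    for (int i = threadIdx.x; i < TABLE_ENTRIES; i += blockDim.x)
        s_table[i] = g_table[i];

    // All entries must be in place before any thread reads the table.
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = s_table[in[idx]];  // random index: bank conflicts are possible
}
```

The copy costs one pass over the table per block, so it pays off only if each thread then does many lookups. With essentially random indices some bank conflicts remain, but they are usually far cheaper than the serialized reads that divergent indices cause in constant memory.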

It may be even faster to do a bitsliced implementation of the S-Box. This worked great when we implemented the Groestl hash function for ccMiner.

For 64 bit shift operations, be sure to use the funnel shifter when available (the built-in << operators for unsigned long long variables don’t seem to do it properly, at least up to CUDA 6.0)

Also if your hash uses more than 64 registers per thread on Compute 2.0 and 3.0 architectures, spillage to local memory will slow down execution notably. Consider using e.g. 4 threads simultaneously to compute one hash, spreading the state variables over 4 threads. For inter-thread data exchange this may require the use of warp shuffle (available in Compute 3.0 and later) or shared memory (on Compute 2.0).

“NVidia GPUs are much worse for crypto currency mining” is blatantly false.

The public implementation of the XMR miner for nVidia isn’t good. I acknowledge that. It is also just a few days old. It was written by a bitcointalk forum member named “tsiv”.

Our private implementation (which I can’t share) was optimized with Compute 3.0 in mind and runs great on amazon EC2’s Grid K520 GPUs. Of course we want to keep this competitive advantage because it SCALES. And we want to remain the only ones who SCALE ;)

Generally speaking the ccMiner software has historically often implemented new mining algorithms with a speed advantage over AMD. Only just recently AMD has caught up due to the work of “djm34” who fixed the OpenCL kernels to run nicely with AMD’s new driver revision 14.6. So nVidia and AMD are about equal in terms of performance now, but nVidia’s 750Ti gets to keep the advantage in the power efficiency domain.

@ CudaaduC
Yes, GPU mining is still profitable ( ROI in less than 10 months )

@ cbuchner1
When you released the ccminer version with support for Jackpotcoin, I bought 30 new 750 ti cards. That was the first “nVidia coin” among the most profitable coins after a long period of time.

Yes, private implementation which you can’t share

Nvidia should/must start hiring people to optimize miners for every new algo! (that will skyrocket sales)

Should not depend on one man (cbuchner1, in this case)

Well, there have been contributions to cudaminer by nVidia and to ccMiner by other talented programmers. And don’t forget that ccMiner came into existence because some other Christian joined me for a couple of months. And most recently tsiv submitted the X13 algorithm.

I think we’ve hijacked a zombie thread.

About whirlpool: we still don’t have it implemented in ccMiner, so no RouletteCoin or X15 hash algorithm for you ;)

Christian

Christian,

I truly appreciate your time and effort and I am impressed with what you have done to enable us to have such great toys “cudaminer & ccminer”

I think we all know the reason why AMD’s market share went from 40% to probably 70-75% in Q4 and Q1.

AMD is releasing new cards obviously optimized for mining (Tonga). Why? The crypto boom is NOT over; there is just too much room to grow.

We just need the right tools (Nvidia miners for XMR, BBR and every other new algo); knowledge, experience and money we already have :)

@cbuchner1: You wrote: “For 64 bit shift operations, be sure to use the funnel shifter when available (the built-in << operators for unsigned long long variables don’t seem to do it properly, at least up to CUDA 6.0)”.

Looking at some test cases, everything seems to work fine with CUDA 6.0. For the architectures that include the barrel shifter (sm_35 and sm_50), 64-bit shifts with a variable shift count are translated into pairs of SHF instructions, as expected. I am seeing the same for 64-bit shifts with the various constant shift counts I tried. I was a bit surprised to see the dual-SHF idiom used even when the constant shift count of a logical shift is a multiple of 8. I would have expected to see a pair of PRMT instructions in that case, but I do not have in-depth knowledge of all the intricacies of the tradeoffs between the multiple execution pipes in these GPUs, so the dual-SHF idiom may in fact be best.

It is possible that there could be context-specific issues with the code generation for 64-bit shifts that don’t pop up in my simple test code. If that is the case, it would be helpful to file a bug.

Okay, I think my performance problems arose because a ROTL64 or ROTR64 macro like the one given below is not detected by the compiler as a construct that can be executed with only two funnel shifts. Instead the compiler generates two separate shifts followed by a logical or, just as instructed by the programmer.

#define ROTL64(v, n) \
  (((v) << (n)) | ((v) >> (64 - (n))))

I solved it by directly inlining funnel shift intrinsics when available.
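For readers who want to try the same thing, a minimal sketch of such a rotate helper might look like the following. The name `rotl64` and the `HOST_DEVICE` macro are made up for this example; `__funnelshift_l` is the CUDA intrinsic. The sketch assumes the funnel shifter is used on sm_35 and later (the architectures mentioned above) and falls back to the plain shift/or form everywhere else, including host code.

```cpp
#include <cstdint>

// Lets the same helper build with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: rotate a 64-bit value left by n bits (n taken mod 64).
HOST_DEVICE inline uint64_t rotl64(uint64_t v, unsigned n)
{
    n &= 63;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 350
    uint32_t lo = (uint32_t)v;
    uint32_t hi = (uint32_t)(v >> 32);
    if (n & 32) {  // rotating by 32 just swaps the two halves
        uint32_t t = lo; lo = hi; hi = t;
    }
    // __funnelshift_l(a, b, s) returns the upper 32 bits of (b:a) << (s & 31),
    // so each output half is one funnel shift over the two input halves.
    uint32_t new_hi = __funnelshift_l(lo, hi, n);
    uint32_t new_lo = __funnelshift_l(hi, lo, n);
    return ((uint64_t)new_hi << 32) | new_lo;
#else
    // Portable fallback; the n == 0 case avoids an undefined shift by 64.
    return n ? (v << n) | (v >> (64 - n)) : v;
#endif
}
```

With a compile-time constant `n`, the branch on `n & 32` should fold away, leaving the two funnel shifts. It is worth checking the generated SASS (for example with `cuobjdump -sass`) to confirm that for a given toolkit version.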

That would appear to be an issue of a different kind. I know that the frontend of the CUDA compiler has some idiom recognition for rotates. But rotates are not expressible at the PTX level. Since the support for funnel shifts is not universal across all GPUs, architecture-specific code generation typically happens in the backend compiler (PTXAS), and 64-bit shifts themselves are emulated in an architecture-specific manner, I can see how this would result in complications.

Last time I checked, at least 32-bit rotates were handled successfully. You may want to file an RFE for improved translation of 64-bit rotates, citing your specific use case. As mentioned, for shift counts that are multiples of 8, the use of byte permutation (either intrinsic or inline PTX) may also be a good choice, and will work on all architectures >= sm_20.
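To illustrate the byte-permutation idea, here is a minimal sketch of a 64-bit rotate left by exactly 8 bits built from two `__byte_perm` calls. The helper name `rotl64_by_8` and the `HOST_DEVICE` macro are made up for this example, and the host branch exists only so the function can be checked without a GPU. Other multiples of 8 would need different selector constants.

```cpp
#include <cstdint>

// Lets the same helper build with nvcc and with a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: rotate a 64-bit value left by 8 bits (one byte).
HOST_DEVICE inline uint64_t rotl64_by_8(uint64_t v)
{
#ifdef __CUDA_ARCH__
    uint32_t lo = (uint32_t)v;          // bytes 0..3 of the selector space
    uint32_t hi = (uint32_t)(v >> 32);  // bytes 4..7 of the selector space
    // Each selector nibble picks one source byte; the lowest nibble
    // chooses the lowest byte of the result.
    uint32_t new_lo = __byte_perm(lo, hi, 0x2107);  // bytes 7, 0, 1, 2
    uint32_t new_hi = __byte_perm(lo, hi, 0x6543);  // bytes 3, 4, 5, 6
    return ((uint64_t)new_hi << 32) | new_lo;
#else
    return (v << 8) | (v >> 56);  // plain form for host-side checking
#endif
}
```

Whether this beats a pair of funnel shifts on a particular GPU is not something the sketch settles; it would need to be measured.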