CUDA MD5 cracking speed vs. ATI

Found this blurb on slashdot.org

Yesterday at Black Hat USA 2009, a talk entitled MD5 Chosen-Prefix Collisions on GPUs (whitepaper) (Both PDFs) presented an implementation written in assembly language for ATI video cards that achieves 1.6 billion MD5 hash/sec, or 2.2 billion MD5 hash/sec with reversing, on an ATI Radeon HD 4850 X2

and I am starting to wonder - how would CUDA stack up against this benchmark? Any takers?

1.6 billion per second sounds like an awful lots of hashes. ;)

Christian

It looks like there’s a few MD5 implementations discussed on the forums…but nothing with 1.6 billion hashes/sec! Maybe updating some of these implementations for new devices would speed things up a bit.

http://forums.nvidia.com/index.php?showtopic=71548

http://forums.nvidia.com/index.php?showtopic=62250

http://3.14.by/en/read/md5_benchmark

http://www.elcomsoft.com/edpr.html

http://forums.nvidia.com/index.php?showtopic=95289

Pretty interesting stuff.

Just found a screen shot of BarsWF (CUDA multi-GPU + CPU) doing 1.7 billion per sec ;-) So not all hope is lost.

you forgot the link:

http://www.blackhat.com/presentations/bh-u…d-MD5-PAPER.pdf

Note also how this guy adapted 4 PCI Express cards on 1 PCI Express slot on a cheap motherboard, looks like his application does not require high speed HOST->GPU bandwidth. Maybe the lastests ASUS 7PCI Express mobo could be adapted with 3 flexible connectors to have 7 GPUs in a node? That would be great!

nVidia’s GPUs simply can’t compete with ATI’s on integer calculations.

Peak performance for single HD4850 is 160 * 5 * 0.625 = 500 * 10^9 integer operations per second.

Peak performance for single GTX295 is 2 * 240 * 1.242 = 596 * 10^9 integer operations per second.

One MD5_Transform (on 64 bytes blocks) requires about 640 operations (64 iterations with 9-10 operations each + initialization/finalization. Plus you don’t need full 64 iterations in most cases), which leads to numbers like 800M hashes/sec on single HD4850.

Though 800M is really high value, I was only able to reach something like 715M on single HD4850. Password generation itself takes some time (we need some texture sampling to build-up password from charset, etc).

You can find my MD4/5/SHA1 brute-forcer here – http://www.golubev.com/hashgpu.htm. (Though last version isn’t best for nVidia’s GPUs, especially slow on anything below GT200 as bottlenecked on texture fetching instead of integer ALU speed…)

would there be any way to reformulate the hashing algorithm to rely on 24 bit integers exclusively, or is this impossible? Is a potential speed-up for 24 bit ints limited to additions and multiplications only?

AFAIK there no difference between 24 & 32-bit additions, only multiplications faster on 24-bit ints. However even 24-bit mults aren’t faster than 32-bit shifts. Though mad32 can increase performance a bit (to help with cyclic rotation). Unfortunately there no such instruction.

Generally, nVidia’s GPU can compete only on SP FP where it’s possible to dual-issue FP MUL, so peak performance doubles. Of course algorithm itself also matters, sometimes it’s impossible to reach peak ALU performance. And sometimes it’s really hard to program ATI GPU to peak.

A little interesting bit on 24bit integer mul

I wonder whether that means NVIDIA is planning to speed-up 32-bit int mul or slow the 24-bit version down.

It might be both. I could imagine that future architectures would gain the circuitry to perform a 32-bit multiply with the same throughput as floating point operations. Once that existed, you could simplify the FPU and remove the circuit which casts 32-bit integers into floats (and vice versa) for __mul24 operations. Preserving the exact semantics of __mul24 in existing kernels, however, would then require several instructions to emulate the correct masking and rollover behavior.

I’m developing this little guy for the computer security club at RPI. (BSD licensed source will be published soon!)

http://rpisec.net/news/show/39

The best performance achieved to date on one computer is 1.2 billion MD5s/sec on four Tesla C1060s. Running on three computers I was able to break 2 billion.