CUDA MD5 cracking speed vs. ATI

cbuchner1 · August 1, 2009, 11:11pm

Found this blurb on slashdot.org

Yesterday at Black Hat USA 2009, a talk entitled MD5 Chosen-Prefix Collisions on GPUs (whitepaper) (Both PDFs) presented an implementation written in assembly language for ATI video cards that achieves 1.6 billion MD5 hash/sec, or 2.2 billion MD5 hash/sec with reversing, on an ATI Radeon HD 4850 X2

and I am starting to wonder - how would CUDA stack up against this benchmark? Any takers?

1.6 billion per second sounds like an awful lot of hashes. ;)

Christian

jph4599 · August 1, 2009, 11:39pm

It looks like there are a few MD5 implementations discussed on the forums… but nothing with 1.6 billion hashes/sec! Maybe updating some of these implementations for new devices would speed things up a bit.

http://forums.nvidia.com/index.php?showtopic=71548

http://forums.nvidia.com/index.php?showtopic=62250

http://3.14.by/en/read/md5_benchmark

http://www.elcomsoft.com/edpr.html

http://forums.nvidia.com/index.php?showtopic=95289

Pretty interesting stuff.

cbuchner1 · August 2, 2009, 10:15am

Just found a screenshot of BarsWF (CUDA multi-GPU + CPU) doing 1.7 billion per sec ;-) So not all hope is lost.

nuliknol · August 2, 2009, 2:28pm

You forgot the link:

http://www.blackhat.com/presentations/bh-u…d-MD5-PAPER.pdf

Note also how this guy attached 4 PCI Express cards to 1 PCI Express slot on a cheap motherboard; it looks like his application does not require high-speed host-to-GPU bandwidth. Maybe the latest ASUS mobo with 7 PCI Express slots could be adapted with 3 flexible connectors to have 7 GPUs in a node? That would be great!

empty_knapsack · August 3, 2009, 7:31pm

nVidia’s GPUs simply can’t compete with ATI’s on integer calculations.

Peak performance for a single HD4850 is 160 * 5 ALUs * 0.625 GHz = 500 * 10^9 integer operations per second.

Peak performance for a single GTX295 is 2 * 240 ALUs * 1.242 GHz = 596 * 10^9 integer operations per second.

One MD5_Transform (on a 64-byte block) requires about 640 operations (64 iterations with 9-10 operations each, plus initialization/finalization; in most cases you don’t even need the full 64 iterations), which leads to numbers like 800M hashes/sec on a single HD4850.

Though 800M is a really high value, I was only able to reach something like 715M on a single HD4850. Password generation itself takes some time (we need some texture sampling to build up the password from the charset, etc.).
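The back-of-envelope estimate above can be reproduced with a few lines of Python. This is only a sketch using the figures quoted in this post; it assumes one integer operation per ALU per clock and roughly 640 operations per hash, and the function names are made up for illustration.

```python
# Back-of-envelope peak estimate using only the figures quoted in the post.
OPS_PER_MD5 = 640  # approximate operations per MD5 transform (assumed, as quoted)


def peak_int_ops(alus, clock_ghz):
    """Peak integer ops/sec, assuming one operation per ALU per clock."""
    return alus * clock_ghz * 1e9


def peak_md5_rate(alus, clock_ghz, ops_per_hash=OPS_PER_MD5):
    """Upper bound on hashes/sec if every ALU cycle went into hashing."""
    return peak_int_ops(alus, clock_ghz) / ops_per_hash


hd4850 = peak_md5_rate(160 * 5, 0.625)   # single HD4850
gtx295 = peak_md5_rate(2 * 240, 1.242)   # GTX295 (two GPUs on one card)
measured_hd4850 = 715e6                  # rate reported in the post

print(f"HD4850 peak estimate: {hd4850 / 1e6:.1f}M hashes/sec")
print(f"GTX295 peak estimate: {gtx295 / 1e6:.1f}M hashes/sec")
print(f"HD4850 measured / peak: {measured_hd4850 / hd4850:.1%}")
```

The exact quotient for the HD4850 is a bit under the rounded 800M figure, which is why the measured 715M is closer to peak than it first looks.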

You can find my MD4/5/SHA1 brute-forcer here – http://www.golubev.com/hashgpu.htm. (Though the latest version isn’t the best for nVidia’s GPUs; it is especially slow on anything below GT200, as it is bottlenecked on texture fetching instead of integer ALU speed…)

cbuchner1 · August 3, 2009, 8:28pm

Would there be any way to reformulate the hashing algorithm to rely on 24-bit integers exclusively, or is this impossible? Is a potential speed-up for 24-bit ints limited to additions and multiplications only?

empty_knapsack · August 3, 2009, 9:06pm

AFAIK there is no difference between 24-bit and 32-bit additions; only multiplications are faster on 24-bit ints. However, even 24-bit mults aren’t faster than 32-bit shifts. A mad32 instruction could increase performance a bit (by helping with cyclic rotation), but unfortunately there is no such instruction.
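To illustrate the cyclic rotation remark: a 32-bit rotate is normally two shifts plus an OR, but because the two shifted halves never overlap, the OR can be replaced by an add, so a 32-bit multiply-add could fold one shift and the combine into a single operation. The sketch below models this in Python; `mad32` is a hypothetical instruction (as noted above, it does not exist on this hardware), and all names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def rotl32_shifts(x, n):
    """Rotate left with two shifts and an OR (three operations)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def mad32(a, b, c):
    """Hypothetical 32-bit multiply-add: low 32 bits of a * b + c."""
    return (a * b + c) & MASK32


def rotl32_mad(x, n):
    """Rotate left with one shift plus one multiply-add (two operations).

    x * 2**n equals x << n, and its low n bits are zero, so adding
    x >> (32 - n) (which is below 2**n) can never carry.
    """
    return mad32(x, 1 << n, x >> (32 - n))


# The per-round rotation amounts used by MD5.
MD5_SHIFTS = [7, 12, 17, 22, 5, 9, 14, 20, 4, 11, 16, 23, 6, 10, 15, 21]

all_match = all(
    rotl32_mad(x, n) == rotl32_shifts(x, n)
    for x in (0x00000001, 0x80000001, 0xDEADBEEF, 0xFFFFFFFF)
    for n in MD5_SHIFTS
)
print("mad-based rotate matches shift-based rotate:", all_match)
```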

Generally, nVidia’s GPUs can compete only on SP FP, where it’s possible to dual-issue an FP MUL so that peak performance doubles. Of course the algorithm itself also matters; sometimes it’s impossible to reach peak ALU performance. And sometimes it’s really hard to program an ATI GPU to reach its peak.

_Big_Mac · August 3, 2009, 11:44pm

A little interesting bit on 24-bit integer mul

I wonder whether that means NVIDIA is planning to speed up 32-bit int mul or slow the 24-bit version down.

seibert · August 5, 2009, 2:27am

It might be both. I could imagine that future architectures would gain the circuitry to perform a 32-bit multiply with the same throughput as floating point operations. Once that existed, you could simplify the FPU and remove the circuit which casts 32-bit integers into floats (and vice versa) for __mul24 operations. Preserving the exact semantics of __mul24 in existing kernels, however, would then require several instructions to emulate the correct masking and rollover behavior.
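As a rough picture of what that emulation would involve, here is a sketch of the unsigned variant: `__umul24` is documented as returning the low 32 bits of the product of the low 24 bits of its operands, so an emulation on a 32-bit multiplier needs two masks plus the multiply. The Python function names below are illustrative only, and this models the semantics rather than any actual instruction sequence a compiler would emit.

```python
MASK24 = 0x00FFFFFF
MASK32 = 0xFFFFFFFF


def umul24_emulated(x, y):
    """Model of __umul24: mask both inputs to 24 bits, keep low 32 bits."""
    return ((x & MASK24) * (y & MASK24)) & MASK32


def umul32(x, y):
    """Plain 32-bit multiply: low 32 bits of the full product."""
    return (x * y) & MASK32


# Inputs that fit in 24 bits: both multiplies agree.
small = (0x000123, 0x000456)
# An input with bits set above bit 23: the results differ, which is why
# existing kernels could not simply be switched to a 32-bit multiply.
large = (0x01000001, 3)

print(hex(umul24_emulated(*small)), hex(umul32(*small)))
print(hex(umul24_emulated(*large)), hex(umul32(*large)))
```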

azonenberg · August 13, 2009, 10:52pm

I’m developing this little guy for the computer security club at RPI. (BSD licensed source will be published soon!)

RPISEC | Computer security club at Rensselaer Polytechnic Institute.

The best performance achieved to date on one computer is 1.2 billion MD5s/sec on four Tesla C1060s. Running on three computers I was able to break 2 billion.
