Speedy CUDA tool to win EngineYard SHA1 contest

Added another 5% speed boost by giving the final rounds a simpler w-array access, since we don’t need to update the w array anymore at that point. Now on to the 12-word feature; that’s not a speed issue, but it’s critical for following the contest rules.

I made a test dictionary and some random sentences with their hash values, in case you want test data:

Dictionary:

initiate

postage

incidental

homestead

wailing

eggplant

squall

underwear

friar

warble

terracing

okay

ruffle

subsistence

unpleasantness

terrestrial

accolade

peeling

orchestration

sunset

oppressor

enigma

goblin

capsizing

deepening

lithium

pageant

participation

rattlesnake

snip

Test Values:

f42e9e861f7ba01037bbf64e0a86b5206717d10c

unpleasantness initiate capsizing orchestration postage terracing oppressor ruffle rattlesnake peeling wailing deepening

f0094a7c61fbf8cf593d16c827a394b55bccd510

goblin squall terracing accolade capsizing postage ruffle subsistence pageant initiate participation underwear

1aaeebbbf92861845aafbcca3b7284984857a657

friar okay subsistence underwear warble pageant oppressor terracing wailing deepening accolade ruffle

59f5de5339f8b3aa831ac6bb0dff01d8d5d5e521

subsistence participation rattlesnake incidental pageant oppressor warble underwear deepening capsizing snip friar

5071871ba767331d2710759f12f750eb202b83ea

unpleasantness oppressor pageant snip postage terrestrial enigma rattlesnake subsistence goblin wailing participation

f01d158a9a3c167c778df0cfd1b866b8a8ade201

wailing incidental okay snip peeling capsizing terracing underwear friar deepening goblin squall

a4b93f5368c46c44edad4935fc87f60a9502c407

wailing incidental snip sunset capsizing rattlesnake homestead pageant oppressor ruffle okay friar

3aeb759332cde615ffd61db79956ad0ba34e9312

enigma homestead terracing deepening incidental subsistence goblin orchestration wailing initiate friar warble

ccadeb7b62971c7d68a2fc43ff5aef0769e4b827

lithium participation accolade postage oppressor ruffle capsizing orchestration sunset friar wailing snip

907bb2e32a3583de0dd9bc6dfc52fcb9448141ac

friar squall pageant capsizing goblin participation snip deepening incidental initiate postage oppressor

6256b670bc91374318d78c58f23e327b4d7b0324

peeling capsizing sunset wailing initiate enigma warble pageant snip accolade participation underwear

88a1d646553ca3d71c92fc479609d0b4783e7d1e

orchestration snip participation subsistence eggplant terrestrial unpleasantness friar initiate postage oppressor homestead

11607b2d54ea9920a8d9ed55f6d4de2449aac8aa

homestead sunset goblin friar initiate snip ruffle orchestration okay subsistence squall underwear

9057f006abb4579054cb9f1599db97ffe0949e3d

lithium subsistence orchestration incidental ruffle snip wailing pageant initiate goblin postage underwear

c710b2b77ded55b0bebee5e19f2b77e18f1d6cc1

subsistence sunset enigma snip oppressor warble friar squall unpleasantness capsizing terracing terrestrial

d03cbd0c7c0b1797588217e807c453ab69d07b49

ruffle homestead deepening rattlesnake snip oppressor incidental warble subsistence participation unpleasantness wailing

266581d17b52e652f89b9cfb256ead191dafbf35

warble snip rattlesnake wailing subsistence orchestration deepening participation friar terracing lithium incidental

c177d2c3c072ccb98a9b97e69c10f9ed29c7f02f

oppressor initiate peeling rattlesnake squall wailing deepening snip okay terrestrial incidental sunset

43433537e1b2dcf755d67c7afe2baaa94813bf7e

warble squall terrestrial sunset subsistence friar postage orchestration terracing ruffle homestead unpleasantness

5744bd83e6c38418c84a4fb674cff32d60b6e6b3

warble squall oppressor eggplant deepening postage lithium initiate subsistence homestead friar incidental

I have some doubts about the speed calculations in 0.11. It looks like threads are doing 64×64 iterations, but the overall stats are based on the old 93×93, so it shows bigger numbers than it really computes ;).

You’re right! I was surprised that there was an extra factor of 2 in the speed!

I did a little poking around, and sadly, there are no clever mathematical improvements to be had (in the next 12 hours anyway):

Ever Better Cryptanalytic Results Against SHA-1 - Schneier on Security: http://www.schneier.com/blog/archives/2009...better_cry.html

There are SHA-1 collision attacks now which can get you down to 2^52 hash operations; however, there do not appear to be any public pre-image attacks, which is what would be required to win this contest. :(

So the best way to win this contest is still luck and having the most transistors devoted to computing hashes…

Sure,

It’s pretty much the same strategy you are using: randomly select 12 words from the list and randomize the case, then iterate through the possibilities for the 5-char word. Using OpenSSL’s API, it is easy to store the hash state of the first block after selecting the 12 words and only compute the last block for each trial.
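For reference, here’s a minimal sketch of that midstate trick using OpenSSL’s low-level SHA1 API (the tail mutation and the scoring are placeholders, not the tool’s actual code):

#include <string.h>
#include <openssl/sha.h>

void try_tails(unsigned char *sentence, size_t total_len)
{
    SHA_CTX base;
    SHA1_Init(&base);
    SHA1_Update(&base, sentence, 64);        /* absorb the fixed first 64-byte block once */

    for (unsigned trial = 0; trial < 1000000; trial++) {
        unsigned char digest[SHA_DIGEST_LENGTH];
        SHA_CTX ctx = base;                  /* struct copy restores the saved midstate */
        /* ...vary the trailing 5-char word in sentence[64..] here... */
        SHA1_Update(&ctx, sentence + 64, total_len - 64);
        SHA1_Final(digest, &ctx);            /* only the last block gets recomputed */
        (void)digest;                        /* compare against the target, keep the best score */
    }
}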

EDIT: Attached the 0.11 version compiled for Mac OSX.
gpusha1_0.11.zip (65 KB)

That’s part of the way I’m approaching the program (the reduction method). If I can’t get something working within an hour or two, I’ll just post up what I have so that you guys can have a crack at it. I’m taking a very different angle to the problem, so perhaps we can figure out which way works best and try to come up with a hybrid solution tonight.

Oh, I also wanted to ask you guys…does anyone know what the maximum length of the words in the dictionary is? I tried asking on their website, but my comment was deleted for some reason. I’ve got some good optimizations I can make depending on how long the words are. I assumed 16 for now, but less than that (like 12 or 14) would be even more awesome.

They don’t say, but it’s likely that the words will average 5-7 characters. I just updated my code to make sure the 12 base words picked sum to between 64 and 110 characters, including padding spaces. It’s really unlikely, but they could play a game and release a dictionary of all 2-4 letter words, in which case my code will fail and I’d need to run back and make a 1-block version (again, like version 0.11).

Version 0.12 posted.

This now works with the full 12-word requirement, so it’s usable for the contest.

GTX 295 (both halves): 312 M/sec

GTX 285: 189 M/sec

Ok, now with a working tool, we have questions about speedups!

Some optimizations now worth investigating:

  1. rearrange the host to launch async kernels, so the CPU can do its own analysis of one launch’s results while the next kernel is running.
    Estimated speedup… ??? I can’t even guess! 1%? 50%?

  2. Precompute the first few constant rounds of the hash function; this might save 10 of the 80 rounds (see the host-side sketch after this list).
    Estimated speedup: 10%.

  3. do a reduction at the end of the kernel instead of a one-thread accumulator
    Estimated speedup 2%

  4. After multiple kernel launches, run one reduction kernel to find the best of 10,000 blocks. Report just that one result.
    Estimated speedup 5% (seibert showed the xfer overhead wasn’t an issue)

  5. switch the per-block initialization from 93×93×93 characters to 64×64×64 to save divides
    Estimated speedup 1%

  6. Switch the shared-memory w array to 16 named registers
    Estimated speedup: -1% (register shuffling is slower than shared-memory indexing), but it gives much better flexibility in kernel thread population, which might give 5%?

  7. experiment with block sizes, perhaps auto-tuning them to find a block count with maximum throughput
    Estimated speedup 3%
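Expanding on idea 2: a rough host-side sketch of what “precompute the constant rounds” could look like. Assuming the first fixed_rounds message words of the block are identical for every candidate in a launch, those rounds can run once on the CPU and the kernel starts from the resulting state (the function and names here are illustrative, not the tool’s code):

#include <stdint.h>

static uint32_t rotl(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* Advance a SHA-1 state through the first fixed_rounds (< 16, so only raw
   message words and the round 0-19 logic are involved) whose w[t] values
   are constant across the whole launch. */
void sha1_prefix(uint32_t state[5], const uint32_t w[16], int fixed_rounds)
{
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3], e = state[4];
    for (int t = 0; t < fixed_rounds; t++) {
        uint32_t f = (b & c) | (~b & d);                        /* Ch, rounds 0-19 */
        uint32_t tmp = rotl(a, 5) + f + e + 0x5A827999u + w[t];
        e = d; d = c; c = rotl(b, 30); b = a; a = tmp;
    }
    state[0] = a; state[1] = b; state[2] = c; state[3] = d; state[4] = e;
}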

OK, I admit that all those estimates are very vague… it’s hard to estimate! But they give an idea anyway.

I think the async kernel launch has the best chance for a decent speedup, though that’s far from certain. I’ll experiment with that first.

What do we think the best host strategy might be to keep the GPU busy?

Right now the loop is basically

for (;;) {
Launch kernel (30 ms)
memcpy back results (4K or so of data)
analyze results on CPU (quite fast but nonzero time); it’s a loop over 1000 values to find the minimum
}

Some ideas for increasing GPU usage:

PING PONG method.

Launch kernel & memcopy on stream 0
Launch kernel & memcopy on stream 1
for (;;) {
Wait for stream 0.
analyze 0’s response.
Relaunch stream 0 with another kernel & memcpy
Wait for stream 1.
Analyze stream 1’s response
Relaunch stream 1 with another kernel & memcpy
}
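In CUDA-runtime terms, the ping-pong might look roughly like this (search_kernel, Result, analyze(), and the sizes are placeholder stubs, not the real tool):

#include <cuda_runtime.h>

typedef unsigned int Result;                            // placeholder result type
enum { N_RESULTS = 1024, GRID = 1024, BLOCK = 256 };
const size_t RESULTS_BYTES = N_RESULTS * sizeof(Result);

__global__ void search_kernel(Result *out) { out[threadIdx.x] = 0; }  // stub search
void analyze(const Result *r) { (void)r; }              // stub: host scan for the minimum

void ping_pong(void)
{
    cudaStream_t stream[2];
    Result *d_res[2], *h_res[2];
    for (int i = 0; i < 2; i++) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc((void **)&d_res[i], RESULTS_BYTES);
        cudaMallocHost((void **)&h_res[i], RESULTS_BYTES);  // pinned, needed for async copies
        search_kernel<<<GRID, BLOCK, 0, stream[i]>>>(d_res[i]);
        cudaMemcpyAsync(h_res[i], d_res[i], RESULTS_BYTES,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    for (;;) {
        for (int i = 0; i < 2; i++) {
            cudaStreamSynchronize(stream[i]);   // wait for this stream's batch
            analyze(h_res[i]);                  // CPU work overlaps the other stream's kernel
            search_kernel<<<GRID, BLOCK, 0, stream[i]>>>(d_res[i]);
            cudaMemcpyAsync(h_res[i], d_res[i], RESULTS_BYTES,
                            cudaMemcpyDeviceToHost, stream[i]);
        }
    }
}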

Or, alternatively
ACCUMULATOR

Launch the kernel in a loop 200+ times; each launch appends its answers to the end of a results array.
Do one memcpy from the device, then analyze it all in one go.
repeat.
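And the accumulator, reusing the placeholders from the ping-pong sketch above (each launch writes at a different offset of one big device array, so there’s only one memcpy per batch):

enum { BATCHES = 200 };

void accumulate(void)
{
    Result *d_all, *h_all;
    cudaMalloc((void **)&d_all, BATCHES * RESULTS_BYTES);
    cudaMallocHost((void **)&h_all, BATCHES * RESULTS_BYTES);
    for (;;) {
        for (int i = 0; i < BATCHES; i++)
            search_kernel<<<GRID, BLOCK>>>(d_all + i * N_RESULTS);  // append at offset i
        cudaMemcpy(h_all, d_all, BATCHES * RESULTS_BYTES,
                   cudaMemcpyDeviceToHost);                         // one copy per batch
        for (int i = 0; i < BATCHES; i++)
            analyze(h_all + i * N_RESULTS);                         // scan all at one go
    }
}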

The ping-pong seems like it will work and also gives instant feedback every few seconds, which is nice.
The accumulator is simpler and feels more efficient due to the single memcpy.

Other ideas? The accumulator method could combine with the “Add a final device memory reduction kernel” idea too.

Well, since I don’t know if I’ll get my program finished by tonight (and I’m only willing to spend a little more time on it), I’ll post my code. If you take a look at it, I’ve taken a bit of a different approach from SPWorley. Basically, my program packs each word into a 16-byte char buffer (‘SentenceWord’), and then packs 12 of those char buffers into another struct, called ‘SentenceWordBlock’. It also takes a struct of chars, where each one represents the actual length of each word, a parameter containing the target hash value, and two index values, where each index represents one of the 12 words.
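From that description, the structs presumably look something like this (only SentenceWord and SentenceWordBlock are named in the post; the lengths struct and the field names are my guesses):

struct SentenceWord {
    char letters[16];               // one dictionary word, padded to 16 bytes
};

struct SentenceWordBlock {
    struct SentenceWord words[12];  // the 12 randomly chosen words
};

struct SentenceWordLengths {
    char len[12];                   // actual length of each word
};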

Then, I set it up so that it uses the locations in the grid and thread block to calculate two 16-bit values which correspond to the two selected words (of the 12 words that were randomly selected from the dictionary and passed to the kernel). Each bit in the 16-bit permutation value denotes whether the corresponding letter is treated as uppercase or lowercase when the SHA1 hash is computed. Because the kernel also knows the length of the words (from the parameter with the word lengths), the SHA1 hashing function also knows where to insert the space characters ‘on the fly’.
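Decoding one of those 16-bit permutation values is just a per-letter bit test. Assuming ASCII letters (where bit 0x20 selects case), a device helper could look like:

__device__ void apply_case(char *word, int len, unsigned short mask)
{
    for (int i = 0; i < len; i++) {
        char c = (char)(word[i] & ~0x20);   // force the letter to uppercase
        if (mask & (1u << i))
            c |= 0x20;                      // bit i set -> letter i lowercase
        word[i] = c;
    }
}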

Once the thread has computed the hash for its particular permutation, it saves the thread location and hash ‘score’ (the Hamming distance between the computed hash and the target hash) into shared memory, and does a reduction/sort inside the thread block. Then, the first thread in each thread block writes its best score to global memory, where another reduction is performed to find the best value that was generated by the grid. The host can then read back this value and save it in a variable, so you know when you’ve found a value ‘better’ than any of the previous values. You could also use the bit permutations from that value to reconstruct the permuted string to print it out, etc. (which you’ll need to do if you want to submit an answer).
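The ‘score’ itself is cheap on the GPU: with the 160-bit hash held as five 32-bit words, the Hamming distance to the target is five XORs and population counts (__popc is a standard CUDA intrinsic):

__device__ int hash_score(const unsigned int h[5], const unsigned int target[5])
{
    int dist = 0;
    for (int i = 0; i < 5; i++)
        dist += __popc(h[i] ^ target[i]);   // count differing bits in this word
    return dist;                            // 0 would mean an exact preimage
}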

The host program simply needs to choose 12 words from the dictionary (any 12 words that you’d like) and call the kernel repeatedly, looping over which two words are chosen as the permutation words. Assuming that you use a ‘triangular’ loop here, there will be 78 choices of word pairs.
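For what it’s worth, a triangular loop that includes the diagonal (j starting at i) gives exactly the 12·13/2 = 78 pairs mentioned; launch_for_pair is a hypothetical stand-in for the kernel invocation:

for (int i = 0; i < 12; i++)
    for (int j = i; j < 12; j++)
        launch_for_pair(i, j);      // kernel launch permuting words i and j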

I think this should be a pretty good solution, since it’s almost totally compute bound. Only the reduction at the end reads/writes to global memory, and very little data is transferred back and forth between the host. The kernel is launched with a block size of 256 (for good occupancy) and a grid size of 32768x512, so it should keep even the fast cards busy (256 threads times 32768x512 blocks is 2^32 values). It should also scale across any number of cards very easily.

Also, I found a SIMD algorithm for SHA1 computation that I’m planning to use:
SHA1 using SIMD techniques: http://www.arctic.org/~dean/crypto/sha1.html

EDIT: The code doesn’t compile right now, as I’m in the middle of writing the hashing function, so it’s for reference only.

EDIT 2: A little more info – I’m not sure if the block size of 256 is best; I might change it to 128 or 64 (I’ll have to test it when I get it working, and make sure that each thread has enough memory).

This version seems to be broken for me. It runs full out (claiming 600+ M hashes/s on an 8600M) and always tells me the best score is 0.

Attached is the mac binary I’m using. Compiled the same way as the 0.11 version which worked.
gpusha1_osx_0.12.zip (65 KB)

prof, thanks for the code! I see you’re really trying to organize the thread reports.

Why use a bitonic sort on the thread results? Since we’re only concerned about the best one, why not just a min reduction?

Also I see this gets the best values for each block and writes them to the device memory, but do you plan a second kernel to do the reduction of all those block results?

Sounds like a kernel launch failure by the symptoms.

I did find a memory leak but that’d kick in only after days of computing. Hmm.

I’ll scan to see if there’s something else obvious.

Yep. If each thread computes one permutation, that makes 2^32 results. Then, a reduction is done on the block to get the minimum value (however you want: bitonic sort, min reduction, etc.), which should reduce it to 2^24 items (for thread blocks of size 256). Then, you can do another reduction on that global memory array to determine the minimum result of the entire grid. I was going to try to jam everything into one kernel to avoid having to call multiple kernels (to minimize the latency), by just making sure to use __threadfence() and __threadfence_block() where needed.
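For reference, here’s a stripped-down sketch of that single-kernel pattern, along the lines of the SDK’s threadFenceReduction sample (assumes a power-of-two block of 256 threads; the hash search itself is omitted and scores stands in for each thread’s Hamming score):

__device__ unsigned int blocksDone = 0;     // counts retired blocks

__global__ void min_reduce(const unsigned int *scores,
                           unsigned int *blockMin, unsigned int *best)
{
    __shared__ unsigned int smem[256];
    __shared__ bool amLast;
    unsigned int tid = threadIdx.x;

    // 1) per-block tree reduction down to this block's minimum score
    smem[tid] = scores[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s && smem[tid + s] < smem[tid]) smem[tid] = smem[tid + s];
        __syncthreads();
    }

    // 2) publish the block minimum, then ask "am I the last block to finish?"
    if (tid == 0) {
        blockMin[blockIdx.x] = smem[0];
        __threadfence();                    // make the write visible grid-wide
        amLast = (atomicInc(&blocksDone, gridDim.x) == gridDim.x - 1);
    }
    __syncthreads();

    // 3) only the final block reduces the per-block minima to one answer
    if (amLast) {
        unsigned int m = 0xffffffffu;
        for (unsigned int i = tid; i < gridDim.x; i += blockDim.x)
            if (blockMin[i] < m) m = blockMin[i];
        smem[tid] = m;
        __syncthreads();
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && smem[tid + s] < smem[tid]) smem[tid] = smem[tid + s];
            __syncthreads();
        }
        if (tid == 0) { *best = smem[0]; blocksDone = 0; }  // reset for the next launch
    }
}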

EDIT: I forgot to mention earlier that I thought of another optimization for my code, which would be to launch the grid with only the number of threads needed for the number of characters in the two strings. If you choose to permute two small strings (say, with four characters each), you’d be wasting a lot of computing time. You could probably adapt it to use the struct containing the word lengths, and use the given word indices to calculate what to permute and so forth…but it’d make it much more complicated, and I don’t even have it working yet ;)

Ah, the atomic memory trick so the final lone block can do the reduction. That’s exactly how it should be done! I remember smiling when I read the elegance of that approach in the SDK’s new threadfence demo code.

My thought of a second kernel launch is still old-school CUDA.

The register usage is too high for a compute 1.1 device. Each thread requires 58 registers, so for a block size of 192, that’s 11136 registers per block, more than the 8192 available per multiprocessor. Dropping the block size to 128 threads will work for devices below compute capability 1.2.
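Alternatively, if you’d rather keep 192-thread blocks on those cards, nvcc can cap the per-thread register count, spilling the rest to local memory (which may or may not be a net win; 8192 / 192 rounds down to 42):

nvcc -arch=sm_11 --maxrregcount=42 -o gpusha1 gpusha1.cu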