Fastest Lookup Times

To help myself learn CUDA, I decided to write an NTLM hash bruteforcer. Currently, each thread checks a constant array to see if the hash the thread generates is in there. I came upon something interesting that contradicted what I learned in my CS courses: linearly searching through the array is actually faster than a binary search. I believe this is because of thread divergence, but I didn't expect a linear search through ~256 elements to be faster than a binary search.
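For reference, the check each thread does looks roughly like this (simplified; the array size, names and 4-word layout are just placeholders for the sketch, not the actual cracker code):

__constant__ unsigned int d_hashes[256 * 4];   // ~256 NTLM hashes, 4 x 32-bit words each

__device__ bool hash_in_table(const unsigned int h[4], int num_hashes)
{
    for (int i = 0; i < num_hashes; ++i) {
        // All threads of a warp read the same index at the same time,
        // so the constant cache can broadcast each value to the whole warp.
        if (d_hashes[i * 4 + 0] == h[0] &&
            d_hashes[i * 4 + 1] == h[1] &&
            d_hashes[i * 4 + 2] == h[2] &&
            d_hashes[i * 4 + 3] == h[3])
            return true;
    }
    return false;
}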

I am wondering something else now: if I implement my hash storage as a hash table, will I get faster lookup times compared to binary and linear searching? Has anyone had any experience with this?

Constant memory is very fast if all threads access the same memory location; even so, it still should not be 256 times slower.

For things like an NTLM cracker, a hash table is indeed a good choice, especially if you plan to support many (hundreds or thousands of) hashes simultaneously.
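As an illustration, a minimal open-addressing lookup could look something like this (the table size, the empty-slot marker, the 32-bit keys and the trivial hash are all assumptions made for the sketch; a real NTLM table would store the full 128-bit hash):

#define TABLE_SIZE 4096          // power of two, sized well above the number of hashes
#define EMPTY_SLOT 0xFFFFFFFFu

__device__ bool table_contains(const unsigned int *table, unsigned int key)
{
    unsigned int slot = key & (TABLE_SIZE - 1);      // cheap hash: low bits of the key
    // Linear probing: with a sparse table this usually ends after one or two reads.
    for (int probe = 0; probe < TABLE_SIZE; ++probe) {
        unsigned int stored = table[(slot + probe) & (TABLE_SIZE - 1)];
        if (stored == key)        return true;
        if (stored == EMPTY_SLOT) return false;      // empty slot reached: key is absent
    }
    return false;
}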

Tell me if I am thinking about this correctly:

128 elements to search

Linear search:

128 * 1 = 128

128 lookups per thread, no warp divergence

Binary search:

log2(128) * 32 = 7 * 32 = 224

Avg 7 lookups per thread, but with warp divergence.

256 elements is where they are equal (256 * 1 = 256 = log2(256) * 32).

I’m not sure the threads will actually be divergent. They will access different memory locations, but accessing different locations is not the same as branch divergence.

Well, at least if you know the number of items in advance, you can implement binary search without branches.
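One way to do that, assuming the element count is a compile-time power of two (128 here, just for the example), is to turn every step into a conditional update of the lower bound, which the compiler will usually emit as a predicated select rather than a divergent branch:

__device__ int branchless_lower_bound(const unsigned int *sorted, unsigned int key)
{
    int base = 0;
    #pragma unroll
    for (int step = 128 / 2; step > 0; step >>= 1) {
        // Typically compiles to a predicated add, not a branch.
        if (sorted[base + step - 1] < key)
            base += step;
    }
    // Index of the candidate slot; compare sorted[base] == key to test membership.
    return base;
}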


Hash tables are usually best for simple lookups. Sure, they waste some bandwidth, but they should only need one lookup to find the answer immediately. (It does get trickier if you have a poor hash function or the table is full enough that collisions become common.)

If you are doing binary search in CUDA, it can pay to reorder your linear array into an 8-way sorted tree on G200 or a 16-way heap on G80.
The idea is that you’re reading a minimum of 8 words anyway (the memory transaction size on G200), so you can descend log2(8) = 3 times as fast by using all 8 values to decide which chunk to visit next instead of one value at a time. On G80 transaction sizes are bigger, so you may as well go 16 at a time and get a 4X speedup.

No need for pointers in the tree; its structure is determined by the size of the array alone, so there’s no space overhead at all.
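As a rough sketch of one search step under one possible implicit layout (this assumes each node stores 8 sorted keys contiguously and has 9 children, child j of node n being node n*9 + j + 1; the real layout can be chosen differently):

__device__ int eight_way_search(const unsigned int *tree, int num_nodes, unsigned int key)
{
    int node = 0;
    while (node < num_nodes) {
        const unsigned int *keys = &tree[node * 8];   // 8 words = one G200 transaction
        int child = 0;
        #pragma unroll
        for (int j = 0; j < 8; ++j) {
            if (keys[j] == key) return node * 8 + j;  // found: return flat index
            if (keys[j] < key)  child = j + 1;        // key belongs in a later subtree
        }
        node = node * 9 + child + 1;                  // descend to the chosen child
    }
    return -1;                                        // not present
}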

Constant memory isn’t great for random array access because of the memory serialization of constant access.
If you’re dealing with such a short table, put it into shared memory and you’ll have FAST random access.
Even texture memory may be better. It’s not hard to try all three and just measure the speed.
Shared memory will win big time if your data can all fit.
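A quick way to try the shared-memory variant is to stage the table once per block and search the staged copy (the size, names and kernel signature below are only for illustration):

#define NUM_HASHES 256

__global__ void crack_kernel(const unsigned int *g_hashes /*, ... other args ... */)
{
    __shared__ unsigned int s_hashes[NUM_HASHES];

    // Cooperative copy: each thread loads a few entries.
    for (int i = threadIdx.x; i < NUM_HASHES; i += blockDim.x)
        s_hashes[i] = g_hashes[i];
    __syncthreads();

    // ... generate the candidate hash, then search s_hashes; random access into
    // shared memory stays fast as long as bank conflicts are low ...
}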

That’s my understanding as well. The CUDA profiler helps determine the amount of divergence (unfortunately it doesn’t help fix it).