Asking for an algorithm for massive data processing

I have 1 billion strings, each with a maximum length of 16, stored on disk. How can I select the strings that appear multiple times? And how can I use the GPU to accelerate the processing?

Thanks for any suggestion!

This problem is limited by your hard drive speed, not by computation, so your GPU will likely not be needed.

The basic strategy may be to hash each of your strings down to 64 bits. Make a hash table of perhaps 2 billion entries. Insert each hash value into the table using the low bits of the hash as an index,
and store the high bits as your data. If there’s a collision, you can use linear probing or double hashing to find a new slot for it.

Now, as you store, if you see that same hash already present, you know (almost certainly, with an error of only about 1 in 2^63) that you’ve seen that string before.
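A minimal CPU-side sketch of that scheme (the FNV-1a hash and the table size here are illustrative choices, not requirements):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy duplicate detector: hash each string to 64 bits, index by the
// low bits, store the high bits as the payload, linear-probe on collision.
class DupDetector {
public:
    explicit DupDetector(size_t log2_slots)
        : mask_((1ULL << log2_slots) - 1), slots_(1ULL << log2_slots, 0) {}

    // Returns true if (almost certainly) the string was seen before.
    bool insert(const std::string& s) {
        uint64_t h = fnv1a64(s);
        uint32_t tag = static_cast<uint32_t>(h >> 32) | 1u;  // nonzero tag marks "occupied"
        size_t i = h & mask_;
        for (;;) {
            if (slots_[i] == 0) { slots_[i] = tag; return false; }  // empty: first sighting
            if (slots_[i] == tag) return true;   // same high bits: almost certainly a dup
            i = (i + 1) & mask_;                 // linear probing
        }
    }

private:
    static uint64_t fnv1a64(const std::string& s) {  // simple 64-bit FNV-1a hash
        uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        return h;
    }
    uint64_t mask_;
    std::vector<uint32_t> slots_;
};
```

At full scale you would size `log2_slots` to about 31 (2 billion slots) and keep the table load factor well under 1 so probe chains stay short.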

This is all very very fast, and can be done on the CPU. Your entire time will be waiting for data to spool off of your hard drive at 200 MB/sec if you’re lucky.

This algorithm will work for any length string, including variable lengths if you like, or structures.

The only downside is the memory used for the hash table, which is probably 8GB or so.

Wouldn’t it help to store 4 characters of each string (or a 32-bit hash of the string) in the 4GB of available memory on a Tesla card?
It might at least speed up the detection of likely collision candidates.


If this is the only processing you’re going to do, I would recommend an approach like SPWorley describes. Reading the data from disk and transferring it to and from the GPGPU will be your bottleneck.
With standard hash tables this should not be that difficult.
However, if you need the sequences on the GPGPU for more processing and the sequences are mostly unique, do as much as you can on the GPGPU. Do you have any idea about the frequency of the sequences?

Is this a one time event? Or maybe a few times each year? How about oldskool linux:
cat file1.seq file2.seq | sort | uniq -d

Thanks for your reply. But we have only 4GB of memory at most, and what’s more, the number of strings may not be limited to 1 billion; it may be 10 billion or even more…

So do I have to use the disk to hold the hash table? Or are there other methods to make a trade-off between memory and speed?

I think there must be a lot of collisions if I use only a 32-bit hash value (2^32 = 4G logical slots), because I have more than 1G strings.

Do you mean the sequences are not distributed completely randomly? They may not be.

This program may be used frequently, so coding convenience is not worth the loss of efficiency.

But I’m wondering whether I can achieve better performance than “uniq”…

Yes, you’ll need to use a collision resolution method like linear probing or double hashing.

Disk speed is going to be your limitation, so you don’t want to use more of it.

But if you have this added issue of only having a limited amount of RAM, it becomes trickier since you need to save enough state to detect the collisions!

But here’s a compromise to keep RAM use down, at the expense of speed. You use a two level hierarchy of hash tables and take two passes.

Allocate two tables of perhaps 1GB of RAM each, which you’ll use as simple bit-wise existence and duplicate tables. Basically, table 1 will have a bit set meaning “I have seen this hash before,” and table 2 will have a bit set meaning “I have seen this hash more than once.” This will not definitively resolve collisions; it just weeds out known-unique values so they don’t waste slots in the “big table.”

Stream in your strings. Convert each to a 64-bit hash. Take the lower 33 bits of the hash value (a 1GB table holds 2^33 bits) and set the single BIT it corresponds to in the first 1GB table. If that bit is already set, set the same bit in the SECOND table. [* There’s an extension here where you can make the first table larger in size than the second, forming a three-level hierarchy, but that gets more complicated.]

At the end of this, throw away the first table. You now have just one 1GB table of 8 billion bits. A bit is set if the hash value is POTENTIALLY a duplicate. (It still may not be, due to collisions.)

Now start over and stream your data again. Test each string against the bit table; if the bit is set, you have a potential duplicate, so also add the 64-bit hash value into your larger, detailed 3GB hash table. If you get a collision in that second table, you’ve found a duplicate.

Since not every string will be stored in the detailed hash table, you’re saving a lot of RAM.
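The two-pass scheme above can be sketched at toy scale like this (table sizes are tiny for illustration, and a `std::unordered_map` stands in for the big detailed hash table):

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Two-pass dedup: pass 1 marks "seen" and "seen more than once" bits per
// hash; pass 2 puts only the flagged candidates into a detailed table.
class TwoPassDedup {
public:
    explicit TwoPassDedup(size_t log2_bits)
        : mask_((1ULL << log2_bits) - 1),
          seen_((1ULL << log2_bits) / 8, 0),
          seen_twice_((1ULL << log2_bits) / 8, 0) {}

    // Pass 1: stream every string once, setting existence/duplicate bits.
    void pass1(const std::string& s) {
        uint64_t b = hash(s) & mask_;
        if (test(seen_, b)) set(seen_twice_, b);
        else set(seen_, b);
    }

    // Pass 2: stream again; returns true when the string is a real duplicate.
    bool pass2(const std::string& s) {
        uint64_t h = hash(s);
        if (!test(seen_twice_, h & mask_)) return false;  // hash seen once: skip
        return ++detailed_[h] > 1;  // only candidates reach the detailed table
    }

private:
    static uint64_t hash(const std::string& s) {  // 64-bit FNV-1a
        uint64_t h = 14695981039346656037ULL;
        for (unsigned char c : s) { h ^= c; h *= 1099511628211ULL; }
        return h;
    }
    static bool test(const std::vector<uint8_t>& t, uint64_t b) {
        return t[b >> 3] & (1u << (b & 7));
    }
    static void set(std::vector<uint8_t>& t, uint64_t b) {
        t[b >> 3] |= (1u << (b & 7));
    }

    uint64_t mask_;
    std::vector<uint8_t> seen_, seen_twice_;
    std::unordered_map<uint64_t, int> detailed_;
};
```

In the real thing, the `seen_` table would be freed after pass 1, and the detailed table would be the fixed-size open-addressed table described earlier rather than a `std::unordered_map`.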

You can balance the ratio of the sizes of the bit-wise and detailed hash tables based on your expectations and experiments. For 10B strings you’ll need a bigger bit table; you’ll likely want twice as many bits as strings or so, perhaps more.

If your number of strings becomes REALLY huge, more than say 20B, you’d switch to a multipass algorithm: on the first pass through you’d only look for dupes whose hash has the high bit set, and the second pass looks for dupes with the high bit clear. This kind of partitioning can be extended to any problem size you want by just using smaller partitions and more passes.
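The multipass partitioning can be sketched like this (the partition count and `std::hash` are illustrative, and a `std::set` stands in for the real per-pass hash table):

```cpp
#include <cstdint>
#include <functional>
#include <set>       // stand-in for a real open-addressed hash table
#include <string>
#include <vector>

// Multipass partitioning: P passes over the data; pass p only considers
// strings whose hash lands in partition p, so each pass's in-memory
// table is roughly 1/P of the full size.
std::vector<std::string> find_dups_multipass(const std::vector<std::string>& data,
                                             uint64_t P) {
    std::vector<std::string> dups;
    for (uint64_t p = 0; p < P; ++p) {
        std::set<uint64_t> seen;               // freed at the end of each pass
        for (const std::string& s : data) {    // re-stream the data each pass
            uint64_t h = std::hash<std::string>{}(s);
            if (h % P != p) continue;          // not this pass's partition
            if (!seen.insert(h).second)        // hash already present: duplicate
                dups.push_back(s);
        }
    }
    return dups;
}
```

Since a given string always hashes to the same partition, each pass sees all occurrences of its strings, so no duplicate is missed; the cost is re-reading the data P times.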

At 10B strings of 16 bytes each, that’s 160GB of data. Assuming a good hard drive read speed of 100MB/sec, the job should take an hour or so. CPU use will be low. Speed will be slower (likely significantly) if your strings aren’t tightly packed on the hard drive, so your speed will depend on that encoding (and drive speed of course) more than anything else.

I’m new to CUDA, and my first attempt to parallelize an algorithm has this problem as a subproblem.
Given the hashing approach, which hash algorithm is best suited for CUDA?
My sequential program uses CRC16, which was OK in terms of uniform distribution for this kind of problem.
Are there reference algorithms for kernel implementations of hash functions?
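For short fixed-width keys on a GPU you generally want a branch-free, table-free mixer rather than a table-driven CRC. One common, well-tested option is the 64-bit finalizer from MurmurHash3; since it is plain integer arithmetic, the same code works unchanged inside a CUDA `__device__` function (the `hash16` wrapper is just one way to combine two words, not a standard construction):

```cpp
#include <cstdint>

// MurmurHash3's 64-bit finalizer ("fmix64"): branch-free, table-free,
// all-register arithmetic, so it maps well onto individual GPU threads.
inline uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}

// Hash a 16-byte string viewed as two 64-bit words (assumed zero-padded).
inline uint64_t hash16(const uint64_t w[2]) {
    return fmix64(fmix64(w[0]) ^ w[1]);
}
```

Note that `fmix64` is invertible (each xorshift and odd-constant multiply is a bijection), so it never collides on single 64-bit inputs; collisions only come from reducing longer keys.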

Best regards,

Is this for some kind of homework or contest? There’s a thread on Stack Overflow about it from yet a third person.


at least mine is not related to a contest; it is an attempt to speed up a program that finds the lowest number of pushes for the “sokoban” game (Linux version here, for example:

The subproblem arises in a graph search algorithm (BFS): the algorithm needs to know whether a node has been visited already. Because the graph is not stored explicitly as a node/edge list, but is given implicitly through the game’s rules, it is not known whether a node has been created already. For a quick search, a hash table is used, with CRC16 as the hash value.

I’m doing this just to learn CUDA techniques.

Best regards,


Thank you, SPWorley. Your suggestion seems doable and efficient to me. I’ll try it!

So what is your application? Especially since multiple other people are interested in implementing it too.

I have two word sets, A and B; my application is to calculate the coverage rate of B using word set A. So I have to ensure every word in set A is unique first. Another method is to save the matched words of set B while traversing set A, to avoid counting a duplicate word multiple times; that could be easier to implement and demand fewer resources. But I think writing a separate module to make every element unique may be useful for the future.