Hi folks,
I’m trying to generate a lookup table. example query = ACTGGTC, example related data = 1,200483

I have some strings, with which I want to associate some numeric data. Ideally I would be able to set up a Perl style hash where they keys are my strings, and the values are my numeric data.
I have read teh sequence processing related papers, but haven’t found the magic bullet. I would like to use textures for this, but really, if anyone can tell me any way to do this I’d be happy.

I’m not sure I understand where CUDA fits into this situation though…can you elaborate a little more? Are you trying to generate hashes from the input strings using CUDA?

Here is some more detail about what I would like to do:

I have a number of short DNA sequences, and their respective genomic locations, a simple example:
AAAA 1111
CCAA 2222
CCTT3333
ACTG 4444
etc.

I have many of these (think 10000 to many millions). I would like to be able to load these pairs into memory on the device and then, when presented with a sequence such as AAAA, get back the related genomic location.

I have an idea for you that might work, but first, the two associated problems:

If you have strings longer than a certain length (I think something like 6 characters, for the method below) it won’t work due to integer overflow.

If some sequence are found a large number of times, while others are only found a few times, you might have some issues allocating memory…but you could always do some kind of histogram, transfer back to the host, allocate the memory for each sequence (for it’s location pointers), and go from there.

Here’s the idea:

Since you only have four ‘characters’ (i.e. bases), you can use a simple method to identify each string uniquely (whether on the host or the device). Assign each 2-tuple of the base and the position in the string, e.g. (A, 3) a prime number. For example:

Position 1:
A = 2
C = 3
G = 5
T = 7
Position 2:
A = 11
C = 13
G = 17
T = 19
etc.

Following that pattern, ACTG would be assigned the ID of 45214 (2 * 13 * 37 * 47). Due to the properties of prime numbers, no other string would ever be able to have this ID, so whenever you see 45214, you would know that the represented string is ACTG.

Then, to use this method, you could first calculate the ID numbers of each sequence you’re going to use (which could be calculated once and stored for re-use). Convert the sequence array into a 32-bit or 64-bit unsigned integer sequence/array and copy to the device. You could then perform a histogram on the values to determine how often each one occurs. If you’ve already allocated 4*N bytes (where N is the number of sequences you are processing), you can use the cumulative values of the histogram to compute the ‘starting’ memory location for an array of device pointers in memory that will store pointers to wherever that sequence appears in your original input DNA.

Note that you could also use the prime number trick ‘recursively’; that is to say, if you know you’re only looking for any of the 4-base-pair combinations, you could assign them to the prime values in order (i.e. AAAA = 2, AAAC = 3, AAAG = 5, AAAT = 7, CAAA = 11, etc.) and then use the same method as described above to reduce the amount of data required for each sequence (and so, fit longer sequences into the device memory).

P.S. Oddly enough, I learned this method when reading about computer-based poker hand evaluators.

THis is a good idea for short sequences where the combinatorics don’t get too big, nice work.
Unfortunately, the example I was giving was a simplification of the problem.

In practise I would have sequences of up to 50 characters.
I had thought of encoding each letter as a 3 bit representation (there are five characters A,C,T,G,\0), so then a string of 50 characters would be 50 bits, or 19 bytes. But unfortunately 2^150 goes way outside of my addressable space.

So are you just trying to determine the locations in the overall DNA sequence for each smaller sequence? If so, you might still try some sort of recursive tree method based on what I wrote above, then do a scan or reduction of some kind of combine the trees and determine which actual sequence you have found.

The product-of-primes idea is clever. Especially for poker hands, where they are invariant to the order of the cards, because the encoding is automatially invariant also. But in this case, 4 possible letters fits very well into 2 bits. I’d say record a length (one byte) and then four bases per byte after that. So simple. When the encoding is longer than your hash table key, you can reduce it by calculating the modulus with respect to some appropriately large prime number.

But I’d say the hash function (or whatever encoding) itself is not the most difficult part. Actually implementing the hash table is complicated, especially trying to update the data without thread conflicts and race conditions.

If your values are fixed it may be simpler to sort them once and just do a binary search, which should be straightforward to implement. Then if your implementation is still too slow you can consider a hash table.