A hash table is almost certainly the right idea. Just add each value to the hash table.
At the end, read the table entries right off.
To parallelize, split your data into N per-block lists and build a hash table in each block. Combine the per-block results just by concatenating them, a kind of “hash table reduction”. Each block will likely have room for only a few thousand values in shared memory, so you can’t do too many values at once.
Repeat the “unique and merge” step until you have a final list of unique values. You may end up doing the final rounds on the CPU, where you have more memory for one big table… it depends on your data size.
Now, doing a hash table in shared memory is a fun algorithm by itself!
I’d do it something like a linear probe hash table, with a simple hash function of something like index=((value*0xDEADBEEF)>>18)&4095. (Note the >>18 alone leaves 14 bits, 0 to 16383; the &4095 mask brings it into range for a 4096-entry table.) That gives you a sort-of randomized slot from 0 to 4095. If hash[index] is unused (probably you initialized with a flag value like 0xFFFFFFFF) then write your value into that slot. If the slot already holds your value, stop: you’ve found a duplicate, which is the whole point.
Else, try the next slot (index=(index+1)&4095) and test again. [dumb linear probe!]
If the slot is EMPTY (= the flag value), write your value into it. But this is tricky too: what if you get a write collision where multiple threads want the same slot? Answer: after writing, read the slot right back and see if your write succeeded. If it didn’t, some other thread “won” the write, and you keep linear probing. I think you can get away without __syncthreads(), since you don’t really care who wins, you’re just detecting that somebody won. You could use shared-memory atomics to be sure, but a simple post-check will likely succeed… experimentation is needed. The PROPER way to do it is a __syncthreads() after each write, before the success check.
Reading the results off at the end is just visiting the shared memory array and keeping any value that’s not the flag 0xFFFFFFFF.
Each block uses its warps to process, say, 2000 values. You can’t process too many values per block, since your fixed-size hash table has to hold all the unique ones; filling the table to about 1/2 to 2/3 of its capacity is probably the efficiency sweet spot, since linear-probe chains get long as the load factor approaches 1.
Your next problem comes if you have so many unique values overall that GPU shared memory just can’t hold them all in one block. The CPU version (using a MUCH larger table) would work nicely, but you could also stay on the GPU by sort partitioning after a round or two of unique reduction: each block computes the median of its values, then writes TWO lists out, one holding the values above the median and one with those below. On the next pass you merge all the highs together, and similarly all the lows, and keep repeating. It’d take some bookkeeping, but it would converge to the right answer.