Hash tables could work for this purpose. You need a fast hash function (not necessarily with cryptographic strength). Then do the atomic counts based on a resulting integer hash. If you use e.g. 24 bit hashes, the entire table would still fit nicely into global memory. Hash bins that result in a collision will require special handling though.
Another idea may be to construct a tree from all the unique strings. The tree maps each string to an index stored in the leaf nodes and the atomic counting of the string histogram will be based on this index. The tree will be traversed like a decision tree: each level n of the tree corresponds to the n’th character of a string. The terminating null character of each string also has to be considered in the tree. Whenever a new string is encountered that differs at the given character (or level), you open up a new branch with a new index at the leaf node (use another atomic counter to generate a unique, incremental index)
General advice: before trying anything in CUDA, you should be able to solve the same problem in C or C++. It does not look like you have a concept of how to address the problem yet. So going to CUDA is too early at this stage.