Hi, I have a problem where I need to find the duplicate values in an array, counting the number of times each element appears in it. This must be done for every element of the array.

I didn’t find any solutions or ideas in previous topics about this, so if somebody can point me in the right direction or toward an idea, I would appreciate it.

Let me describe my problem in a better way:

I have this array:

[ 1 1 2 2 3 4 ]

I know the range of the values in this array is [0,4], so every value in the array falls within this range.

I want the output of this kernel to be this:

index: 0 1 2 3 4
[ 0 2 2 1 1 ] -> this will be the output

That way, I know how many times each value appears in the array.

The size of the array grows much larger, but not by an enormous amount. My real array looks more like this:

[ 1 1 2 2 3 4 0 0 2 3 1 1 4]

Can you see that there are subdivisions in the array? (These subdivisions are known.)

Any help pointing me in a direction for this on the GPU?

I have this implemented on the CPU, which is a very easy solution, but it’s not parallel.
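For reference, the easy sequential version is presumably something like the following minimal C sketch (the function name and signature are mine; the thread doesn’t show the actual CPU code):

```c
#include <stddef.h>

/* Count how many times each value in [0, range) appears in data.
 * counts must have room for `range` entries; it is zeroed here.
 * Assumes every data[i] satisfies 0 <= data[i] < range. */
void histogram_cpu(const int *data, size_t n, int *counts, int range)
{
    for (int v = 0; v < range; ++v)
        counts[v] = 0;
    for (size_t i = 0; i < n; ++i)
        counts[data[i]]++;
}
```

On the example array [ 1 1 2 2 3 4 ] with range 5 this produces the counts per index described above.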

What about keeping a small ‘array’ in shared memory/registers for each thread that holds the number of times it has seen each number (in your case, 5 counters each)? Then take a look at the final reduction kernel (the ‘best’ one), which shows a fast way for each thread to loop over a bunch of values. Once all the values have been processed, have each thread save its mini-array to global memory, then do a reduction to compute the sum across the mini-arrays.
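The scheme above can be sketched on the CPU like this (plain C standing in for CUDA; each loop iteration over `t` plays the role of one thread with a private mini-array, and the striding mimics how a reduction-style kernel walks the data — all names here are mine):

```c
#include <stddef.h>

#define NBINS    5   /* values are in [0, 4] */
#define NTHREADS 4   /* stand-in for the CUDA block size */

/* Each "thread" t strides through the data accumulating into its own
 * private mini-array, then the mini-arrays are summed into counts. */
void histogram_private(const int *data, size_t n, int *counts)
{
    int mini[NTHREADS][NBINS] = {{0}};

    for (int t = 0; t < NTHREADS; ++t)           /* one pass per "thread" */
        for (size_t i = t; i < n; i += NTHREADS) /* strided loop over data */
            mini[t][data[i]]++;

    for (int b = 0; b < NBINS; ++b) {            /* the final reduction */
        counts[b] = 0;
        for (int t = 0; t < NTHREADS; ++t)
            counts[b] += mini[t][b];
    }
}
```

On the GPU, the inner accumulation would run concurrently across threads and the final sum would be a separate reduction step over the per-thread arrays in global memory.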

Also, isn’t this basically a histogram? There are kernels for this included in the SDK; perhaps you could modify them for your needs.

OK, here’s a simple trick. Let a thread block have 128 threads. For each thread, let the corresponding array value be X, and calculate S = 1 << (8*X). Then use a parallel add within the block to sum up all S. The output for each block (written from a single thread) is the sum. The cute trick here is that you don’t have to worry about carries between bytes, since the block size is less than 256, so no per-byte counter can overflow into its neighbor. (Only the values 0–3 are packed; a shift of 8*4 = 32 wouldn’t fit in a 32-bit word, which is why the 4’s are handled in the second pass.)

You can then use a parallel sum kernel for the second pass (you have to unpack the results of the first pass, and calculate the number of 4’s based on the block size minus the number of values in the range 0-3). The output of the second pass is a simple array [0…4] of counts for each block.
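The packing and unpacking can be shown in plain C (the parallel add becomes an ordinary loop here; function and constant names are mine). Values 0–3 each contribute one byte of a 32-bit accumulator, and the 4’s are recovered by subtraction:

```c
#include <stddef.h>
#include <stdint.h>

/* Packed-counter trick for values in [0, 4], valid as long as n is at
 * most the "block size" (< 256), so each byte counter cannot overflow.
 * Values 0..3 are packed as 1 << (8*X); 4's are derived by subtraction. */
void histogram_packed(const int *data, size_t n, int *counts)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        if (data[i] < 4)
            sum += (uint32_t)1 << (8 * data[i]);

    for (int v = 0; v < 4; ++v)
        counts[v] = (int)((sum >> (8 * v)) & 0xFF); /* unpack one byte */

    counts[4] = (int)n - counts[0] - counts[1] - counts[2] - counts[3];
}
```

In the real kernel the additions into `sum` would be the intra-block parallel add, and the unpacking would happen in the second pass.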

As needed, recursively run more parallel add routines to sum up the results and produce additional count arrays per block.

Just to add some information that I probably forgot:

My usual values range from 0 to 1,000,000 (1 million). It can get worse, of course :P

And the real array, the one I showed second (but explained poorly), works like this:

[ 1 1 2 2 3 4 0 0 2 2 3 1 1 4]

There are 3 subsets of the array:

1 1 2 2 3 4

0 0 2 2 3

1 1 4

As you can see, whenever an element appears in two different subsets, it appears the same number of times in each. For example, the number 1 appears in subsets #1 and #3, twice in both; the same applies to the numbers 4 and 2.

The result of the kernel I’m looking for will be this (applied to the real array I just listed):

index: 0 1 2 3 4
[ 2 2 2 1 1 ]

I think the kernel can be applied to each subset without a problem, so I’m not too worried about that (it could easily be extended, too). I’m more worried about the size of the input array.
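Since the subdivisions are known, the per-subset idea can be sketched as follows (plain C, my own naming; `offsets` is a hypothetical boundary array with `nsubsets + 1` entries). Because each value is stated to occur the same number of times in every subset it appears in, combining the per-subset counts by taking the maximum yields exactly the desired output:

```c
#include <stddef.h>
#include <stdlib.h>

/* offsets[k] .. offsets[k+1] delimits subset k of data.
 * counts must hold `range` entries; it receives, per value, the count
 * from any subset containing that value (the max over subsets). */
void histogram_subsets(const int *data, const size_t *offsets, int nsubsets,
                       int *counts, int range)
{
    int *local = malloc((size_t)range * sizeof *local);

    for (int v = 0; v < range; ++v)
        counts[v] = 0;

    for (int k = 0; k < nsubsets; ++k) {
        for (int v = 0; v < range; ++v)          /* count one subset */
            local[v] = 0;
        for (size_t i = offsets[k]; i < offsets[k + 1]; ++i)
            local[data[i]]++;
        for (int v = 0; v < range; ++v)          /* merge by maximum */
            if (local[v] > counts[v])
                counts[v] = local[v];
    }
    free(local);
}
```

On the example array with subsets {0..5}, {6..10}, {11..13}, this yields the [ 2 2 2 1 1 ] shown above.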

I liked the histogram idea; I’ll take a look and see if it works for me. How fast is the histogram, btw?