Reduce pair set to unique entries...

For the sake of example, assume 6 threads and each thread is trying to write a pair (to where will be explained) like this, for example :

(0, 1) - thread 0
(0, 5) - thread 1
(1, 2) - thread 2
(2, 3) - thread 3
(3, 4) - thread 4
(4, 5) - thread 5

This can easily be written to a global bank without conflicts but that’s not quite what I want. I want to reduce the set so that each remaining pair has completely unique entries.

There is more than one permutation but it’s fine so long as the output follows the same rules.

Here’s one sample output :
(0, 1)
(2, 3)
(4, 5)

And here’s another :
(0, 5)
(1, 2)
(3, 4)

It’s all dependent upon which initial indices I say are unique.

This problem would be simple to brute-force on a CPU but on a GPU, it seems much more difficult. I’ve thought about modifying the write procedure to check for uniqueness but that sounds incredibly cumbersome.

How should I try to attack this problem?

ACTUALLY this won’t work, ignore for now.

  1. Assign a pair to each thread.

  2. Start with zero (or the first index).

  3. Amongst all threads that have a pair containing this index, only one wins (perhaps use atomics) and gets to place their pair in the result. Threads that don’t have a pair containing this index are inactive for this cycle (and in fact can race ahead to the next index to be tested/balloted).

  4. All threads that participated in this ballot now become inactive for the remainder of the cycles (i.e. they can exit).

  5. select the next index, and go to step 3, until there are no more indices

This method would require a ballot-box (atomic location) per index. It would also require a separate (atomic) location to keep track of the ordering of pairs making their way to the output.

Validating the results, regardless of the method you choose, may be harder than the algorithm itself.

I like your idea.

I was also thinking of a hash function, originally.

So, if I follow your example and start with the first index as the anchor, I can easily choose a point at random. Once selected, I can just write that point to the output.

But I’m having trouble ensuring that the second element of the pair isn’t repeated. I feel like if I keep everything sorted, use the atomicMin() function and the < operator, I can do it though. Maybe.

Yes, that was what I discovered about the sequence I outlined. It does not guarantee that the second element of the pair doesn’t get repeated.

Alright, this was the best I could come up with. I hope it’s easy to understand. Let me know what you think.

Compile with :

nvcc -gencode arch=compute_50,code=sm_50 -rdc=true -o test test.cu
#include <stdio.h>
#include <assert.h>

enum color { white, blue, red };

struct pair
{
    int first, second;

    __host__ __device__
    pair(int a, int b) : first(a), second(b) { };
};

struct args
{
    pair *pr;
    int *color; // color table
    bool *tbl; // hash table
    int *anchor;
};

__global__
void child(pair *pr, int *color, bool *tbl, int *anchor)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;

    if (i < 6)
    {    
        if (color[i] == blue || color[i] == red)
            return;

        if (tbl[pr[i].first] || tbl[pr[i].second])
        {
            //printf("%d, %d\n", pr[i].first, pr[i].second);
            color[i] = blue;
            return;
        }

        atomicMin(anchor, i);
    }
}

__global__
void parent(args data)
{
    pair *pr = data.pr;
    int *color = data.color;
    bool *tbl = data.tbl;
    int *anchor = data.anchor;

    do
    {
        tbl[pr[*anchor].first] = 1;
        tbl[pr[*anchor].second] = 1;
        color[*anchor] = red;
        *anchor = INT_MAX;
        child<<<1, 6>>>(pr, color, tbl, anchor);
        cudaDeviceSynchronize();

        //printf("New anchor %d\n", *anchor);
    } while (*anchor != INT_MAX);

    for (int i = 0; i < 6; ++i)
        if (color[i] == red)
            printf("%d, %d\n", pr[i].first, pr[i].second);
}

int main(void)
{
    pair *data = 0;
    int *color = 0;
    bool *tbl = 0;
    int *anchor = 0;

    cudaMallocManaged(&data, 6 * sizeof(*data));
    cudaMallocManaged(&color, 6 * sizeof(*color));
    cudaMallocManaged(&tbl, 6 * sizeof(*tbl));
    cudaMallocManaged(&anchor, sizeof(*anchor));

    (data + 0)->first  = 0;
    (data + 0)->second = 1;

    (data + 1)->first  = 0;
    (data + 1)->second = 5;

    (data + 2)->first  = 1;
    (data + 2)->second = 2;

    (data + 3)->first  = 2;
    (data + 3)->second = 3;

    (data + 4)->first  = 3;
    (data + 4)->second = 4;

    (data + 5)->first  = 4;
    (data + 5)->second = 5;

    for (int i = 0; i < 6; ++i)
        color[i] = white;

    cudaMemset(tbl, 0, 6 * sizeof(*tbl));
    *anchor = 0;

    args dta;
    dta.pr = data;
    dta.tbl = tbl;
    dta.color = color;
    dta.anchor = anchor;

    parent<<<1, 1>>>(dta);
    cudaDeviceSynchronize();

    cudaFree(anchor);
    cudaFree(tbl);
    cudaFree(color);
    cudaFree(data);

    return 0;
}