How to implement a recursive combination finder in CUDA or OpenCL for large datasets?

I have a problem involving several thousand lists of integers, each containing thousands of elements. My goal is to find a combination of elements (one from each list) such that the method IsCombinationValid returns true. Here’s how the validation works:

  • The method IsCombinationValid accepts the list of chosen values and checks if they form a valid combination.
  • If the input (e.g., x, y, z) is invalid, all combinations starting with that input are also invalid, so they can be skipped.

Currently, I am using a recursive approach in C++. I choose one element from a list, check it along with the previously chosen values, and if invalid, move to the next element. If all elements in a list are invalid, I backtrack to the previous list and try its next element, and so on. Here’s my C++ code:

// Note: isFound and IsCombinationValid are defined elsewhere; isFound is a
// flag shared across threads, set once any thread finds a valid combination.
void Finder(uint32_t threadIdx, uint32_t firstListStartElementIdx, uint32_t firstListEndElementIdx, std::vector<std::vector<uint32_t>> &lists) {
    std::vector<uint32_t> listsCurrentElementIdx; // one chosen element index per list, used as an explicit stack
    listsCurrentElementIdx.push_back(firstListStartElementIdx);
    size_t firstListElementCount = lists[0].size();

CheckCurrentList:;
    if (isFound) return; // another thread already found a valid combination

    uint32_t currentListIdx = listsCurrentElementIdx.size() - 1;

    if (currentListIdx == 0) {
        if (listsCurrentElementIdx.back() >= firstListEndElementIdx) {
            printf("### failed : threadIdx %u\n", threadIdx);
            return;
        } else {
            listsCurrentElementIdx.push_back(0);
            goto CheckCurrentList;
        }
    }

    std::vector<uint32_t> &currentList = lists[currentListIdx];
    uint32_t &currentListElementIdx = listsCurrentElementIdx.back();

    while (currentListElementIdx < currentList.size()) {
        if (!IsCombinationValid(lists, listsCurrentElementIdx)) goto CurrentListElementInvalid;

        // Valid combination found
        if (lists.size() == listsCurrentElementIdx.size()) {
            isFound = 1;
            // Print combination
            return;
        }
        listsCurrentElementIdx.push_back(0);
        goto CheckCurrentList;

    CurrentListElementInvalid:;
        ++currentListElementIdx;
    }

    // All elements of the current list failed: backtrack to the previous
    // list and try its next element.
    listsCurrentElementIdx.pop_back();
    listsCurrentElementIdx.back()++;

    goto CheckCurrentList;
}

In this code, the first list is divided among threads. This approach works but is infeasible for large lists, even with multithreading on a CPU. I want to implement this in CUDA or OpenCL to leverage GPU parallelism, but I am new to GPU programming.

Problem:

I don’t know how to efficiently divide the work among GPU threads for this problem:

  1. Each thread might need to perform backtracking independently, and some threads may finish early without finding a valid combination, leading to idle threads.

  2. Synchronizing the work and sharing partial results across threads seems complex in this recursive, backtracking problem.

Could someone guide me on how to structure this problem for CUDA or OpenCL? Are there any examples or patterns I can follow for similar recursive problems that involve early termination and backtracking? Any advice on dividing the workload and handling idle threads would also be helpful.

So if you have a partial list (just the first few elements), the function either says "invalid" — and you can skip every combination beginning with that prefix — or "valid", and you can extend it into longer lists?

What is the expected number of 1-element partial lists, 2-element partial lists, …

Can those partial hits be stored somewhere or are there too many and they have to be tested on the fly?

If you could only determine a valid combination with a full list, you would have to try out perhaps 5000^3000 combinations, which is quite a huge number, even for GPUs and probably even for compute clusters, even if the test per combination runs reasonably fast.

Those problems are often solved by heuristic approaches like genetic algorithms, instead.

If you can give a few more numbers to show that backtracking is feasible, and what sizes we are talking about, we can discuss approaches for doing this on GPUs.

You can also look into how chess engines manage to evaluate deep trees in a sensible amount of time, which is a similar problem (but giving an evaluation for each position instead of a binary result):

  • Alpha-Beta Pruning
  • Iterative Deepening
  • Move Ordering
  • Monte Carlo Tree Search
  • Neural Networks

If only very few partial lists survive, I would first do a run calculating all valid partial lists of length 1 and store them. Then all partial lists of length 2, and so on. If these are only a few lists, storing and reloading them does not take much time. You would then need no backtracking, only to attach a single element to each loaded list and test whether the result is valid.
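The level-by-level idea can be sketched on the CPU first. This is a minimal sketch: the even-running-sum predicate `IsPartialValid` is only a stand-in for the real validity check, and no attempt is made at parallelism or memory limits here.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in validity check: a partial combination is "valid" if its running
// sum is even. Replace with the real IsCombinationValid.
bool IsPartialValid(const std::vector<uint32_t>& partial) {
    uint32_t sum = 0;
    for (uint32_t v : partial) sum += v;
    return sum % 2 == 0;
}

// Level-by-level expansion: keep all valid partial combinations of length n,
// then extend each by one element from the next list and re-validate.
// No backtracking is needed, because invalid prefixes are never kept.
std::vector<std::vector<uint32_t>> FindValid(const std::vector<std::vector<uint32_t>>& lists) {
    std::vector<std::vector<uint32_t>> frontier{{}}; // start with one empty prefix
    for (const auto& list : lists) {
        std::vector<std::vector<uint32_t>> next;
        for (const auto& prefix : frontier) {
            for (uint32_t v : list) {
                std::vector<uint32_t> candidate = prefix;
                candidate.push_back(v);
                if (IsPartialValid(candidate))
                    next.push_back(std::move(candidate));
            }
        }
        frontier = std::move(next); // valid prefixes, now one element longer
    }
    return frontier; // full-length valid combinations
}
```

Note that this only stays cheap if few partial lists survive each level; otherwise the frontier grows combinatorially, which is exactly the storage question raised above.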

Thanks for your reply. In my simple tests on the CPU with small inputs, increasing the number of threads results in faster finding times. Most threads don’t get further than a few lists before backtracking. I just need hundreds of thousands of threads, and I think that’s possible using several GPUs. Implementing and optimizing it in CUDA is the only problem for me.

What overall running time are you aiming for? Microseconds, milliseconds, seconds, or longer?

If the overall time is not very short, I would start as written above (as it is simple and perhaps efficient for all but very short running times):

  • Run the kernels in a loop with ever-increasing partial list lengths.
  • The first kernel invocation finds valid lists of length 1, the second invocation lists of length 2, …
  • As a result, the valid lists of each length are stored and then reloaded by the next kernel.

Each thread block could have its own input + output store (in global memory) to reduce synchronization between thread blocks.

To know where to store valid results, you would use an index in shared memory, which is atomically incremented.

This simpler approach could be a baseline for more sophisticated GPU algorithms using backtracking, shared memory, …

I need it to finish within a few minutes, e.g. 10 or so. Is there any sample of the algorithm you mentioned?

I can give you some overview.

Something like this (not tested):

int* list0;
int* list1;
int* size0;
int* size1;

#define NO_SMS 200
cudaMalloc(&list0, NO_SMS * 1048576 * sizeof(int)); // 1048576 private ints per block
cudaMalloc(&list1, NO_SMS * 1048576 * sizeof(int));
cudaMalloc(&size0, NO_SMS * sizeof(int)); // one count per block
cudaMalloc(&size1, NO_SMS * sizeof(int));

for (int i = 0; i < 3000; i += 2) { // each loop iteration tests combinations one element longer, twice
    kernel<<<NO_SMS, 256>>>(list0, size0, i, table[i], &tablelen[i], list1, size1);             // list0: input, list1: output
    kernel<<<NO_SMS, 256>>>(list1, size1, i + 1, table[i + 1], &tablelen[i + 1], list0, size0); // list1: input, list0: output
}

/*
kernel takes partial valid lists of length "length" and builds and tests lists of length "length + 1"

int* input: pointer to array with valid combinations one shorter. Each block has 1048576 (private) integers.
int* inputamount: How many partial lists are used for each block
int length: length validated so far
int* table: single table with new integers
int* tablelength: amount of new integers
int* output: pointer to array with output combinations. Each block has 1048576 (private) integers
int* outputamount: How many are stored for each block
*/
__global__ void kernel(int* input, int* inputamount, int length, int* table, int* tablelength, int* output, int* outputamount)
{
    __shared__ int outputidx; // block-wide atomic variable for index in output array
    if (threadIdx.x == 0)
        outputidx = 0;

    __syncthreads();

    int iamount = inputamount[blockIdx.x];
    int tlength = *tablelength;
    for (int i = threadIdx.x; i < iamount; i += blockDim.x) { // block-stride loop over this block's shorter inputs
        for (int j = 0; j < tlength; j++) {                                      // loop over new integers
            int* iptr = &input[blockIdx.x * 1048576 + i * length]; // input one shorter
            bool valid = validate(iptr, table[j]);                           // validate with new integer
            if (valid) {
                int oindex = atomicAdd(&outputidx, 1);             // output index
                int* optr = &output[blockIdx.x * 1048576 + oindex * (length + 1)]; // output pointer
                for (int k = 0; k < length; k++)
                    *(optr++) = *(iptr++);                                          // copy one shorter to output
                *optr = table[j];                                                       // append new integer
            }
        }
    }

    __syncthreads();

    if (threadIdx.x == 0)
        outputamount[blockIdx.x] = outputidx; // publish this block's output count
}

You have to insert some special handling for length == 0 into the kernel or write a separate kernel for length 0.
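A separate seed kernel for length 0 could look roughly like this (untested sketch in the same style as the kernel above; validateSingle() is a placeholder for your single-element validity check):

```cuda
// Untested sketch: fills each block's private output area with the valid
// length-1 partial lists, so the main kernel can start from length 1.
// validateSingle() is a hypothetical placeholder, not an existing function.
__global__ void seedKernel(int* table, int* tablelength, int* output, int* outputamount)
{
    __shared__ int outputidx; // per-block atomic output counter
    if (threadIdx.x == 0)
        outputidx = 0;

    __syncthreads();

    int tlength = *tablelength;
    // each block processes a strided slice of the table
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < tlength; j += gridDim.x * blockDim.x) {
        if (validateSingle(table[j])) {
            int oindex = atomicAdd(&outputidx, 1);
            output[blockIdx.x * 1048576 + oindex] = table[j]; // store length-1 partial list
        }
    }

    __syncthreads();
    if (threadIdx.x == 0)
        outputamount[blockIdx.x] = outputidx;
}
```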


Thank you very much. I’m going to test it.

I hope everything is understandable for you. It is example code showing the concept rather than finished, tested code.
