How can this combination algorithm be modified to run in parallel on a CUDA-enabled GPU?

I currently have a C program that generates all possible combinations of a character string. Please note combinations, not permutations. This is the program:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    //Constants
    static const char set[] = "abcd";
    static const int setSize = sizeof(set) - 1;

    void brute(char* temp, int index, int max){
        //Declarations
        int i;
        for(i = 0; i < setSize; i++){
            temp[index] = set[i];
            if(index == max - 1){
                printf("%s\n", temp);
            }
            else{
                brute(temp, index + 1, max);
            }
        }
    }

    void length(int max_len){
        //Declarations
        char* temp = (char *) malloc(max_len + 1);
        int i;

        //Execute
        for(i = 1; i <= max_len; i++){
            memset(temp, 0, max_len + 1);
            brute(temp, 0, i);
        }
        free(temp);
    }

    int main(void){
        //Execute
        length(2);
        getchar();
        return 0;
    }

The maximum length of the combinations can be modified; it is set to 2 here for demonstration purposes. As currently configured, the program outputs

a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, cb, cc...

I’ve managed to translate this program into CUDA:

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    //On-Device Memory
    __constant__ char set_d[] = "abcd";
    __constant__ int setSize_d = 4;

    __device__ void brute(char* temp, int index, int max){
        //Declarations
        int i;
        for(i = 0; i < setSize_d; i++){
            temp[index] = set_d[i];
            if(index == max - 1){
                printf("%s\n", temp);
            }
            else{
                brute(temp, index + 1, max);
            }
        }
    }

    __global__ void length_d(int max_len){
        //Declarations
        char* temp = (char *) malloc(max_len + 1);
        int i;

        //Execute
        for(i = 1; i <= max_len; i++){
            memset(temp, 0, max_len + 1);
            brute(temp, 0, i);
        }
        free(temp);
    }

    int main(){
        //Execute
        cudaSetDevice(0);

        //Launch Kernel
        length_d<<<1, 1>>>(2);
        cudaDeviceSynchronize();

        getchar(); //Keep this console open...
        return 0;
    }

The CUDA version is essentially an exact copy of the C program (note that it is compiled with -arch=sm_20, so printf and other normally host-only functions are available in device code).

My goal is to compute combinations of a-z, A-Z, 0-9, and other characters, with maximum lengths up to 10. That being the case, I want this program to run on my GPU. As it stands, it does not take advantage of parallel processing at all - which defeats the whole purpose of writing the program in CUDA. However, I’m not sure how to remove the recursive nature of the program, or how to assign each thread a specific index or starting point.

Any constructive input is appreciated.

Also, I get an occasional warning on successive compiles (it appears sporadically): warning : Stack size for entry function '_Z8length_di' cannot be statically determined.

I haven’t pinpointed the problem yet, but I figured I would post it in case anyone identified the cause before I can. It is being compiled in visual studio 2012.

Note: I found this to be fairly interesting. As the CUDA program stands, its console output is periodic - it prints a few dozen combinations, pauses, prints a few dozen more, and so forth. I see the same behavior in its reported GPU usage - it swings periodically from 5% to 100%.

I’m not certain why my code is formatted the way it appears - when I go to edit it, the tabs are as they should be. I hope it won’t be an eyesore…

To preserve formatting, enclose code within code tags, i.e. [ code ] [ /code ] (remove the spaces).

One approach to enumerating permutations in lexicographic order is the factoradic (factorial number system). This makes it possible to generate the kth permutation directly, and thus is easily parallelized. There could well be more efficient ways to generate permutations in parallel (this is not my area of expertise).

If you search for “combinadic” in Wikipedia, you will find an article that describes a similar number system, but for combinations rather than permutations. That could be a starting point for the parallel generation of combinations, but I have not taken a closer look.

http://en.wikipedia.org/wiki/Combinadic

I originally started working on a factoradic method, but I did not need permutations. However, I wasn’t aware of the other method. I’ll check it out. Thanks.

I have posted my code for exactly these two things more than once:

combinations:

https://github.com/OlegKonings/CUDA_ALL_SUBSETS

permutations:

https://github.com/OlegKonings/CUDA_next_permutation

The approach you are trying to use in your device functions is not going to work.

Think BFS, not DFS. Recursion really is not the way to use device functions; think iterative.

That is exactly my point - GPUs aren’t designed with recursion in mind. I will take a look at your code. Thanks.