Enumerating combinations

Hello,

I would like to use CUDA for enumerating combinations and permutations. Is there any sample code, or a paper on a parallel algorithm for enumerating combinations?
Many thanks

Mirko

I am also interested in this topic. Has anyone already implemented a backtracking algorithm for this kind of problem?

Thanks.

I have solved this problem. Will post code later today.

It required me to refresh my math skills first :)

This book gave me great insight and the solution.

The following code is the essence of the calculation - the kernel and two helper functions:

  1. BinCoef - calculates the binomial coefficient C(n, r)

  2. NUnrank - calculates the i-th combination of N over K (unranking)

Kernel parameters:

int nbets, SBet *pbets, int nlegids, SLegIDs *plegids, SLegIDs *output, int n, int k, char *pComb

Ignore all parameters except:

int n

int k

char *pComb

The other parameters are for my project - feel free to ignore them :)

n and k are the N and K in "N over K".

pComb is the output array for all combinations.

The size of pComb is total_number_of_combinations_N_over_K * K.

Each generated combination is therefore K bytes long; each byte is the index of one element of the combination.

I.e. for 100 over 5, the total number of combinations is 75,287,520, so the size of pComb is 75287520 * 5 = 376437600 bytes (~359 MB).
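The buffer-size arithmetic above can be checked on the host with a plain-C copy of the BinCoef logic (a sketch; the lowercase name bincoef is mine, none of the project types are needed):

    #include <stdio.h>

    /* host-side copy of the BinCoef logic used in the kernel */
    static unsigned long long bincoef(int n, int r)
    {
        if (r < 0 || n < r) return 0;
        if (2 * r > n) r = n - r;           /* symmetry: C(n,r) == C(n,n-r) */
        unsigned long long b = 1;
        for (int i = 0; i < r; i++)
            b = (b * (n - i)) / (i + 1);    /* exact: (i+1) divides the running product */
        return b;
    }

    int main(void)
    {
        int n = 100, k = 5;
        unsigned long long ncomb = bincoef(n, k);
        unsigned long long bytes = ncomb * k;   /* k bytes per combination */
        printf("C(%d,%d) = %llu, pComb size = %llu bytes (~%llu MB)\n",
               n, k, ncomb, bytes, bytes >> 20);
        return 0;
    }

This prints 75287520 combinations and 376437600 bytes, matching the figures above.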

#include "betcalc_kernel.cuh"

//binomial coefficient C(n, r)
__inline__ __device__ unsigned long long BinCoef(int n, int r)
{
  int i;
  unsigned long long b;

  if ((r < 0) || (n < r)) return 0;
  if ((2 * r) > n) r = n - r;   //use symmetry: C(n, r) == C(n, n - r)

  b = 1;
  if (r > 0)
    for (i = 0; i <= (r - 1); i = i + 1)
      b = (b * (n - i)) / (i + 1);

  return b;
}

//unranking: compute the combination of rank r (n over k) and store it in T
__device__ int NUnrank(unsigned long long r, char *T, int n, int k)
{
  int x, i;
  unsigned long long y;

  x = 1;
  for (i = 1; i <= k; i = i + 1)
  {
    y = BinCoef(n - x, k - i);
    while (y <= r)
    {
      r = r - y;
      x = x + 1;
      if (x > n)
        return 0;   //rank out of range
      y = BinCoef(n - x, k - i);
    }
    T[i - 1] = x;
    x = x + 1;
  }
  return 1;
}

__global__ void
betcalc_kernel_1(int nbets, SBet *pbets, int nlegids, SLegIDs *plegids, SLegIDs *output, int n, int k, char *pComb)
{
  int nTotalThreads = gridDim.x * gridDim.y * blockDim.x * blockDim.y;
  int tid = blockIdx.x * blockDim.x + threadIdx.x;

  unsigned long long ncomb = BinCoef(n, k);
  //each thread will generate some number of combinations (named chunk) - since this is integer division, I add 1 to fix the result
  int chunk = ncomb / nTotalThreads + 1;
  //each thread stores its data in pComb starting from its own offset; each combination is k bytes, so the total offset is k*chunk*tid
  char *comb = pComb + tid * k * chunk;

  //generate all combinations in the given range
  for (int i = tid * chunk; i < (chunk * (tid + 1)) && i < ncomb; i++)
  {
    //generate the i-th combination n over k and store it in pComb
    NUnrank(i, comb, n, k);
    //move the offset to the next combination - k bytes
    comb += k;
  }

  //wait for all threads in the block to finish
  __syncthreads();
}
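For anyone who wants to sanity-check the unranking logic without a GPU, here is the same algorithm as plain host C (a sketch; the lowercase names are mine, indices are 1-based exactly as in the kernel):

    #include <stdio.h>

    /* host-side copy of BinCoef */
    static unsigned long long bincoef(int n, int r)
    {
        if (r < 0 || n < r) return 0;
        if (2 * r > n) r = n - r;
        unsigned long long b = 1;
        for (int i = 0; i < r; i++)
            b = (b * (n - i)) / (i + 1);
        return b;
    }

    /* host-side copy of NUnrank: write the combination of rank r into T */
    static int nunrank(unsigned long long r, char *T, int n, int k)
    {
        int x = 1;
        for (int i = 1; i <= k; i++) {
            unsigned long long y = bincoef(n - x, k - i);
            while (y <= r) {
                r -= y;
                x += 1;
                if (x > n) return 0;          /* rank out of range */
                y = bincoef(n - x, k - i);
            }
            T[i - 1] = (char)x;
            x += 1;
        }
        return 1;
    }

    int main(void)
    {
        char T[3];
        /* enumerate all C(5,3) = 10 combinations */
        for (unsigned long long r = 0; r < bincoef(5, 3); r++) {
            nunrank(r, T, 5, 3);
            printf("%d %d %d\n", T[0], T[1], T[2]);
        }
        return 0;
    }

Rank 0 yields {1, 2, 3} and rank 9 yields {3, 4, 5}, i.e. the combinations come out in lexicographic order over 1-based element indices.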

Now a few results for 100 over 5:

TESLA C1060

Data preparation time: 60.769 ms

Processing time: 0.076 ms

Fetching results time: 26355.969 ms

FERMI GTX 470

Data preparation time: 182.569 ms

Processing time: 0.074 ms

Fetching results time: 13116.468 ms

explanation:

Data preparation phase - allocation of all data on the device

Processing phase - the actual computation

Fetching results phase - transfer from GPU to host (pComb to host)

What bothers me is that the GTX 470 is three times slower in the transfer from host to GPU (data preparation).

The Tesla C1060 is two times slower in the GPU-to-host transfer.

The computation time is almost identical on both.

Can someone comment on these results? I would also like someone to try this on the C2xxx Teslas.
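For what it's worth, the device-to-host throughput implied by the fetching times above is far below what PCIe can sustain, which suggests the fetching phase is measuring more than the raw copy. The buffer size and timings are taken straight from the post; the arithmetic is just a sketch:

    #include <stdio.h>

    int main(void)
    {
        double bytes   = 376437600.0;            /* pComb size for 100 over 5 */
        double tesla_s = 26355.969 / 1000.0;     /* fetching time, Tesla C1060 */
        double fermi_s = 13116.468 / 1000.0;     /* fetching time, GTX 470 */
        printf("Tesla C1060: %.1f MB/s\n", bytes / tesla_s / 1e6);
        printf("GTX 470:     %.1f MB/s\n", bytes / fermi_s / 1e6);
        return 0;
    }

That works out to roughly 14 MB/s on the C1060 and 29 MB/s on the GTX 470 - orders of magnitude below typical PCIe transfer rates, so the difference between the cards may not be about the copy itself.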

The complete project (VS 2010) is attached to this post.

I hope that this post will be helpful to someone :)

best regards

Mirko
[This file was removed because it was flagged as potentially malicious] (371 KB)