Hello,
I would like to use CUDA for enumerating combinations and permutations. Is there any sample code, or a paper on parallel algorithms for enumerating combinations?
Many thanks
Mirko
I am also interested in this topic. Has someone already implemented a backtracking algorithm for this kind of problem?
Thanks.
I have solved this problem and will post the code later today.
It required me to refresh my math skills first :)
This book gave me great insight and led to the solution.
The following code is the essence of the calculation - the kernel and two helper functions:
BinCoef - calculates the binomial coefficient
NUnrank - calculates the i-th combination of N over K
Kernel parameters:
int nbets, SBet *pbets, int nlegids, SLegIDs *plegids, SLegIDs *output, int n, int k, char *pComb
Ignore all parameters except:
int n
int k
char *pComb
The other parameters are specific to my project - feel free to ignore them :)
n and k are the N and K of "N over K".
pComb is the output array for all combinations.
The size of pComb is total_number_of_combinations_N_over_K * K.
Each generated combination is therefore K bytes long, and each byte holds the index of one element of the combination.
I.e. for 100 over 5, the total number of combinations is 75,287,520, so pComb takes 75287520 * 5 = 376437600 bytes (~359 MB).
#include "betcalc_kernel.cuh"

// binomial coefficient C(n, r)
__inline__ __device__ unsigned long long BinCoef(int n, int r)
{
    int i;
    unsigned long long b;

    if ((r < 0) || (n < r)) return 0;
    if ((2 * r) > n) r = n - r;     // use symmetry C(n, r) == C(n, n - r)
    b = 1;
    if (r > 0)
        for (i = 0; i <= (r - 1); i = i + 1)
            b = (b * (n - i)) / (i + 1);
    return b;
}
// unrank: compute the r-th combination (0-based rank) of n over k;
// the k chosen elements (1-based) are written to T
__device__ int NUnrank(unsigned long long r, char *T, int n, int k)
{
    int x, i;
    unsigned long long y;

    x = 1;
    for (i = 1; i <= k; i = i + 1)
    {
        y = BinCoef(n - x, k - i);
        while (y <= r)
        {
            r = r - y;
            x = x + 1;
            if (x > n)
                return 0;           // rank out of range
            y = BinCoef(n - x, k - i);
        }
        T[i - 1] = x;
        x = x + 1;
    }
    return 1;
}
__global__ void
betcalc_kernel_1(int nbets, SBet *pbets, int nlegids, SLegIDs *plegids, SLegIDs *output, int n, int k, char *pComb)
{
    int nTotalThreads = gridDim.x * gridDim.y * blockDim.x * blockDim.y;
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long ncomb = BinCoef(n, k);

    // each thread generates a fixed number of combinations (a chunk); since
    // this is integer division, add 1 so no combinations are left uncovered
    int chunk = ncomb / nTotalThreads + 1;

    // each thread stores its data in pComb starting from its own offset; each
    // combination is k bytes, so the total offset is k * chunk * tid
    char *comb = pComb + tid * k * chunk;

    // generate all combinations in the given range
    for (int i = tid * chunk; i < (chunk * (tid + 1)) && i < ncomb; i++)
    {
        // generate the i-th combination of n over k and store it in pComb
        NUnrank(i, comb, n, k);
        // advance the offset by k bytes for the next combination
        comb += k;
    }
    // wait for all threads in the block to finish
    __syncthreads();
}
Now a few results for 100 over 5:
TESLA C1060
Data preparation time: 60.769 ms
Processing time: 0.076 ms
Fetching results time: 26355.969 ms
FERMI GTX 470
Data preparation time: 182.569 ms
Processing time: 0.074 ms
Fetching results time: 13116.468 ms
Explanation:
Data preparation phase - allocation and transfer of all data to the device
Processing phase - the actual computation
Fetching results phase - transfer from GPU to HOST (pComb to HOST)
What bothers me is that the GTX 470 is three times slower in the HOST to GPU transfer, while the Tesla C1060 is two times slower in the GPU to HOST transfer.
The computation time is almost identical.
Can someone comment on these results? I would also like someone to try this on the C2xxx Teslas.
The complete project (VS 2010) is attached to this post.
I hope this post will be helpful to someone :)
best regards
Mirko
[This file was removed because it was flagged as potentially malicious] (371 KB)