binomial coefficient (function or formula or CUBLAS)

Hi people, does there exist in CUDA (or in the CUBLAS library) a function or formula for calculating the binomial coefficient? I need to calculate the binomial coefficient of N choose 2. I have only the CPU version:

int coefficienteBinomiale(int n, int k) {
	if (k == 0 || k == n)
		return 1;
	return coefficienteBinomiale(n - 1, k - 1) + coefficienteBinomiale(n - 1, k);
}

but not one for CUDA :(. Thanks a lot!

What would be your application for this formula? Do you want the coefficient to be calculated per thread or by a pack of them?

Regards,

MK

Whatever is best. Example: N=20 -> Result=190, i.e. N!/(2!*(N-2)!)
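As a side note, for k fixed at 2 the factorial formula collapses, because the (N-2)! cancels: C(N, 2) = N * (N - 1) / 2. A minimal host-side sketch in plain C (the function name is my own):

```c
#include <assert.h>

/* Closed form for C(N, 2) = N! / (2! * (N-2)!): the (N-2)! cancels,
   leaving N * (N - 1) / 2 - no factorials or recursion needed. */
int binomialN2(int n)
{
    return n * (n - 1) / 2;
}
```

For example, binomialN2(20) gives 190, matching the example above.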

Hope the code below will be helpful in your case:

__device__ int fastCoeffBinomialN2(int N, int *cache)	// initialize the 'cache' of arbitrary length with zeros
{
	//cache[0] = 0; // this can be done once, outside the function
	//cache[1] = 0;
	//cache[2] = 1; // this is essential - the method will not work if this is not set
	if (cache[N] > 0) return cache[N];	// this hits when the function has already been called for this N
	return (N > 2) ? (cache[N] = (N - 1) + cache[N - 1]) : cache[N];
}

The function is specialized for calculating the coefficients of N choose 2 only. It uses caching to store intermediate results and execute faster.
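The recurrence the cache is built on is Pascal's rule restricted to k = 2: C(N, 2) = C(N-1, 2) + (N - 1). A quick host-side check of that identity in plain C (the helper names here are my own, not part of the function above):

```c
#include <assert.h>

/* Closed form used only to verify the identity below. */
int closedFormN2(int n)
{
    return n * (n - 1) / 2;
}

/* Check Pascal's rule for k = 2, C(N,2) = C(N-1,2) + (N-1),
   for all N up to maxN. Returns 1 if it holds, 0 otherwise. */
int identityHolds(int maxN)
{
    int n;
    for (n = 3; n <= maxN; ++n)
        if (closedFormN2(n) != closedFormN2(n - 1) + (n - 1))
            return 0;
    return 1;
}
```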

Regards,

MK

Sorry, but I don’t understand what the array *cache is. Also, must the function be called by N threads? Thanks a lot!

The function is device code, so it can be called from whatever number of threads you set up. The formula is simple enough that one thread can be used for a single coefficient calculation.

The ‘cache’ is an array of ints initialized with zeros. It can be separate for each thread or global/shared (note the race-condition possibilities in such cases). It stores coefficients from Pascal’s triangle for the second binomial parameter fixed at 2.

The ‘N’ refers to the N from the binomial (N choose 2).

Also, an essential thing is to initialize cache[2] with the value 1 (I forgot to finish the comment). The initialization can be the first action done within the kernel or, in case of the cache being a global array, done in host code - set the host array ‘c’ to zeros and ‘c[2]’ to 1, then pass it as a kernel parameter:

some_kernel_function<<<nblocks, number_of_threads_not_need_to_be_equal_to_N>>>(c);

and do Your calculations using function ‘fastCoeffBinomialN2’.

Regards,

MK

P.S.

Of course I forgot to add that the code of ‘fastCoeffBinomialN2’ covers only a SINGLE ITERATION of the method. Sorry for misleading you. The correct code of the function is as follows:

__device__ int fastNewtonBinomialN2(int N, int cache[])
{
	int i = 2;
	while (i <= N) {
		if (cache[N] > 0) return cache[N];
		cache[i] = (i - 1) + cache[i - 1];
		++i;
	}
	return cache[N];
}
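For reference, the corrected function can be tried out on the host by dropping the __device__ qualifier. A plain C sketch, with the zero-init and the essential cache[2] = 1 seed from the earlier post (initCache is a helper name of my own):

```c
#include <assert.h>
#include <string.h>

/* Host-side copy of the corrected fastNewtonBinomialN2
   (the __device__ qualifier dropped so it compiles as plain C). */
int fastNewtonBinomialN2(int N, int cache[])
{
    int i = 2;
    while (i <= N) {
        if (cache[N] > 0) return cache[N];    /* already computed: cache hit */
        cache[i] = (i - 1) + cache[i - 1];    /* C(i,2) = C(i-1,2) + (i-1) */
        ++i;
    }
    return cache[N];
}

/* Prepare a zeroed cache with the essential seed cache[2] = 1. */
void initCache(int cache[], int len)
{
    memset(cache, 0, (size_t)len * sizeof cache[0]);
    cache[2] = 1;
}
```

After the first call for a given N, cache[i] holds C(i, 2) for every i up to N, so any later call for N (or smaller) returns immediately from the cache.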

Bless you!

But, in your opinion, is this solution faster than the CPU one?

First of all, my method does not use recursion. Second, it calculates the coefficients only for N choose 2, not the whole Pascal’s triangle. Furthermore, once you have called the function for a given N, calling it again returns the result immediately from the cache. Thus it should be faster, but I have not tried it out.

Another thing that comes to my mind is that you could fill the cache with the coefficient values up to an arbitrary maximum N. Then the only thing you need to do within the kernel is read the cache for the given N. The values of the binomial for the case you are interested in are known, and thus can be passed as a kernel parameter rather than calculated online. The new ‘fastNewtonBinomialN2’ would look like this:

__device__ int fastNewtonBinomialN2(int N, int cache[])
{
	// the given cache is an array as follows: int cache[] = { 1, 3, 6, 10, 15, ... and so on
	return (N >= 2) ? cache[N - 2] : 0; // correct me if I'm wrong, please...
}

This should be much faster.
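A host-side sketch of this lookup variant in plain C (fillTable is a helper name of my own, and the const qualifier is an addition; the table would be filled once on the host and then copied to the device):

```c
#include <assert.h>

/* Lookup variant: 'cache' is a precomputed table { 1, 3, 6, 10, 15, ... }
   where cache[i] holds C(i + 2, 2). */
int fastNewtonBinomialN2(int N, const int cache[])
{
    return (N >= 2) ? cache[N - 2] : 0;
}

/* Fill the table once on the host with the closed form n*(n-1)/2;
   in the real application it would then be copied to device memory. */
void fillTable(int table[], int maxN)
{
    int n;
    for (n = 2; n <= maxN; ++n)
        table[n - 2] = n * (n - 1) / 2;
}
```

Inside the kernel this reduces every coefficient to a single array read, which is why it should be much faster than computing online.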

Regards,

MK