bsxfun : Any cuda implementations?

I’m wondering if anyone has coded bsxfun and wishes to share it :)

Basically, bsxfun is MATLAB's way of applying the same operation across multiple data sets - a generalization of diagonal-matrix operations. For example:

Adding/subtracting/multiplying/dividing/maximum/minimum of an m×1 vector with each column of an m×n matrix

Adding/subtracting/multiplying/dividing/maximum/minimum of a 1×n row vector with each row of an m×n matrix

There are obviously further functions that could be applied between the two elements - power, modulus, logical operations spring to mind - but that's the easy part. This is one of the basic building blocks of linear algebra, and it is particularly useful for sparse linear algebra (where a diagonal matrix is stored as a vector D - applying its inverse to some vector V is then simply V./D).
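For reference, here is a minimal sketch of what a hand-rolled bsxfun-style kernel might look like for the first case (adding an m×1 vector to every column of a column-major m×n matrix). The kernel name and the element-per-thread mapping are just illustrative, not an existing library routine:

__global__ void bsxfun_add_colvec(float *A, const float *v, int m, int n)
{
    int total  = m * n;
    int stride = gridDim.x * blockDim.x;
    /* grid-stride loop over all matrix elements */
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < total; i += stride) {
        int row = i % m;   /* column-major storage: row index = i mod m */
        A[i] += v[row];    /* v is broadcast across all n columns */
    }
}

Swapping the += for another element-wise operation (subtract, multiply, divide, fminf/fmaxf) covers the other cases.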

bump! I think this is a useful piece of code to get working

For adding/subtracting an m×1 vector to each column or row, BLAS2 GER (A = A + alpha·x·y') can be used. One of the two vectors has to be filled with ones. The only caveat is having to construct that vector of ones: unfortunately, standard GER does not support incx or incy = 0, which would have allowed storing just a single value 1 instead of a full vector.
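A rough sketch of that approach with cuBLAS (the wrapper name is hypothetical, and it assumes a device vector of ones has already been allocated and filled):

#include <cublas_v2.h>

/* Sketch: A (m x n, column-major, device) := A + x * ones^T,
   i.e. the m x 1 vector x is added to every column of A.
   dOnes is a device vector of length n that must already hold 1.0f in every entry. */
void add_vec_to_cols(cublasHandle_t handle, float *dA, int lda,
                     const float *dx, const float *dOnes, int m, int n)
{
    const float alpha = 1.0f;
    cublasSger(handle, m, n, &alpha, dx, 1, dOnes, 1, dA, lda);
}

Subtraction is the same call with alpha = -1.0f; adding a 1×n row vector to every row just swaps the roles of the two vectors (ones of length m as x, the row vector as y).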

For multiplying, cublas<t>dgmm (e.g. cublasSdgmm / cublasDdgmm) can be used to scale every column or row of a matrix.
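For example, a sketch of column scaling with cublasSdgmm (the wrapper function is hypothetical; CUBLAS_SIDE_RIGHT scales columns, CUBLAS_SIDE_LEFT scales rows):

#include <cublas_v2.h>

/* Sketch: C = A * diag(x), i.e. column j of A is scaled by x[j].
   A and C are m x n column-major device matrices; C may be the same pointer as A
   for an in-place update (with lda == ldc). */
void scale_cols(cublasHandle_t handle, const float *dA, int lda,
                const float *dx, float *dC, int ldc, int m, int n)
{
    cublasSdgmm(handle, CUBLAS_SIDE_RIGHT, m, n, dA, lda, dx, 1, dC, ldc);
}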

Dividing is probably the most important - it is used to multiply by the inverse of a diagonal matrix, for example. Any ideas on how you might do that? :)

To do the division, you can invert the vector first and then use cublas<t>dgmm with the inverted vector as the diagonal.

Here is some templated code to invert a vector:

template <typename T_ELEM>
__global__ void invert_vector(T_ELEM *Vector, int n)
{
    unsigned int i, tid, totalThreads, ctaStart;

    T_ELEM one = cuGet<T_ELEM>(1);

    tid = threadIdx.x;
    totalThreads = gridDim.x * blockDim.x;
    ctaStart = blockDim.x * blockIdx.x;

    /* grid-stride loop with increment equal to 1 */
    for (i = ctaStart + tid; i < n; i += totalThreads) {
        Vector[i] = cuDiv(one, Vector[i]);
    }
}

with:

#include <cuComplex.h>

/* various inline device/host functions to initialize a T_ELEM from an int */
template <typename T_ELEM> __inline__ __device__ __host__ T_ELEM cuGet(int);

template <> __inline__ __device__ __host__ float cuGet<float>(int x)
{
    return float(x);
}

template <> __inline__ __device__ __host__ double cuGet<double>(int x)
{
    return double(x);
}

template <> __inline__ __device__ __host__ cuComplex cuGet<cuComplex>(int x)
{
    return make_cuComplex(float(x), 0.0f);
}

template <> __inline__ __device__ __host__ cuDoubleComplex cuGet<cuDoubleComplex>(int x)
{
    return make_cuDoubleComplex(double(x), 0.0);
}

/* Division, overloaded per element type */
static __inline__ __device__ __host__ float cuDiv(float x, float y)
{
    return (x / y);
}

static __inline__ __device__ __host__ double cuDiv(double x, double y)
{
    return (x / y);
}

static __inline__ __device__ __host__ cuComplex cuDiv(cuComplex x, cuComplex y)
{
    return cuCdivf(x, y);
}

static __inline__ __device__ __host__ cuDoubleComplex cuDiv(cuDoubleComplex x, cuDoubleComplex y)
{
    return cuCdiv(x, y);
}
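Putting the two pieces together, one possible usage sketch for the division case (the function name and launch configuration are illustrative) is to invert the vector with the kernel above and then apply it as a row scaling with cublasSdgmm:

/* Sketch: A := diag(d)^-1 * A, i.e. row i of A is divided by d[i].
   dD (device, length m) is inverted in place by invert_vector, then applied
   as a row scaling with CUBLAS_SIDE_LEFT. dA is an m x n column-major device matrix. */
void divide_rows_by_vector(cublasHandle_t handle, float *dA, int lda,
                           float *dD, int m, int n)
{
    int threads = 256;
    int blocks  = (m + threads - 1) / threads;
    invert_vector<float><<<blocks, threads>>>(dD, m);          /* dD[i] = 1 / dD[i] */
    cublasSdgmm(handle, CUBLAS_SIDE_LEFT, m, n, dA, lda, dD, 1, dA, lda);
}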