I am a newbie to CUDA. Here is a problem I ran into recently.
I have a vector v and a matrix m in device memory. The length of v is dim, and m is a square dim x dim matrix.
Here is the pseudocode I want to implement with CUDA:
[codebox]for(int i=0; i<dim; ++i) {
    if(v[i] == 0) {
        set the column and row i in matrix m to 0 except the diagonal element.
    }
}[/codebox]
Here is the CUDA code I wrote to implement this:
[codebox]__global__ void zero_row(float* matrix, int dim, int i)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    if(x >= dim)
        return;
    matrix[i+x*dim] = 0;
}

__global__ void zero_column(float* matrix, int dim, int i)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    if(x >= dim)
        return;
    matrix[i*dim+x] = 0;
}

__global__ void make_matrix(const float* v, int dim, float* matrix)
{
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    if(x >= dim)
        return;
    if(v[x] == 0) {
        float temp = matrix[x*dim+x];  // save the diagonal element
        // The two nested kernel launches below are what the compiler rejects.
        zero_row<<<(dim+nthreads-1)/nthreads, nthreads>>>(matrix, dim, x);
        zero_column<<<(dim+nthreads-1)/nthreads, nthreads>>>(matrix, dim, x);
        matrix[x*dim+x] = temp;        // restore the diagonal element
    }
}

// Host-side launch
make_matrix<<<(dim+nthreads-1)/nthreads, nthreads>>>(v, dim, matrix);
[/codebox]
The compiler gave me an error saying:
“calling a global function from a global function is not allowed”.
One way to work around this is to make zero_row() and zero_column() device functions,
but then I will not be able to set the rows and columns in parallel. Could someone give some suggestions
about how to solve this problem? Thanks a lot!
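
For what it's worth, one possible direction that avoids the nested launch altogether is to flip the problem around: instead of one thread per entry of v, use one thread per matrix element and let each thread decide for itself whether its element should be cleared. The following is only a minimal sketch under some assumptions: the matrix is stored row-major, the kernel name zero_rows_and_columns and the host helper launch_zero_rows_and_columns are made up for illustration, and the 16x16 block shape is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// One thread per matrix element (row-major layout assumed).
// An off-diagonal element (row, col) is cleared when v[row] == 0 (its row is
// being zeroed) or v[col] == 0 (its column is being zeroed). Diagonal elements
// are never written, so there is nothing to save and restore.
__global__ void zero_rows_and_columns(const float* v, int dim, float* matrix)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row >= dim || col >= dim)
        return;
    if (row != col && (v[row] == 0.0f || v[col] == 0.0f))
        matrix[(size_t)row * dim + col] = 0.0f;
}

// Hypothetical host-side helper; d_v and d_matrix must be device pointers.
cudaError_t launch_zero_rows_and_columns(const float* d_v, int dim, float* d_matrix)
{
    dim3 block(16, 16);  // assumed block shape, not tuned
    dim3 grid((dim + block.x - 1) / block.x,
              (dim + block.y - 1) / block.y);
    zero_rows_and_columns<<<grid, block>>>(d_v, dim, d_matrix);
    return cudaGetLastError();  // reports launch errors only
}
```

Since every thread writes at most its own element and only reads v, no two threads touch the same location, and all rows and columns are handled in a single parallel pass. For dim = 3 and v = {1, 0, 2}, a matrix holding 1..9 in row-major order would become {1, 0, 3, 0, 5, 0, 7, 0, 9}.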