Matrix help

fender177 · September 28, 2010, 1:06am

Hi there,

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:

__global__ void Matrix(double* a, double* b, int row){

	int i = blockIdx.x*blockDim.x+threadIdx.x;

	while(i < N*N){

		if(blockIdx.x==row){

			a[i] = i+.77;

		}

		i += blockDim.x * N;

	}

}

I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Thanks all

fender177 · September 28, 2010, 1:06am

Hi there,

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:

__global__ void Matrix(double* a, double* b, int row){

	int i = blockIdx.x*blockDim.x+threadIdx.x;

	while(i < N*N){

		if(blockIdx.x==row){

			a[i] = i+.77;

		}

		i += blockDim.x * N;

	}

}

I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Thanks all

kbam · September 28, 2010, 1:54am

Hi there,

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:
__global__ void Matrix(double* a, double* b, int row){

	int i = blockIdx.x*blockDim.x+threadIdx.x;

	while(i < N*N){

		if(blockIdx.x==row){

			a[i] = i+.77;

		}

		i += blockDim.x * N;

	}

}
I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Thanks all

Is it possible to have each thread of a block work on a different column ?

Then each thread is keeping track of value of its cell from the vector

NB on each iteration of the loop all threads move to the next row.

So code would be something like

read cell from vector
process cell in matrix
repeat 2) on subsequent rows until all rows done
save cell back to vector

Cheers

kbam

kbam · September 28, 2010, 1:54am

Hi there,

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:
__global__ void Matrix(double* a, double* b, int row){

	int i = blockIdx.x*blockDim.x+threadIdx.x;

	while(i < N*N){

		if(blockIdx.x==row){

			a[i] = i+.77;

		}

		i += blockDim.x * N;

	}

}
I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Thanks all

Is it possible to have each thread of a block work on a different column ?

Then each thread is keeping track of value of its cell from the vector

NB on each iteration of the loop all threads move to the next row.

So code would be something like

read cell from vector
process cell in matrix
repeat 2) on subsequent rows until all rows done
save cell back to vector

Cheers

kbam

kbam · September 28, 2010, 1:54am

Hi there,

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:
__global__ void Matrix(double* a, double* b, int row){

	int i = blockIdx.x*blockDim.x+threadIdx.x;

	while(i < N*N){

		if(blockIdx.x==row){

			a[i] = i+.77;

		}

		i += blockDim.x * N;

	}

}
I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Thanks all

Is it possible to have each thread of a block work on a different column ?

Then each thread is keeping track of value of its cell from the vector

NB on each iteration of the loop all threads move to the next row.

So code would be something like

read cell from vector
process cell in matrix
repeat 2) on subsequent rows until all rows done
save cell back to vector

Cheers

kbam

fender177 · September 28, 2010, 2:07am

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

Thanks for the quick reply kbam.

fender177 · September 28, 2010, 2:07am

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

Thanks for the quick reply kbam.

fender177 · September 28, 2010, 2:07am

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

Thanks for the quick reply kbam.

fender177 · September 28, 2010, 2:24am

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?

fender177 · September 28, 2010, 2:24am

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?

fender177 · September 28, 2010, 2:24am

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?

kbam · September 28, 2010, 6:03am

Is the bifactorizing the only way to get the results you want ?

or is it a technique normally employed to speed up a process that a GPU could do very rapidly some other way

kbam · September 28, 2010, 6:03am

Is the bifactorizing the only way to get the results you want ?

or is it a technique normally employed to speed up a process that a GPU could do very rapidly some other way

kbam · September 28, 2010, 6:03am

Is the bifactorizing the only way to get the results you want ?

or is it a technique normally employed to speed up a process that a GPU could do very rapidly some other way

Topic		Replies	Views
Using atomicAdd to step through an array CUDA Programming and Performance	7	4006	May 24, 2011
Any ideas on GF(2) sparse matrix-vector multiply? CUDA Programming and Performance	5	5939	August 7, 2008
matrix multiply reduction CUDA Programming and Performance	41	35830	January 15, 2011
Matrix Partitioning how would you implement it? Sparse matrix partitioning CUDA Programming and Performance	16	14784	March 30, 2010
Sparse Matrix -> Vector CUDA Programming and Performance	7	1301	January 28, 2011
Sparse Matrix-Matrix Multiplication CUDA Programming and Performance	14	18319	June 15, 2010
Kernel with a for-loop over matrix columns results in worse bandwidth for longer ranges? CUDA Programming and Performance	4	726	April 27, 2021
Subtle problem with matrix-matrix multiplication CUDA Programming and Performance	10	1338	December 12, 2020
Sparse Matrix-Vector Multiplication on CUDA CUDA Programming and Performance	79	314118	November 22, 2010
matrix multiplication CUDA Programming and Performance	10	3947	March 7, 2010

Matrix help

Related topics