I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:

__global__ void Matrix(double* a, double* b, int row){
int i = blockIdx.x*blockDim.x+threadIdx.x;
while(i < N*N){
if(blockIdx.x==row){
a[i] = i+.77;
}
i += blockDim.x * N;
}
}

I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

I have an interesting problem. I am trying to implement bifactorization on the GPU, which requires me to multiply a large sparse matrix by a vector. The matrix, in this case, is around 6000x6000, and the vector is 6000 elements long. Where this gets a little interesting is: I have to step through the columns of the matrix from 0-6000. Each column must be multiplied by the vector, which changes one element of the vector (the element whose number corresponds to the matrix column). Since CUDA must be able to execute any thread block in any given order to achieve high optimization, I can’t really do this easily.

The only way I can think to do this is call the kernel with, say, 512 blocks and 1 thread per block. Terribly inefficient. In fact, this whole process is terribly inefficient, given that we can only multiply by one column of the matrix at a time.

Anyone, have a suggestion on how I should launch my kernel, as well as access just one column of the matrix at a time?

My current CUDA code looks like this:

__global__ void Matrix(double* a, double* b, int row){
int i = blockIdx.x*blockDim.x+threadIdx.x;
while(i < N*N){
if(blockIdx.x==row){
a[i] = i+.77;
}
i += blockDim.x * N;
}
}

I feel a bit ashamed for posting this code :-/, but it is the only way I can think of to step through the matrix column by column…

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

Unfortunately, no. After the algorithm is done multiplying the column by the vector, the new value is stored in the vector, and in turn the new value is used in the next iteration.

But you may have just given me an idea. The algorithm actually goes along a diagonal (first under the diagonal, and then above). I need to take another look - if the algorithm is working under the diagonal, then it really shouldn’t worry about the values higher up in the vector.

That may make absolutely no sense, so I apologize. I’ll let you know if I figure out the solution.

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?

After taking another look at the sequential algorithm, it doesn’t look like that will work. The algorithm multiplies the entire column by the entire vector and replaces the vector with the results. It then moves on to the next column of the matrix and multiples that column by the new vector.

So I guess we are back to the original question. How can I iterate through each column of the matrix in a somewhat efficient way?