Iterative Matrix Methods

I was responding to another post earlier this morning and mentioned that the poster might try an iterative matrix method on the GPU. I use these a fair amount in my work, so it would be neat if I could program something like this in CUDA. However, I did have one question about the speedup…

After every iteration of the matrix multiplication, I’d need to check the difference between the current and previous result vectors/matrices to see if another iteration should be performed. I imagine that in order to do that, I’d need to copy some data back to the host, check it, then (potentially) run the kernel again.

So, my question is, would having to copy a bit of data back to the host between iterations kill any/most of the speedup from moving the code to the GPU? Or would the parallelism speedup of the multiplications outweigh the cost of the transfers/checks?


You can avoid the expensive copies of the whole matrix by doing the comparison on the GPU and just copying a single float back to the host for flow control decisions.


I made a big-integer library for RSA with CUDA.

In the iterative step, it copies the big integer back to the host to do the work…

I have been interested in a bignum lib for CUDA for a while… and an RSA implementation…

Would you make the code freely available?

I have visited your page searching for clues about it, and maybe it's described here:

X.-W. Chu, K.-Y Zhao, and M. Wang, “Massively Parallel Network Coding on GPUs,” IEEE IPCCC’08, Dec 2008.

But while the IPCCC'08 paper hasn't been posted on ACM or IEEE Xplore, can you post some info about the speedup versus, say, OpenSSL or other fast CPU implementations…


I got a couple of good algorithms books today from the engineering library here at the university…if I have some time over the holidays, I’ll see if I can write a few small routines to solve some basic iteration problems and find out where the bottlenecks (if any) may be.

I have implemented a preconditioned CG solver on GPUs for sparse matrices and get a speedup of about 5× over my CPU implementation (which is not fully optimized). I have to read a double back from the GPU to check the error term each iteration and then decide whether to proceed or not. Once the matrix size is reasonably large, the transfer time accounts for very little of the total time (< 5%).

Just the dCG timing (without the preconditioner and the matrix-vector product) gives me ca. 50× speedup when I compare an NVIDIA 260 and a Xeon Quad Core 2.66 with 666 FSB. On the Xeon we are using ATLAS, which is nearly the fastest BLAS implementation for CPUs.

With different preconditioners and matrices the speedup can differ significantly in both directions: the maximum speedup I have seen was about 80× (a very sparse band matrix from a Maxwell 3D tensor finite-difference discretization), and the minimum was about 7× (hierarchical preconditioners with double precision). Our results are already on the first page of CUDA Zone (Fast iterative solvers).



This paper contains a little information about the bignum lib.

We have written other papers on the bignum lib. We submitted one of them to IPDPS 2009, but it was rejected (with two strong accepts and two weak rejects). That paper covers three topics, so… Maybe I'm new to academia; I need to do more work…

Anyway, the bignum lib is faster than GNU MP in some cases.

You're welcome to discuss the bignum lib with me. I want to enhance it and publish it next year.