I’m trying to use CUSP to accelerate the CG solver in my circuit simulation program, but the GPU version actually runs slower than my single-threaded CPU CG code.
My program uses Newton-Raphson (NR) iteration to solve nonlinear circuit networks. In each NR iteration a symmetric positive definite linear system is generated and then solved by CG. I store the system in COO format (double precision).
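For reference, a plain single-threaded CG on a COO matrix might look like the following. This is a minimal sketch, not the code from this post: the `CooMatrix` struct, the `spmv`/`dot`/`cg` functions and the tolerance/iteration defaults are hypothetical, no preconditioner is used, and a C++11 compiler is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical COO container: one (row, col, value) triple per stored nonzero.
struct CooMatrix {
    int n = 0;                  // the matrix is n x n
    std::vector<int> row, col;  // 0-based indices
    std::vector<double> val;    // nonzero values
};

// y = A * x. Duplicate (row, col) entries are summed, as is usual for COO.
void spmv(const CooMatrix& A, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == static_cast<std::size_t>(A.n));
    y.assign(A.n, 0.0);
    for (std::size_t k = 0; k < A.val.size(); ++k)
        y[A.row[k]] += A.val[k] * x[A.col[k]];
}

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Unpreconditioned CG for a symmetric positive definite A.
// x is the initial guess on entry and the approximate solution on exit.
// Stops when ||r|| <= rel_tol * ||b|| or after max_iter iterations;
// returns the number of iterations performed.
int cg(const CooMatrix& A, const std::vector<double>& b, std::vector<double>& x,
       double rel_tol = 1e-10, int max_iter = 1000) {
    std::vector<double> r(A.n), Ap;
    spmv(A, x, Ap);
    for (int i = 0; i < A.n; ++i) r[i] = b[i] - Ap[i];  // r = b - A x
    std::vector<double> p = r;                          // first search direction
    double rr = dot(r, r);
    const double stop = rel_tol * std::sqrt(dot(b, b));
    int it = 0;
    while (it < max_iter && std::sqrt(rr) > stop) {
        spmv(A, p, Ap);
        const double alpha = rr / dot(p, Ap);           // step length along p
        for (int i = 0; i < A.n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;                // keeps directions A-conjugate
        for (int i = 0; i < A.n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++it;
    }
    return it;
}
```

Each CG iteration here costs one sparse matrix-vector product plus a few dense vector operations, which is the work a GPU library would be parallelizing.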
I’m using a 2.6 GHz Core i3 CPU with a GTX 760 GPU (compute capability 3.0), running CUDA 6.0 and CUSP 0.4, with Visual Studio 2010 as the development environment. In each NR iteration I copy the COO matrix A and the x and b vectors to device memory, call the CUSP CG function, and copy the x vector back to host memory. The average number of NR iterations is 6-8. Overall, CUSP's CG is roughly 10 times slower than my single-threaded CG code.
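To put a number on the per-iteration copies described above, the PCIe traffic per NR step can be estimated from the matrix size alone. This is a hypothetical back-of-the-envelope helper (`estimate_transfer` and `transfer_seconds` are illustrative names), assuming 32-bit indices, double-precision values, and that A, x and b are uploaded and x is downloaded every step.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate of PCIe traffic per NR step for the workflow
// "upload A (COO), x and b; run CG; download x".
// Assumes 32-bit indices and double-precision values by default.
struct TransferEstimate {
    std::uint64_t host_to_device;  // bytes
    std::uint64_t device_to_host;  // bytes
};

TransferEstimate estimate_transfer(std::uint64_t n, std::uint64_t nnz,
                                   std::uint64_t index_bytes = 4,
                                   std::uint64_t value_bytes = 8) {
    const std::uint64_t coo = nnz * (2 * index_bytes + value_bytes);  // row, col, value arrays
    const std::uint64_t vec = n * value_bytes;                        // one dense vector
    return {coo + 2 * vec, vec};  // A, x and b go up; only x comes back
}

// Seconds spent on those transfers at an assumed effective bandwidth in bytes/s.
double transfer_seconds(const TransferEstimate& t, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return static_cast<double>(t.host_to_device + t.device_to_host) / bytes_per_second;
}
```

For example, with a hypothetical n = 100000 and nnz = 700000, `estimate_transfer` gives 12,800,000 bytes up and 800,000 bytes down per NR step; at an assumed 6 GB/s that is about 2.3 ms, to be multiplied by the 6-8 NR iterations and compared with the measured CG time. The 6 GB/s figure is only a placeholder; the `bandwidthTest` sample shipped with the CUDA toolkit reports the actual host-device bandwidth of a given machine.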
There are several issues that I’d like to discuss:
- I created the project as a CUDA project in VS2010. Do I need to change or add any compiler options? Is there any chance that the code is not actually executing in parallel (i.e., the GPU just behaves like a slow CPU)?
- Double precision may be an important factor limiting the speed, but shouldn't the GPU still be faster than a single CPU thread?
- I copy data from host to device and back in every NR iteration. Could this overhead be a serious problem?
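One way to check the transfer-overhead question empirically is to time the upload, solve and download phases separately over the whole NR loop. This is a minimal sketch: `PhaseTimer` and its methods are hypothetical names, the lambda bodies in the usage comment are placeholders, and a compiler with C++11 `<chrono>` support is assumed (Visual Studio 2010 may not provide it, in which case a platform timer such as `QueryPerformanceCounter` would be needed).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: accumulates wall-clock time per named phase over all NR
// steps, so upload / solve / download can be compared directly.
// CUDA kernel launches return before the kernel finishes. If a phase ends with
// asynchronous GPU work, call cudaDeviceSynchronize() at the end of that phase,
// otherwise its GPU time is charged to whichever phase synchronizes next.
class PhaseTimer {
public:
    template <class F>
    void measure(const std::string& name, F&& work) {
        assert(!name.empty() && "phase name must not be empty");
        const auto t0 = std::chrono::steady_clock::now();
        std::forward<F>(work)();
        const auto t1 = std::chrono::steady_clock::now();
        totals_[index_of(name)].second += std::chrono::duration<double>(t1 - t0).count();
    }

    // Total seconds recorded for a phase (0 if it was never measured).
    double total(const std::string& name) const {
        for (const auto& entry : totals_)
            if (entry.first == name) return entry.second;
        return 0.0;
    }

    // Prints phases in the order they were first measured.
    void report() const {
        for (const auto& entry : totals_)
            std::printf("%-10s %.6f s\n", entry.first.c_str(), entry.second);
    }

private:
    std::size_t index_of(const std::string& name) {
        for (std::size_t i = 0; i < totals_.size(); ++i)
            if (totals_[i].first == name) return i;
        totals_.emplace_back(name, 0.0);
        return totals_.size() - 1;
    }

    std::vector<std::pair<std::string, double>> totals_;
};

// Usage sketch; the lambda bodies stand in for your own code:
//   PhaseTimer timer;
//   for (int step = 0; step < nr_steps; ++step) {
//       timer.measure("upload",   [&] { /* copy A, x, b to the device */ });
//       timer.measure("solve",    [&] { /* call the CUSP CG solver */ });
//       timer.measure("download", [&] { /* copy x back to the host */ });
//   }
//   timer.report();
```

If "upload" and "download" together are comparable to "solve", the copies matter; if "solve" alone is already slower than the CPU solver, the transfers are not the main cause.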