Bad performance using CUSP conjugate gradient...

Hi All,

I’m trying to use CUSP for the CG solver in my circuit simulation program to accelerate it, but I actually got slower speeds than with my single-threaded CG code on the CPU.

My program uses Newton-Raphson (NR) to solve nonlinear circuit networks. In each NR iteration a linear equation system, which is symmetric positive definite, is generated and solved by CG. I use the COO format (double precision) to store the system.

I’m using an i3 2.6 GHz CPU with a GTX 760 GPU (compute capability 3.0), running CUDA 6.0 and CUSP 0.4, with Visual Studio 2010 as the environment. What I’m doing is, in each NR iteration, push the COO matrix A as well as the x and b vectors to device memory, call the CUSP CG function, and pull the x vector back to host memory. The average number of NR iterations is 6-8. The CUSP CG was approx. 10 times slower than my single-threaded CG code.
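In rough terms, each NR step does the following (a minimal sketch rather than my actual code; build_jacobian() and build_rhs() stand in for my assembly routines):

#include <cusp/coo_matrix.h>
#include <cusp/array1d.h>
#include <cusp/krylov/cg.h>

// assemble the SPD system on the host for this NR step
cusp::coo_matrix<int, double, cusp::host_memory> A_host = build_jacobian();
cusp::array1d<double, cusp::host_memory>         b_host = build_rhs();

// push to device memory
cusp::coo_matrix<int, double, cusp::device_memory> A = A_host;
cusp::array1d<double, cusp::device_memory>          b = b_host;
cusp::array1d<double, cusp::device_memory>          x(A.num_rows, 0);

// solve with CUSP's CG
cusp::krylov::cg(A, x, b);

// pull the solution back to the host
cusp::array1d<double, cusp::host_memory> x_host = x;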

There are several issues that I’d like to discuss:

  1. I selected CUDA Project in VS2010 - do I need to change or add any options when compiling? Is there any chance that the code is not actually executed in parallel (say, the GPU just acts as a slow CPU…)?
  2. Double precision may be an important factor limiting the speed - but should the GPU still be faster than a single CPU core?
  3. I do host-device-host memory transfers in every NR iteration - could that overhead be a serious problem?

Many thanks,

Using 64-bit (double precision) arithmetic will definitely slow things down on that lower-end consumer GPU; try 32-bit.
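In CUSP the value type is just a template parameter, so trying single precision is a small change, something like this (a sketch, assuming your containers are parameterized on a single ValueType typedef):

#include <cusp/coo_matrix.h>
#include <cusp/array1d.h>

typedef float ValueType;   // was double; FP32 is far faster than FP64 on a GTX 760

cusp::coo_matrix<int, ValueType, cusp::device_memory> A;
cusp::array1d<ValueType, cusp::device_memory>         x, b;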

Also, if you need a sparse library, use cuSPARSE if possible; I bet it would be faster and it offers more options for the implementation:

http://docs.nvidia.com/cuda/cusparse/index.html#topic_11_14
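For reference, a single SpMV y = alpha*A*x + beta*y through the legacy CSR routine that ships with CUDA 6.x looks roughly like the sketch below (error checking omitted; m, n, nnz and the device arrays d_val, d_rowPtr, d_colInd, d_x, d_y are assumed to be set up already; newer CUDA releases replace this with the generic cusparseSpMV API):

#include <cusparse_v2.h>

cusparseHandle_t handle;
cusparseCreate(&handle);

cusparseMatDescr_t descr;
cusparseCreateMatDescr(&descr);
cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

const double alpha = 1.0, beta = 0.0;

// y = alpha * A * x + beta * y, with A stored in CSR (m x n, nnz nonzeros)
cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
               m, n, nnz, &alpha, descr,
               d_val, d_rowPtr, d_colInd,
               d_x, &beta, d_y);

cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);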

1. I selected CUDA Project in VS2010 - do I need to change or add any options when compiling? Is there any chance that the code is not actually executed in parallel (say, the GPU just acts as a slow CPU…)?

You can set the -use_fast_math flag to see if that helps. Also make sure that the -G and -g flags are NOT set (Visual Studio turns them on by default in Debug builds), since device-side debug builds disable optimization and run much slower.

2. Double precision may be an important factor limiting the speed - but should the GPU still be faster than a single CPU core?

On that GPU, yes, double precision is a significant limitation - consumer Kepler cards like the GTX 760 have very low double-precision throughput. On a Titan or a Tesla it would matter much less.

3. I do host-device-host memory transfers in every NR iteration - could that overhead be a serious problem?

Yes, particularly if you have an x8 link or PCIe 2.0 bus speeds. Check the bandwidth test in the CUDA SDK samples; for that GPU it should be around 10-12 GB/s in each direction.
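If you would rather check it from your own code than run the SDK sample, a rough host-to-device measurement with pinned memory looks like this (a sketch; error checking omitted):

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 256 * 1024 * 1024;          // 256 MB test buffer
    void *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);                   // pinned host memory
    cudaMalloc(&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D bandwidth: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}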

Conjugate gradient, as an iterative solution method, usually implies an iterative descent towards the solution. The number of iterations required will have a significant impact on the value of the GPU. If your solve requires ~75 or more iterations to converge, then the cost of transferring the A matrix prior to each solve will be amortized over a relatively large number of calculation iterations. If, on the other hand, it requires only a few iterations to “converge”, then the GPU will offer less benefit (due to the “cost” of transferring the data on each solution cycle). Using a library like CUSPARSE, the number of iterations would be obvious. Using CUSP (i.e. the solver in CUSP) it is less so, as you usually specify a convergence criterion (and a maximum iteration count).
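If you want to see the iteration count with CUSP, you can pass a monitor into the solver; a sketch against the CUSP 0.4 API (the tolerance and iteration limit here are just example values):

#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <iostream>

// stop at a relative residual of 1e-6 or after 1000 iterations, whichever comes first
cusp::default_monitor<double> monitor(b, 1000, 1e-6);

cusp::krylov::cg(A, x, b, monitor);

std::cout << "CG converged in " << monitor.iteration_count() << " iterations" << std::endl;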

Any time you have an algorithm where you are transferring data to and from the GPU and repetitively performing the same processing step on the GPU, it suggests that you might want to move more of the work to the GPU if possible, so that the overall loop runs on the GPU, thus eliminating or reducing the data transfer requirements at each step. Comparing highly optimized GPU code (e.g. a library) against a highly optimized CPU solution of sparse CG (e.g. a multicore library, or OpenMP multicore), at best we usually see only a ~4x speedup, and that is when the number of CG descent iterations is “large”.
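In your case, if the sparsity pattern of A does not change between NR iterations (usually true for a circuit Jacobian), you can build the COO structure on the device once and only refresh the numerical values and right-hand side each step. A rough sketch (assemble_values() and assemble_rhs() are placeholders for your host-side assembly):

#include <cusp/coo_matrix.h>
#include <cusp/array1d.h>
#include <cusp/krylov/cg.h>
#include <thrust/copy.h>
#include <vector>

// ... A_device, b_device, x_device built once before the NR loop ...

for (int nr = 0; nr < max_nr_steps; ++nr) {
  // host-side assembly produces new values with the same sparsity pattern
  std::vector<double> new_values = assemble_values();   // placeholder
  std::vector<double> new_rhs    = assemble_rhs();      // placeholder

  // copy only the numbers that changed, not the row/column index arrays
  thrust::copy(new_values.begin(), new_values.end(), A_device.values.begin());
  thrust::copy(new_rhs.begin(),    new_rhs.end(),    b_device.begin());

  cusp::krylov::cg(A_device, x_device, b_device);

  // pulling x_device back is a single-vector transfer, cheap compared to
  // re-sending the whole COO matrix every NR step
  cusp::array1d<double, cusp::host_memory> x_host = x_device;
}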

Due to the nature of sparse CG, this speedup ratio is largely driven by the ratio of GPU to CPU memory bandwidth.

Thank you very much for the detailed reply and kind suggestions!

I am using the CUSP library for SpMV with the CSR, COO, ELL, HYB, and DIA formats, but it is even slower than the CPU using the SciPy library. I read an .mtx file (from the SuiteSparse Matrix Collection) into device memory and multiply it by a dense vector of ones. When I run nvidia-smi it always shows that only about 75 MB of GPU memory is in use, even with a 1 GB .mtx file.
The code I am using to perform the SpMV on the GPU with the CUSP library is:

#include <cusp/coo_matrix.h>
#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/multiply.h>
#include <cusp/io/matrix_market.h>

// read the mtx file into a COO matrix in device memory
cusp::coo_matrix<int, float, cusp::device_memory> coo_device;
cusp::io::read_matrix_market_file(coo_device, mtx_file);

// allocate storage for output Y (zeros) and input X (ones) on the device
cusp::array1d<float, cusp::device_memory> Y_device(coo_device.num_rows, 0);
cusp::array1d<float, cusp::device_memory> X_device(coo_device.num_cols, 1);

//-----------------COO format---------------------
if (strcmp("coo", format) == 0) {
  // warm-up call; synchronize so it is not included in the timed region
  cusp::multiply(coo_device, X_device, Y_device);
  cudaDeviceSynchronize();
  timer t;
  for (int i = 0; i < num_trials; i++)
    cusp::multiply(coo_device, X_device, Y_device);
  cudaDeviceSynchronize();   // kernels launch asynchronously; wait before stopping the timer
  time_ = t.seconds_elapsed() / num_trials;
}

else if (strcmp("csr", format) == 0) {
  // convert to csr format
  cusp::csr_matrix<int, float, cusp::device_memory> csr_device;
  try {
    csr_device = coo_device;                       // CUSP converts COO -> CSR on assignment
    // equivalently: cusp::convert(coo_device, csr_device);
  } catch (const cusp::format_conversion_exception&) {
    std::cout << "\tUnable to convert to CSR format" << std::endl;
    return -1;
  }

  // warm-up call; synchronize so it is not included in the timed region
  cusp::multiply(csr_device, X_device, Y_device);
  cudaDeviceSynchronize();
  timer t;
  for (int i = 0; i < num_trials; i++)
    cusp::multiply(csr_device, X_device, Y_device);
  cudaDeviceSynchronize();   // wait for the kernels before stopping the timer
  time_ = t.seconds_elapsed() / num_trials;
}