Why does the CULA Sparse solver run much faster on the CPU than on the GPU?

Has anybody used the CULA Sparse solver before? I downloaded the CULA Sparse demo program from the CULA site (http://www.culatools.com/downloads/sparse/) and tested it with different matrix sizes. The solver can run on the HOST (CPU) or the CUDA_DEVICE (GPU) platform.

What I found is that for matrix sizes ranging from 100x100 to 1000000x1000000, the CPU solver gave much better performance than the GPU solver. This was a real surprise to me. Can anyone explain these results?

Matrix size: 1000000x1000000 (NNZ=8999974)
CPU solver result: solver time 0.1s, total time 0.72s
GPU solver result: solver time 0.57s, total time 1.62s

Matrix size: 1000x1000 (NNZ=1100)
CPU solver result: solver time 0.0032s, total time 0.0033s
GPU solver result: solver time 0.46s, total time 0.79s

The machine I'm testing with is a MacBook Pro with an i7-4850HQ 2.30GHz CPU and a GeForce GT 750M GPU.
I have also tested on a Tesla GPU server box and got similar results.

This sounds very strange. Have you activated ‘persistence mode’ on your GPU? You can check with

nvidia-smi
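
If it is disabled, you can enable it (on Linux, with root privileges) like this:

sudo nvidia-smi -pm 1

Persistence mode keeps the driver loaded even when no client is connected, so repeated runs do not pay the driver initialization cost each time.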

Chung,

Thanks for the response. I'm new to CUDA, and I don't know how to activate 'persistence mode' on a Windows system; the nvidia-smi documentation says that setting is "Available on Linux only".

The fact that both the tiny problem and the large problem take almost the same amount of time on the GPU might suggest that you are timing some startup cost. Are you running an existing benchmark or did you write one yourself?
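
If you wrote it yourself, one way to check is to force CUDA context creation before you start the timer, so the one-time startup cost is excluded from the measurement. Here is a minimal sketch using the CUDA runtime API (not CULA-specific; the cudaFree(0) warm-up is just the usual idiom for forcing context initialization, and the POSIX timer would need an equivalent on Windows):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

/* Wall-clock timer (POSIX). */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now_sec();
    cudaFree(0);                 /* no-op that forces CUDA context creation */
    cudaDeviceSynchronize();
    double t1 = now_sec();
    printf("context init: %.3f s\n", t1 - t0);

    /* Start timing the actual solver call from here, so the
       startup cost above is not charged to the solve. */
    return 0;
}

On a laptop GPU that first call alone can take a few hundred milliseconds, which would be consistent with the roughly constant ~0.5 s solver time you see on the GPU even for the tiny 1000x1000 case.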

The data in my first post was from running the existing benchmark program I downloaded from the CULA site. The solver time and total time were taken from the program's screen output. That's why I'm very puzzled.

I first did a comparison run with my own program and noticed the CPU solver was faster than the CUDA_DEVICE solver with a 5000x5000 matrix, not even counting the data transfer time. I wanted to make sure it was not a problem with my code, so I downloaded the benchmark program to do this test.

Here is one screen output from my test with matrix size 23556x23556 (NNZ=484512):
data format: coo
platform: host
preconditioner: factorized approximate inverse
solver: bicgstab
iterations: 4
overhead time: 0.00471722s
precond. time: 0.0283143s
solver time: 0.00898205s
total time: 0.0420141s

data format: coo
platform: cuda
preconditioner: factorized approximate inverse
solver: bicgstab
iterations: 4
overhead time: 0.32134s
precond. time: 0.440139s
solver time: 0.200331s
total time: 0.961811s