Huge initialization delay using CUBLAS and CUSPARSE

I’m trying to run some code which generates a matrix and performs a Lanczos diagonalization on it. Previously, I’ve run the matrix-generating part on its own with no problems (it takes an average of about 0.1 s). However, my diagonalizer uses CUBLAS and CUSPARSE functions, and when I try to run both parts the code takes 30 s. I’ve used CUDA event timers to check how long each part is taking, and they report the following times:

Matrix generation - 156.426 ms
Lanczos diagonalization - 13.717 ms

Which doesn’t add up to 30 s. When I run nvvp, it shows whichever call happens to be my first CUDA call (usually cudaEventCreate or cudaMalloc) taking 25–30 s. I’ve also tried using device code repositories as described in the nvcc documentation, but this hasn’t fixed the problem.
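A minimal sketch of how that startup cost can be isolated (illustrative only, not the original code): the CUDA event timers only start once a context already exists, so timing the very first runtime call with a host clock exposes the initialization separately; cudaFree(0) is just a conventional dummy call that forces context creation.

```c
/* Illustrative sketch: time the very first CUDA runtime call with a host
 * clock. cudaEvent timers only start after the context exists, so they
 * miss the one-time context/library initialization cost.                 */
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wall_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    double t0 = wall_seconds();
    cudaFree(0);               /* dummy call that forces context creation */
    double t1 = wall_seconds();
    printf("CUDA initialization: %.3f s\n", t1 - t0);

    /* ... matrix generation and Lanczos diagonalization, timed with
     * cudaEvent as before, now without the startup cost included ...     */
    return 0;
}
```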

Does including CUSPARSE and CUBLAS really cause a 30s runtime initialisation penalty? If not, what can I do to fix this?

I’m running Ubuntu 10.10 x64, driver 285.05.15, and CUDA toolkit 4.1.15 on a GeForce 560 Ti.

As you might have noticed, the CUBLAS and CUSPARSE libraries are pretty big because they contain the PTX version of every kernel along with the binary SASS. The PTX is kept there so that, if you use a future chip with that library version, the driver will be able to JIT-compile those kernels. The PTX can take up to 75% of the library size.

The libraries get loaded at the first invocation of any CUDA runtime routine. Depending on where those libraries are located (local hard disk or network), the loading time may vary; the driver has to read those libraries to extract the kernels’ SASS and load them onto your board.
In the final 4.1 toolkit, we have already improved the loading of the CUSPARSE library by about a factor of 2.
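If you want that one-time cost to stay out of the timings of your individual sections, one option (a sketch only, assuming your code creates CUBLAS/CUSPARSE handles anyway) is to trigger the loading explicitly before the timed work starts:

```c
/* Sketch (assumption, not from the thread): pay the library-loading cost at
 * a point of your choosing, before any timed section runs.                  */
#include <cublas_v2.h>
#include <cusparse.h>

static cublasHandle_t   blasHandle;
static cusparseHandle_t sparseHandle;

void warm_up_cuda_libraries(void)
{
    /* These calls initialize the CUDA context if necessary and cause the
     * driver to read the libraries and load the kernels' SASS on the device. */
    cublasCreate(&blasHandle);
    cusparseCreate(&sparseHandle);
}
```

This does not make the loading any faster; it only moves it to a point where it does not distort the per-section measurements.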

To give you an idea of the loading times, on my own Linux PC (Ubuntu 64-bit), with the libraries located on my local hard disk:

  • CUBLAS takes 0.63 s to load and CUSPARSE 2.41 s on the first run.
  • On the second run, with the help of Linux file caching, the CUBLAS loading time falls to 0.28 s and CUSPARSE to 1.47 s.

I am still surprised that you see an initialization time of up to 30 s. Where is your CUDA toolkit installed?
Is it possible that you are picking up some libraries over NFS on a potentially slow network?
Your problem might not be only this library-loading issue.

For the future, we plan to work on a PTX compression scheme to significantly reduce the size of our libraries.

I’ve got CUDA 4.1.15 installed with 285.05.15 as my devdriver. My libraries are installed on my local hard disk (in /usr/local/cuda).

Should I try installing the final release of 4.1 and see if that fixes things? I had a similar problem back with 3.2.

Yes, please.

It will not totally fix the problem, but it should reduce the delay.

If you do not see any improvement, then you might have another problem.

Installing the new driver + toolkit has helped quite a bit - thank you! I’m down to 9.6 s now. Is there anything else I can do to speed up this initialization time?