After experimenting with cuSPARSE I have reached the conclusion that using cuBLAS as much as possible is the easiest and fastest option for my work.
The issue now is finding the fastest way to convert a matrix in CSC sparse format on the host into dense format in device memory. I can do the conversion on the CPU (and copy the dense version over), but I am looking for a faster option if possible.
cuSPARSE has the cusparseScsc2dense() function, but its documentation is less thorough than the cuBLAS documentation.
In particular, the cuSPARSE documentation does not make clear whether the cscValA, cscRowIndA, and cscColPtrA arrays are expected in host or device memory.
So do I have to create those arrays in device memory, copy the host versions of them to the device, and then call cusparseScsc2dense() to do the conversion to dense on the device? Or is there another less messy and faster way? I can write my own kernel to do this, but the functions/kernels in the SDK are usually better optimized.
In general I have found that cuBLAS outperforms cuSPARSE in most cases, and has far less boilerplate. Since I have a way to compute the LU decomp for dense matrices, there really is no point in using the sparse format unless I have some massive data set (over 100 million elements with less than 25% non-zeros).
It turns out the most efficient way is to copy the CSC arrays to device memory and use cusparseScsc2dense() to convert them into a dense matrix already allocated in device memory.
That whole process, including the allocations, copies, and the conversion function call, takes only a few milliseconds on my home PC, so that is good enough.
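For anyone else who hits this, here is a minimal sketch of that workflow, assuming the legacy cusparseScsc2dense() API, zero-based indexing, and column-major storage. The sizes m, n, nnz and the host arrays h_cscVal, h_cscRowInd, h_cscColPtr are placeholders, and error checking is omitted:

```c
#include <cuda_runtime.h>
#include <cusparse.h>

float *d_cscVal, *d_dense;
int   *d_cscRowInd, *d_cscColPtr;

/* allocate the CSC arrays and the dense output on the device */
cudaMalloc((void **)&d_cscVal,    nnz * sizeof(float));
cudaMalloc((void **)&d_cscRowInd, nnz * sizeof(int));
cudaMalloc((void **)&d_cscColPtr, (n + 1) * sizeof(int));
cudaMalloc((void **)&d_dense,     (size_t)m * n * sizeof(float));

/* copy the host CSC arrays over */
cudaMemcpy(d_cscVal,    h_cscVal,    nnz * sizeof(float),   cudaMemcpyHostToDevice);
cudaMemcpy(d_cscRowInd, h_cscRowInd, nnz * sizeof(int),     cudaMemcpyHostToDevice);
cudaMemcpy(d_cscColPtr, h_cscColPtr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);

cusparseHandle_t   handle;
cusparseMatDescr_t descr;
cusparseCreate(&handle);
cusparseCreateMatDescr(&descr);  /* defaults: general matrix, zero-based indexing */

/* expand the CSC arrays into the column-major dense matrix d_dense (lda = m) */
cusparseScsc2dense(handle, m, n, descr,
                   d_cscVal, d_cscRowInd, d_cscColPtr,
                   d_dense, m);
```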
In general I am finding that cuSPARSE is a pain to deal with and not faster than cuBLAS. So now when I get those damn CSC matrices I just convert to dense and use my own kernels + cuBLAS to do my solvers.
The sparse format makes more sense when using the CPU, but not as much on the GPU.
It seems you already puzzled out an answer to your question. Meanwhile, I checked with the CUSPARSE team, and received the following pointers:
As stated in the CUSPARSE documentation, "[t]he CUSPARSE API assumes that input and output data reside in GPU (device) memory, unless it is explicitly indicated otherwise by the string DevHostPtr in a function parameter’s name (for example, the parameter *resultDevHostPtr in the function cusparsedoti()).”
This means that for the function in question the arrays are in device memory, and in order to obtain the result of the conversion on the device while starting from the host, a programmer would have to either:

- allocate the required memory on the device, copy the input arrays onto the device, and call the conversion routine to obtain the result on the device; or
- convert the input arrays on the host, allocate the required memory for the result on the device, and copy the result from host to device.
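For reference, the second option can be as simple as a loop over the CSC column pointers on the host; a minimal sketch, assuming zero-based indexing and a zero-initialized, column-major output buffer (the function name is hypothetical):

```c
/* Host-side CSC-to-dense expansion: A is m x n, column-major,
   and assumed to be zero-initialized by the caller. */
void csc_to_dense_host(int m, int n,
                       const float *cscVal, const int *cscRowInd,
                       const int *cscColPtr, float *A)
{
    for (int col = 0; col < n; ++col) {
        for (int k = cscColPtr[col]; k < cscColPtr[col + 1]; ++k) {
            A[(size_t)col * m + cscRowInd[k]] = cscVal[k];
        }
    }
}
```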
njuffa,
The only reason I thought this particular instance was different was that I mistakenly interpreted this statement:
“This function requires no extra storage. It is executed asynchronously with respect to the host and it may return control to the application on the host before the result is ready.”
to mean that this was some optimized implementation which handled the copies itself.
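In case it helps anyone else, the practical consequence of that statement is just that a synchronization point is needed before the host consumes the dense result. A minimal sketch, reusing the placeholder names from the conversion snippet above (h_dense is a hypothetical host buffer):

```c
/* The conversion runs asynchronously with respect to the host. */
cusparseScsc2dense(handle, m, n, descr,
                   d_cscVal, d_cscRowInd, d_cscColPtr, d_dense, m);

cudaDeviceSynchronize();   /* explicit sync; a blocking cudaMemcpy on the same
                              (default) stream would also order correctly */
cudaMemcpy(h_dense, d_dense, (size_t)m * n * sizeof(float),
           cudaMemcpyDeviceToHost);
```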
Unlike the good cuBLAS documentation, the cuSPARSE documentation does not explicitly classify the inputs/outputs as device or host.
I had hoped to use the sparse incomplete Cholesky factorization function, but realized that I did not want zero fill-in.
Also, I quite often need to calculate A^T * A + eye(n), and I can do that all in one call with cuBLAS Sgemm(), while with cuSPARSE I would have to first estimate the number of non-zeros in the result, call the sparse-sparse multiply function (which does not also add into the result matrix C), and then do more processing to add the identity matrix to the result.
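A minimal sketch of that one-call pattern, assuming A is m x n, column-major, already in device memory as d_A, and d_C is an n x n device buffer (all names are placeholders): preload C with the identity and use beta = 1, so the add comes for free with the multiply.

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* build the identity on the host and load it into d_C */
float *h_eye = (float *)calloc((size_t)n * n, sizeof(float));
for (int i = 0; i < n; ++i)
    h_eye[(size_t)i * n + i] = 1.0f;
cudaMemcpy(d_C, h_eye, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
free(h_eye);

cublasHandle_t blas;
cublasCreate(&blas);

/* with C preloaded with I and beta = 1:  C = 1 * A^T * A + 1 * C = A^T * A + I */
const float one = 1.0f;
cublasSgemm(blas, CUBLAS_OP_T, CUBLAS_OP_N,
            n, n, m,            /* result is n x n, inner dimension m       */
            &one, d_A, m,       /* op(A) = A^T, A stored m x n with lda = m */
                  d_A, m,       /* op(B) = A                                */
            &one, d_C, n);      /* beta = 1 keeps the identity in C         */
```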
cuBLAS is way easier to use, and my data sets (rows * columns) are under 200 million elements anyway, so it is more trouble to go through all the pre-processing steps needed for the sparse routines than to just hog the memory.
Thanks for the response.