Python: Use thread indexing as well as linear algebra stuff

Hey guys,

I have the following problem and hope someone might help me:

I have a big list (approximately 8000) where each element is a matrix (approximately size 15x3). Right now, I am iterating over that list and decompose each matrix using the singular value decomposition. This takes years on a CPU. Is there an easy way to do that in a parallized way on a GPU? I know that there exists multiple libraries, BUT: Numba does support thread indexing but no numpy or other libraries in it; PyCuda provides a cuda interface, but I am kind of “scared” about programming that in C (or is that problem pretty easy to implement in C/Cuda?).

Summarized: What I want to do is to parallize the for loop and let each thread compute the svd seperately for the small (15x3) matrices. Is that possible (primary in python)?

Thanks in advance :)

Horsti

[Later:] Numba documentation (if I understand it correctly) indicates the existence of numpy.linalg.svd. Have you checked whether that is suitable for your use case? @Robert_Crovella can probably give you much better pointers. If I recall correctly, he has a lot of expertise in Python+CUDA.

I am assuming “years” of CPU time is poetic license. What is the type of the matrix elements? Roughly how long does the processing currently take and how are you compiling the code? Have you profiled it? Is the context of this computation machine learning? If so, GPU accelerated machine learning frameworks may offer this functionality, possibly called “batched SVD decomposition”. I don’t have any personal knowledge or experience in this regard though.

Surely there must be some open-source SVD code in C or C++ somewhere on the internet? With that in hand, fire up 8000 GPU threads and have each thread work on one of these matrices. For very small matrices like this it is usually OK to not parallelize across the elements of the matrix and just have each thread operate on its own matrix.

I will point out that 8000 matrices of size 15x3 are small enough that modern CPUs will make excellent use of their caches to speed up the processing, and that if you parallelize across cores via OMP and use a recent compiler that supports autovectorization you might actually get very good performance on CPUs a well.

Thank you very much for your quick response :) If I get it correct, numba only provides numpy for their CPU optimizations. In their cuda section they explicitly mention: “Unsupported numpy features: array creation APIs. array methods. functions that returns a new array.”

Maybe I should add a more brief description: I am decomposing each matrix using SVD (I’ve just measured the time right now using the time library. This needs about 0,1-0,4 ms). Afterwards I am solving a minimization problem using BFGS based on each SVD for most of the matrices (this takes about 3ms for each matrix). After that is done, I need to have a for loop for some matrices. Summarized: SVD of each matrix, solve optimization problem for some matrices, loop over some matrices. Right now I am doing this sequentially, which takes some seconds (about 10), which is to much and actually sounds perfectly for parallelization.

I hope this clears my application a bit up.

There are some C++ codes to parallize the SVD such as: Parallel implementation for multiple SVDs using CUDA - Stack Overflow, but I want to use python.

A similar question was posted on SO which I made a brief response to. It is now deleted and another question was posted on SO and another question posted here. It seems evident that OP wants someone to provide a ready-made solution and I don’t know of one. Anything I could suggest will require some work/effort. I would simply repeat a comment on the latest SO question: "Unless you write your own batched SVD implementation, I doubt there is a solution " In my previous comment, I gave a link to a pythonic implementation of a batched matrix-inversion method (one per CUDA thread) to give an idea of a possible methodology, but the response was “that doesn’t help”, so it seems evident that OP is looking for an off-the-shelf solution. I don’t know of an off-the-shelf way to do one SVD per CUDA thread. However I’m quite certain it is doable, with some programming effort. As njuffa stated, there are SVD implementations in “pure” C++ available on the internet and one of these could be dropped in a CUDA thread. Place that in a pycuda wrapper, done. That’s essentially what I did for the one-matrix-inversion per thread I mentioned, although small matrix inversion is quite a bit simpler code-wise than SVD. It actually looks like the BFGS time would be the more significant thing to focus on for optimization, but I acknowledge there really isn’t enough info here to draw that conclusion.

Yes, I was/am looking for a pretty easy way to implement it, because it is actually not the focus (that I can spend to much time on it) of my work. I thought there might be an easy way. Maybe I give the C++ Version a try to integrate it, but I assume because I am very new there would occur some problems (thats the reason why I have tried to avoid it). Indeed the BFGS needs most of the time, but as I tried to imply, I just thought that there might be an easy way, where I can run some numpy/scipy/… functions in a specific thread. But after all the answers I guess it is much more difficult than I thought. But thank you very much for the time you spend to answer :)

Yes, it would not be a trivial undertaking.

For example, for the C++ example I linked, all usage of std::vector would have to be refactored, and there would be other std:: usage that would have to be dropped. Observing this, while working on your other, now-deleted question (apparently answers are expected immediately), I did a quick test with Eigen 3.3.7, since Eigen has several SVDs and has “some” adaptation for use in CUDA device code. Unfortunately I discovered the Eigen SVD are not adapted for direct use in CUDA device code. Other factors would be to define the exact input format. For example, for the 8000 matrices, are they all the same dimension? If not, some additional complexity in arranging an input format/scheme, perhaps with pointer-to-pointer methods, further complicating things. The devil, as they say, is in the details.

From this I conclude that BGFS is the bottleneck, not SVD. Unless I overlooked it, my question about the matrix element data type wasn’t answered. If it is double precision, any consumer-grade GPU is unlikely to help because a reasonably high-end CPU is going to match a reasonably high-end consumer GPU in double-precision throughput. I may be missing something, and it’s been years since I last looked at matrix decompositions, but 0.1 ms for the decomposition of a 13x5 matrix sounds significantly too high.

[Later:] I measure about 3 microseconds for a 15x3 double-precision SVD on a single core of a Xeon E3 1270 v2 (Ivy Bridge) , compiled from not very optimal-looking C source with the Intel compiler version 13.1.3.198 at maximum optimization. I found performance data from someone performing double-precision SVDs on 100x5 matrices with Intel MKL, and they report 6 microseconds per SVD on their unspecified machine.

Cusolver has a batched SVD routine: