I have the following problem and hope someone might help me:
I have a big list (about 8000 elements) where each element is a matrix of roughly 15x3. Right now, I am iterating over that list and decomposing each matrix with the singular value decomposition (SVD). This takes ages on a CPU. Is there an easy way to parallelize this on a GPU? I know that several libraries exist, BUT: Numba supports thread indexing in CUDA kernels, but not NumPy or other libraries inside them; PyCUDA provides a CUDA interface, but I am kind of “scared” of writing the kernel in C (or is this problem fairly easy to implement in C/CUDA?).
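For reference, a minimal sketch of the loop described above next to a batched alternative, assuming the matrices all have the same shape and can be stacked into one NumPy array. `np.linalg.svd` treats leading axes as a batch dimension, so a single call decomposes every matrix without a Python loop. The random data is a hypothetical stand-in for the real list.

```python
import numpy as np

# Hypothetical stand-in for the real data: 8000 random 15x3 matrices.
rng = np.random.default_rng(0)
matrices = [rng.standard_normal((15, 3)) for _ in range(8000)]

# Current approach: a Python loop with one SVD call per matrix.
loop_s = np.array([np.linalg.svd(m, compute_uv=False) for m in matrices])

# Batched approach: stack the list into one (8000, 15, 3) array.
# np.linalg.svd treats the leading axis as a batch dimension and
# decomposes every 15x3 matrix in a single call.
stack = np.stack(matrices)
U, S, Vt = np.linalg.svd(stack, full_matrices=False)

print(U.shape, S.shape, Vt.shape)  # (8000, 15, 3) (8000, 3) (8000, 3, 3)
print(np.allclose(S, loop_s))      # True: same singular values as the loop
```

If PyTorch is available, `torch.linalg.svd` also accepts a batch of matrices (shape `(*, m, n)`) and runs on CUDA tensors, so the same stacking idea carries over to the GPU. Whether that actually beats the batched CPU call for matrices this small is worth measuring.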
In short: I want to parallelize the for loop so that each thread computes the SVD of one small (15x3) matrix. Is that possible, preferably in Python?
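If each thread really is to handle one matrix, one way to keep the per-thread work small (a swapped-in technique, not a full SVD routine): for a tall 15x3 matrix A, the right singular vectors and squared singular values are the eigenvectors and eigenvalues of the 3x3 matrix A^T A. A kernel would then only need to form A^T A and solve a 3x3 symmetric eigenproblem, which would still have to be hand-written inside a Numba or CUDA kernel (for example with Jacobi rotations). The NumPy sketch below, with the hypothetical helper `svd_via_gram`, only checks this math on the CPU and assumes each matrix has full column rank.

```python
import numpy as np

def svd_via_gram(a):
    """Thin SVD of a tall m x n matrix (m >= n) via its n x n Gram matrix.

    Sketch only: forming a.T @ a squares the condition number, so small
    singular values are less accurate than with np.linalg.svd, and the
    division below assumes all singular values are nonzero (full rank).
    """
    gram = a.T @ a                            # 3x3 for a 15x3 input, symmetric PSD
    eigvals, V = np.linalg.eigh(gram)         # eigenvalues in ascending order
    eigvals, V = eigvals[::-1], V[:, ::-1]    # descending, like np.linalg.svd
    s = np.sqrt(np.clip(eigvals, 0.0, None))  # singular values
    U = (a @ V) / s                           # u_i = A v_i / s_i
    return U, s, V.T

# Hypothetical test matrix, compared against NumPy's own SVD.
rng = np.random.default_rng(1)
a = rng.standard_normal((15, 3))
U, s, Vt = svd_via_gram(a)
print(np.allclose(s, np.linalg.svd(a, compute_uv=False)))  # True
print(np.allclose((U * s) @ Vt, a))                        # True
```

The trade-off is accuracy: for ill-conditioned matrices the Gram-matrix route loses precision in the smallest singular values, so it only makes sense if that is acceptable for the data at hand.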
Thanks in advance :)