Matrix Inversion 64x64 complex in CUDA?

Hi!

I'm looking for a way to invert a 64x64 matrix with complex elements. I found out that an efficient implementation is quite difficult to write. Is there any CUDA library with functions for this available yet? I have only found old posts and articles from one or two years ago stating that such a library is still TODO. I'm also thinking of doing this step in the middle of my CUDA program on the CPU…

At that size, I would probably just do it on the host. If you had a bigger matrix, it might well be worth “hybridizing” with the GPU for BLAS3 calls, but there aren’t that many flops in a 64x64 inversion or factorization to begin with, and I would be surprised if it were worth the effort.

Yup, this is what I expected. Do you know of some papers on GPU implementations and their performance (in code or just theory)? It would be nice to have something that shows why you wouldn't gain performance on the GPU, and maybe why you would only achieve low accuracy, since this inversion is part of my bachelor thesis.

There is a discussion of the relative SVD performance of CPU and GPU implementations in this report. I can’t remember off the top of my head whether they looked at complex as well as real-valued SVDs, but it might be a start.

If it’s in the middle of your program, it’s worth checking whether the memcpy back and forth would take more time than actually doing it on the GPU. Even if the GPU is 5x slower than the CPU at this size, it might still be worth staying on the GPU because of the memcpy overhead.

EDIT:

I just looked at some of my old results on a Quadro 4800 for SVD using culatools. For a 64x64 matrix the memcpy time was something like 2 ms (back and forth) and the Intel MKL compute time was around 0.15 ms. At the same time the recorded GPU kernel time was something like 30 ms (which sounds too large…), which would mean that in the case of SVD it’s definitely an advantage to send the data to the CPU here…

This info and this forum are worth a lot. Thanks Jimmy! I'm currently looking into LAPACK and boost-numeric-bindings.

A Google find that could be interesting: http://www.eetimes.com/design/signal-proce…Core-SC3850-DSP

The code (in the appendix) is not immediately familiar, but as it is intended for multicore, it could be something of a start.

Just curious: do you want to do just one inversion, or many 64x64 inversions in parallel? Do you want to get the explicit inverse, or solve Ax=b? Does the matrix have any structure, e.g. is it Hermitian?

Vasily

I could do a small number (fewer than 10) of inversions in parallel. I really need the explicit inverse, so there's no shortcut around it. My matrix has no special structure: just a 64x64 matrix with complex elements stored in alternating order (real, imaginary, …). This seems to be a bit of a problem because, as far as I have seen, most CPU libraries want both parts stored in one word/float/double. It looks like I cannot use LAPACK for this. Any hints on a usable library are highly welcome :)

LAPACK uses standard Fortran types such as COMPLEX and COMPLEX*16 to represent complex elements. In C that corresponds to struct { float real, imag; }. Isn’t that exactly the alternating order you want?

Vasily

Right, I want to alternate real and imaginary parts, using a separate float for each of them. AFAIK LAPACK needs both parts stored in one float/double, which would be a (little) bit of additional work that I could otherwise avoid. At the moment I'm using ALGLIB, which uses its own datatype (complex_2d_array) for storing such matrices. So still no perfect solution, but the time needed for this inversion is negligible in my app, I think.