Multi-GPU Implementation of the Minimum Volume Simplex Analysis Algorithm for Hyperspectral Unmixing

A new and exciting paper of mine. It took me one year to get it to this state, but it was certainly worth the time spent. Enjoy.
In this paper I efficiently implement the interior-point algorithm in a multi-GPU fashion using MPI. I believe it is one of the few attempts to make the interior-point algorithm multi-GPU while reducing the communication cost to the absolute minimum necessary. I think it is worth reading.
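For readers curious about the communication pattern, here is a minimal sketch of the idea in plain Python (standing in for the CUDA+MPI implementation; the function names and the column partitioning are my illustration, not the paper's actual code). Each "rank" does its local BLAS-style work on its own slice of the data, and the only communication is a single element-wise sum-reduction of a vector, analogous to MPI_Allreduce:

```python
# Sketch: column-partitioned matrix-vector product with one sum-reduction,
# mimicking the MPI_Allreduce communication pattern. Pure-Python stand-in:
# in the real implementation each "rank" is a GPU and the partial results
# come from cuBLAS. All names here are illustrative, not from the paper.

def matvec(A, x):
    """Reference dense matrix-vector product on plain lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

def partitioned_matvec(A, x, n_ranks):
    """Split the columns of A (and entries of x) across n_ranks workers.
    Each worker computes a partial result locally; a single element-wise
    sum-reduction (the only communication step) combines the partials."""
    n = len(x)
    chunk = (n + n_ranks - 1) // n_ranks
    partials = []
    for r in range(n_ranks):
        cols = range(r * chunk, min((r + 1) * chunk, n))
        # Local work: touches only this rank's slice of A and x.
        partials.append([sum(row[j] * x[j] for j in cols) for row in A])
    # Communication: one vector reduction, analogous to MPI_Allreduce(SUM).
    return [sum(p[i] for p in partials) for i in range(len(A))]

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
x = [1, 0, 2, 1]
assert partitioned_matvec(A, x, n_ranks=3) == matvec(A, x)  # [11, 27, 43]
```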

[url]http://www.umbc.edu/rssipl/people/aplaza/Papers/Journals/2014.JSTARS.GPUMVSA.pdf[/url]

I had actually just skimmed through this paper an hour ago (on [url]http://hgpu.org/[/url]) and appreciated the fact that you included fairly clear implementation instructions (p. 14).

The vast majority of papers published in the GPU area are extremely vague and at best only include pseudo-code. Since much of GPU programming is related to implementation details, such ‘vague’ papers are of little practical worth.

Actual source code is best, but your ‘CUDA pseudocode of the optimized predictor corrector interior point algorithm’ section is substantive enough to implement.

Thanks

Thanks. From what I recollect, you are also doing interesting work on GPUs and scientific computing.

All the best,
Alexander.

Just one comment. The presented algorithm is suitable for both Tesla and Intel Xeon Phi. It relies on BLAS operations, which are heavily optimized both in cuBLAS and in Intel MKL for MIC architectures. At our university we only have Teslas; people who have Phis can go ahead and port it there, and should find it quite easy to do so. The reviewer was very strict about making the paper as approachable as possible and about reporting communication/IO in our experiments, so we did our best and he/she was satisfied. The approachability of the paper is largely due to his/her insistent remarks.

It would be interesting to compare a Tesla and a Xeon Phi with the same number-crunching capabilities on this algorithm. I might do it in the future. The algorithm is scalable and vector-based, so it is perfect for both accelerators.

Just an idea: what about Hyper-Q (stream-based)? You could declare a portion of each vector for each stream to handle. What is missing is a reduce operation, like the one MPI provides for vectors, to eliminate the communication cost. Just a suggestion for NVIDIA :-) Actually for everyone. ;-) Streams can synchronize like processes in MPI, so it is doable with just the current technology…
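To make the stream idea above concrete, here is a minimal sketch in plain Python (threads stand in for CUDA streams under Hyper-Q; the names and chunking scheme are my illustration, not an existing API). Each "stream" handles its own portion of a vector, and a final reduce combines the per-stream partial results:

```python
# Sketch of the stream-based idea: each "stream" (here a thread, standing
# in for a CUDA stream under Hyper-Q) handles its own portion of a vector,
# and a final reduction combines the per-stream partial results.
# Names and structure are illustrative only.

from concurrent.futures import ThreadPoolExecutor

def dot_chunk(a, b, lo, hi):
    """Work assigned to one stream: partial dot product over [lo, hi)."""
    return sum(a[i] * b[i] for i in range(lo, hi))

def streamed_dot(a, b, n_streams=4):
    n = len(a)
    chunk = (n + n_streams - 1) // n_streams
    bounds = [(s * chunk, min((s + 1) * chunk, n)) for s in range(n_streams)]
    with ThreadPoolExecutor(max_workers=n_streams) as pool:
        partials = pool.map(lambda lh: dot_chunk(a, b, *lh), bounds)
        # The missing piece the post asks for: a reduce over the streams'
        # results, analogous to MPI_Reduce on a vector.
        return sum(partials)

a = list(range(8))   # [0, 1, ..., 7]
b = [1] * 8
assert streamed_dot(a, b) == 28
```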