Obtaining the upper/lower triangular matrix (CUBLAS): a basic step required to solve a linear system

CUBLAS has a couple of solvers that you can use, but they require a matrix that is already upper or lower triangular. If I have a dense banded matrix A (cublasStbsv should fit pretty well), how can I reduce it to upper or lower triangular form in a massively parallel, efficient way?

This is a common question. I don't believe there is a readily available solution in CUDA yet for this problem (solving Ax=y, or AX=Y). There are plenty of hints that people are working on it.

But see: http://forums.nvidia.com/index.php?showtopic=89084&hl=
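
For reference, the reduction asked about above is plain Gaussian elimination restricted to the band. The following is only a minimal CPU sketch in C, not a CUBLAS routine and not a tuned GPU implementation. The function names band_eliminate and back_substitute, the row-major full storage, and the 1e-12 pivot threshold are all assumptions made for illustration. It also assumes no pivoting is needed (for example, A is diagonally dominant); a general matrix would need partial pivoting, which widens the band.

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical helper: forward elimination without pivoting on a
 * row-major n x n matrix a with lower bandwidth kl and upper bandwidth ku.
 * On return, a holds the upper triangular factor U (entries below the
 * diagonal inside the band are set to zero) and y holds the transformed
 * right-hand side. Returns 0 on success, -1 if a pivot is (near) zero. */
int band_eliminate(float *a, float *y, int n, int kl, int ku)
{
    for (int k = 0; k < n - 1; ++k) {
        float pivot = a[(size_t)k * n + k];
        if (fabsf(pivot) < 1e-12f)   /* assumed threshold, not tuned */
            return -1;

        int last_row = (k + kl < n - 1) ? k + kl : n - 1;
        int last_col = (k + ku < n - 1) ? k + ku : n - 1;

        /* Within one step k, the updates of rows k+1..last_row do not
         * depend on each other, so this loop is the part that could be
         * spread across GPU threads. */
        for (int i = k + 1; i <= last_row; ++i) {
            float m = a[(size_t)i * n + k] / pivot;
            a[(size_t)i * n + k] = 0.0f;
            for (int j = k + 1; j <= last_col; ++j)
                a[(size_t)i * n + j] -= m * a[(size_t)k * n + j];
            y[i] -= m * y[k];
        }
    }
    return (fabsf(a[(size_t)(n - 1) * n + (n - 1)]) < 1e-12f) ? -1 : 0;
}

/* Hypothetical helper: solve U x = c for an upper triangular U with
 * upper bandwidth ku, stored row-major in full n x n form. */
void back_substitute(const float *u, const float *c, float *x, int n, int ku)
{
    for (int i = n - 1; i >= 0; --i) {
        float s = c[i];
        int last_col = (i + ku < n - 1) ? i + ku : n - 1;
        for (int j = i + 1; j <= last_col; ++j)
            s -= u[(size_t)i * n + j] * x[j];
        x[i] = s / u[(size_t)i * n + i];
    }
}
```

For a tridiagonal system, kl and ku are both 1. After band_eliminate, the matrix is upper triangular with the same upper bandwidth, so once it is packed into banded storage it is the kind of input a triangular banded solver such as cublasStbsv expects; back_substitute here just stands in for that step on the CPU. The steps over k remain sequential, which is the main obstacle to making this massively parallel.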