LU decomposition: how to do a high-performance LU decomposition with CUDA

I want to do LU decomposition of dense matrices, but the performance is poor.
Could somebody help me work out a high-performance LU decomposition with CUDA?
:)

You can find a good article about it HERE

thanks!

I also found this, which might help (blocked decomposition):
http://www.noctua-blog.co.nf/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/

http://blog-noctua.rhcloud.com/index.php/2011/04/21/lu-matrix-decomposition-in-parallel-with-cuda/
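
In case "blocked decomposition" is unclear, here is a minimal CPU-side sketch of the idea in plain Python. It is not taken from the linked article. The function name blocked_lu and the block-size parameter bs are my own, and it assumes a square matrix that needs no pivoting (a real solver would use partial pivoting).

```python
def blocked_lu(a, bs):
    """Right-looking blocked LU without pivoting, done in place.

    a  : square matrix as a list of lists of numbers
    bs : block size (hypothetical parameter, chosen for illustration)
    On return, a holds L below the diagonal (unit diagonal implied)
    and U on and above the diagonal.
    """
    n = len(a)
    for k0 in range(0, n, bs):
        k1 = min(k0 + bs, n)

        # Step 1: factor the current panel (columns k0..k1-1) unblocked.
        for k in range(k0, k1):
            piv = a[k][k]  # assumed nonzero, since there is no pivoting
            for i in range(k + 1, n):
                a[i][k] /= piv
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]

        # Step 2: compute the row block of U to the right of the panel
        # by forward substitution with the unit lower-triangular block.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]

        # Step 3: update the trailing submatrix, A22 -= L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a


m = [[4.0, 3.0, 2.0],
     [8.0, 8.0, 5.0],
     [4.0, 7.0, 7.0]]
packed = blocked_lu(m, 2)  # L and U packed into one matrix
```

Step 3 is a plain matrix-matrix multiply, and for large matrices that is where most of the arithmetic ends up. As far as I understand it, that is why the blocked form suits the GPU: the small panel factorization in step 1 stays mostly sequential, while steps 2 and 3 can be spread across many threads.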