Parallel Cyclic Reduction

I don’t know the code in CUDPP, but I wouldn’t be surprised if it were based on the first paper below. Both cyclic reduction papers that I know of, http://graphics.cs.ucdavis.edu/publication…_pub?pub_id=978 and my own (“Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid”), solve many small tridiagonal systems in parallel, i.e., one system per thread block. One could probably come up with a technique for larger systems (similar to the “scan large arrays” example in the SDK), but that would inevitably be much less efficient because of the many more round trips to off-chip memory.
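
For anyone unfamiliar with the method, here is a minimal sequential sketch of classic cyclic reduction for a single small system, in plain Python. It is not the CUDPP code and not the kernel from either paper; the function name `cyclic_reduction` and the restriction to n = 2^k - 1 unknowns are my own simplifying assumptions. Each inner `for i` loop only touches independent equations, so that loop is the part a GPU version would presumably hand out to the threads of one block.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for x.

    Hypothetical helper for illustration only. Assumes len(b) == 2**k - 1,
    a[0] == 0, c[-1] == 0, and no zero pivots (e.g. diagonally dominant).
    """
    n = len(b)
    if n == 0 or (n & (n + 1)) != 0:
        raise ValueError("this sketch assumes n = 2**k - 1 unknowns")
    a, b, c, d = list(a), list(b), list(c), list(d)  # work on copies

    # Forward reduction: at each level, eliminate the neighbours at
    # distance s from every equation i with (i + 1) divisible by 2*s.
    s = 1
    while 2 * s <= n:
        for i in range(2 * s - 1, n, 2 * s):  # independent iterations
            left, right = i - s, i + s        # both always in range here
            k1 = a[i] / b[left]
            k2 = c[i] / b[right]
            b[i] = b[i] - c[left] * k1 - a[right] * k2
            d[i] = d[i] - d[left] * k1 - d[right] * k2
            a[i] = -a[left] * k1              # now couples to x[i - 2*s]
            c[i] = -c[right] * k2             # now couples to x[i + 2*s]
        s *= 2

    # A single equation in a single unknown remains, at index s - 1.
    x = [0.0] * n
    mid = s - 1
    x[mid] = d[mid] / b[mid]

    # Backward substitution: fill in the unknowns skipped at each level.
    h = s // 2
    while h >= 1:
        for i in range(h - 1, n, 2 * h):      # independent iterations
            xl = x[i - h] if i - h >= 0 else 0.0
            xr = x[i + h] if i + h < n else 0.0
            x[i] = (d[i] - a[i] * xl - c[i] * xr) / b[i]
        h //= 2
    return x


if __name__ == "__main__":
    # 1D Poisson-like system with 7 unknowns.
    n = 7
    a = [0.0] + [-1.0] * (n - 1)
    b = [2.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [0.0] * (n - 1) + [8.0]
    print(cyclic_reduction(a, b, c, d))
```

The number of active equations halves at every forward level, which is why a small system can stay on-chip for the whole solve, whereas a system too large for one block would have to exchange intermediate results through off-chip memory between levels.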