Parallel Cyclic Reduction

I am trying to understand the parallel cyclic reduction algorithm described in this cudpp source code:
http://cudpp.googlecode.com/svn-history/r96/branches/tridiagonal/cudpp/src/kernel/pcr_kernel.cu

Am I correct to assume that if I want to use more than one block, I have to decouple the tridiagonal system into smaller tridiagonal systems, one per block I want to use? For example, if I want to solve a 1024 × 1024 tridiagonal system, do I need to decouple it into two 512 × 512 tridiagonal systems first, before I can use the kernel provided with two blocks?

How should I decouple the tridiagonal system if that is the case? Should I modify the kernel so that only global memory is used, run a few iterations until the system is decoupled into a sufficient number of smaller tridiagonal systems, and then solve those systems using the original kernel?

I don’t know the code in CUDPP, but I wouldn’t be surprised if it was based on the first paper below. Both cyclic reduction papers that I know of,
http://graphics.cs.ucdavis.edu/publication…_pub?pub_id=978 and my own (“Cyclic Reduction Tridiagonal Solvers on GPUs Applied to Mixed-Precision Multigrid”), solve many smaller tridiagonal systems in parallel, i.e., one system per block. One could probably come up with a technique for larger systems (similar to the “scan large arrays” example in the SDK), but that would inevitably be much less efficient because of the many additional round trips to off-chip memory.
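To make the per-block structure concrete: I haven’t looked at the CUDPP kernel, but a minimal sketch of the idea used in both papers, with one system per block, one thread per equation, and the four coefficient arrays held in shared memory, could look roughly like this (function and variable names are my own, not CUDPP’s):

[code]
// Minimal PCR sketch: one tridiagonal system per block, one thread per
// equation. a = sub-diagonal, b = diagonal, c = super-diagonal,
// d = right-hand side; the solution ends up in x.
__global__ void pcrSolveOnePerBlock(const float *aG, const float *bG,
                                    const float *cG, const float *dG, float *xG)
{
    extern __shared__ float s[];              // 4 * n floats, passed at launch
    const int n   = blockDim.x;               // equations in this system
    const int i   = threadIdx.x;
    const int sys = blockIdx.x * n;            // offset of this block's system
    float *a = s, *b = a + n, *c = b + n, *d = c + n;

    a[i] = aG[sys + i];  b[i] = bG[sys + i];
    c[i] = cG[sys + i];  d[i] = dG[sys + i];
    __syncthreads();

    // Each step eliminates the neighbours at distance 'delta', so afterwards
    // equation i only couples to equations i - 2*delta and i + 2*delta.
    for (int delta = 1; delta < n; delta *= 2) {
        float k1 = (i - delta >= 0) ? a[i] / b[i - delta] : 0.0f;
        float k2 = (i + delta <  n) ? c[i] / b[i + delta] : 0.0f;

        float aN = (i - delta >= 0) ? -a[i - delta] * k1 : 0.0f;
        float cN = (i + delta <  n) ? -c[i + delta] * k2 : 0.0f;
        float bN = b[i] - ((i - delta >= 0) ? c[i - delta] * k1 : 0.0f)
                        - ((i + delta <  n) ? a[i + delta] * k2 : 0.0f);
        float dN = d[i] - ((i - delta >= 0) ? d[i - delta] * k1 : 0.0f)
                        - ((i + delta <  n) ? d[i + delta] * k2 : 0.0f);

        __syncthreads();                       // all reads of old values done
        a[i] = aN;  b[i] = bN;  c[i] = cN;  d[i] = dN;
        __syncthreads();                       // all writes of new values done
    }

    // After log2(n) steps every equation is decoupled: b[i] * x[i] = d[i].
    xG[sys + i] = d[i] / b[i];
}
[/code]

It would be launched as pcrSolveOnePerBlock<<<numSystems, systemSize, 4 * systemSize * sizeof(float)>>>(a, b, c, d, x), which is exactly why one block handles one system and why the system size is capped by the maximum block size (512 threads) and by shared memory.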

Thanks for the links. I am reading the first paper, and it is quite enjoyable. I will read your paper once I finish the first one.

What I don’t understand is how to solve systems with more than 512 equations.

From some papers I have read, the use of shared memory is important for the speedup. For my application, I intend to solve a 3D diffusion equation on a 100 × 100 × 100 grid. I find it hard to think of a way that uses shared memory to speed up the solve in this case. It can be done using global memory, but I think the performance will be seriously compromised.

Is there any way to solve such a system?

Somewhat related, this presentation from GDC uses PCR (implemented in DX Compute) for a depth-of-field effect:
http://developer.download.nvidia.com/presentations/2010/gdc/metro.pdf

Excuse my mathematical ignorance, but won’t a 3D diffusion equation give you a system which isn’t tridiagonal? How would you use cyclic reduction for that?

I have read the presentation. From what I have learnt from it (which may be wrong), I am thinking of first breaking the 1024 × 1024 tridiagonal system into 64 smaller tridiagonal subsystems using PCR without shared memory, then solving the 64 subsystems with the Thomas algorithm.

  1. Assign one thread to each of the 1024 equations.

  2. Perform 6 PCR iterations (each iteration doubles the number of independent subsystems, so this gives 2^6 = 64 subsystems of 16 equations each).

  3. Assign one thread to each subsystem, then solve each subsystem using the Thomas algorithm. (I think this is what they meant by simplified Gaussian elimination.)

or alternatively,

  1. Assign each block ID to one subsystem, then solve using the algorithm given in CUDPP (which is also based on PCR, I think).

Would there be a severe performance hit with this approach, since no shared memory is used at all (except in the alternative, which uses the CUDPP kernel)?
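To make the second phase concrete, here is a rough sketch of what I have in mind (my own names, not taken from CUDPP or the slides). It assumes the decoupling phase has already run k PCR steps on global-memory arrays, using the same per-equation update as the shared-memory sketch earlier in the thread but with one kernel launch per step, so that the equations split into stride = 2^k independent subsystems with indices r, r + stride, r + 2*stride, …; each subsystem is then solved by one thread with the Thomas algorithm:

[code]
// Sketch only: one thread per subsystem after k global-memory PCR steps.
// Subsystem r consists of equations r, r + stride, r + 2*stride, ... with
// stride = 2^k; within it, a[i] couples to the previous and c[i] to the
// next element of the same subsystem.
__global__ void thomasPerSubsystem(const float *a, const float *b,
                                   const float *c, const float *d,
                                   float *x, int n, int stride)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= stride) return;

    const int MAX_LEN = 16;              // n / stride; 16 for 1024 eqs, k = 6
    float cp[MAX_LEN], dp[MAX_LEN];
    int len = n / stride;

    // Forward elimination along the strided chain of equations.
    cp[0] = c[r] / b[r];
    dp[0] = d[r] / b[r];
    for (int j = 1; j < len; ++j) {
        int i = r + j * stride;
        float m = 1.0f / (b[i] - a[i] * cp[j - 1]);
        cp[j] = c[i] * m;
        dp[j] = (d[i] - a[i] * dp[j - 1]) * m;
    }

    // Back substitution.
    x[r + (len - 1) * stride] = dp[len - 1];
    for (int j = len - 2; j >= 0; --j)
        x[r + j * stride] = dp[j] - cp[j] * x[r + (j + 1) * stride];
}
[/code]

For the 1024-equation example, k = 6 gives stride = 64 subsystems of 16 equations each, so this kernel runs with only 64 threads in total. The strided global-memory accesses are not coalesced either, which I suspect is where most of the performance hit would come from.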

True, the Crank-Nicolson scheme will result in a non-tridiagonal sparse matrix. I am using the Alternating-Direction Implicit (ADI) method, which splits the operator. Basically, the idea is to divide each timestep into three substeps of size Δt/3; in each substep, a different dimension is treated implicitly. Each substep then produces tridiagonal systems along the implicit direction.
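Written out in LaTeX notation (schematically; \delta_x^2, \delta_y^2, \delta_z^2 are the second-difference operators along each axis and \alpha is the diffusion coefficient; the exact variant and its stability properties are what Numerical Recipes discusses), the three substeps I have in mind are:

\left(1 - \tfrac{\alpha\,\Delta t}{3}\,\delta_x^2\right) u^{n+1/3} = \left(1 + \tfrac{\alpha\,\Delta t}{3}\,(\delta_y^2 + \delta_z^2)\right) u^{n}

\left(1 - \tfrac{\alpha\,\Delta t}{3}\,\delta_y^2\right) u^{n+2/3} = \left(1 + \tfrac{\alpha\,\Delta t}{3}\,(\delta_x^2 + \delta_z^2)\right) u^{n+1/3}

\left(1 - \tfrac{\alpha\,\Delta t}{3}\,\delta_z^2\right) u^{n+1} = \left(1 + \tfrac{\alpha\,\Delta t}{3}\,(\delta_x^2 + \delta_y^2)\right) u^{n+2/3}

Each left-hand operator is tridiagonal along one axis only, so every substep reduces to a batch of independent tridiagonal solves (100 × 100 = 10,000 systems of 100 equations each on my grid), which is where the cyclic reduction solver comes in.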

A good reference is the Numerical Recipes series. The ADI method is described in a clear and simple manner in the chapter on PDEs.