iterative solvers on GPU

Okay, I believe I have a good question about GPU programming.

I am currently trying to develop a fluid simulation program on the GPU. The simulation iterates for mass convergence, and within that mass-convergence loop it runs an iterative solver - conjugate gradient. I built a conjugate gradient solver based on the Concurrent Number Cruncher (CNC) code I found online, but discovered that the cost of writing back and forth between the GPU and the HOST/CPU killed any gains in performance. So I thought about pushing the whole algorithm onto the GPU using CUDA. The following illustrates the code that interfaces with the CUDA file for the GPU-based solver, and it works great:

extern "C" void cudaMatVecMult(float *matrix, unsigned int size_matrix, uint2 *rowptr, unsigned int size_rowptr, unsigned int *colind, unsigned int size_colind, float * x, float *b, unsigned int size_vec);

extern "C" void cudaVecVecMult(unsigned int size, float *x, float *y, float *r);

...

void GPUFertm::gpuMult(const void *x, void *y, unsigned int vec_size){
	// Invoke the C-code function that in turn calls the
	// CUDA/GPU kernel for matrix-vector multiplication
	cudaMatVecMult((float*)gAhat, matrixSize_, (uint2*)gpu_rowptr, rowLength_,
		(unsigned int*)gpu_colind, colLength_, (float*)x, (float*)y, vec_size);
}//gpuMult()

...

bool GPUFertm::pccgSolver(float *presoln){
	try{
		// Validity check(s):
		if(N_ == 0 || rowLength_ == 0 || colLength_ == 0 || matrixSize_ == 0){
			printf("Invalid values set for MASS-Convergence call!\n");
			return false;
		}

		// Local variable(s):
		unsigned int its = 0;
		float alpha = 0.0f, beta = 0.0f;

		// r = A*x
		gpuMult(gpu_x, gpu_r, N_);

		// r = b - A*x  (computed as r = -(A*x - b))
		cublasSaxpy(N_, -1.0f, (float*)gpu_b, 1, (float*)gpu_r, 1);
		cublasSscal(N_, -1.0f, (float*)gpu_r, 1);

		// d = M^-1 * r  (apply the diagonal preconditioner elementwise)
		cudaVecVecMult(N_, (float*)gAdiagpre, (float*)gpu_r, (float*)gpu_d);

		// cur_err = rT*d
		float cur_err = cublasSdot(N_, (float*)gpu_r, 1, (float*)gpu_d, 1);

		// Stopping threshold, relative to the initial error:
		float err = (float)(cur_err * EPSILON * EPSILON);

		while (cur_err > err && (int)its < MAX_ITERS) {
			// Ad = A*d
			gpuMult(gpu_d, gpu_Ad, N_);

			// alpha = cur_err / (dT*Ad)
			alpha = cur_err / cublasSdot(N_, (float*)gpu_d, 1, (float*)gpu_Ad, 1);

			// x = x + alpha * d
			cublasSaxpy(N_, alpha, (float*)gpu_d, 1, (float*)gpu_x, 1);

			// r = r - alpha * Ad
			cublasSaxpy(N_, -alpha, (float*)gpu_Ad, 1, (float*)gpu_r, 1);

			// h = M^-1 * r
			cudaVecVecMult(N_, (float*)gAdiagpre, (float*)gpu_r, (float*)gpu_h);

			float old_err = cur_err;

			// cur_err = rT * h
			cur_err = cublasSdot(N_, (float*)gpu_r, 1, (float*)gpu_h, 1);

			beta = cur_err / old_err;

			// d = h + beta * d
			cublasSscal(N_, beta, (float*)gpu_d, 1);
			cublasSaxpy(N_, 1.0f, (float*)gpu_h, 1, (float*)gpu_d, 1);

			++its;
		}//WHILE-LOOP (PCCG convergence)

		// Retrieve the solution from the GPU (GPU->CPU):
		cublasGetVector(N_, sizeof(float), (float*)gpu_x, 1, presoln, 1);

		// Stop GPU:
		stopGPU();

		// Operation was successful?
		return (its < MAX_ITERS);
	}catch(...){
		return false;
	}
}//pccgSolver()

The important aspects of the CUDA File follow:

extern "C" void cudaMatVecMult(float *matrix, unsigned int size_matrix, uint2 *rowptr,  unsigned int size_rowptr, unsigned int *colind,unsigned int size_colind, float *x, float *b,unsigned int size_vec);

extern "C" void cudaVecVecMult(unsigned int size, float *x, float *y, float *r);

__device__ unsigned int compute_thread_index() {
	// Flatten the 2D grid of 2D blocks into a single linear index:
	// each block covers THREAD_BLOCK_SIZE^2 consecutive entries.
	return (blockIdx.x*THREAD_BLOCK_SIZE*THREAD_BLOCK_SIZE+
			blockIdx.y*THREAD_BLOCK_SIZE*THREAD_BLOCK_SIZE*gridDim.x+
			threadIdx.x+threadIdx.y*THREAD_BLOCK_SIZE);
}
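(dimGrid and dimBlock are set up elsewhere and not shown in this excerpt; a launch configuration consistent with compute_thread_index() would look something like the sketch below. The helper name is made up, and I'm assuming the 2D grid exists to stay under the 65535 limit on gridDim.x.)

// Launch-configuration sketch matching compute_thread_index(): each block
// covers THREAD_BLOCK_SIZE^2 consecutive indices, and blocks are spread
// over a 2D grid to stay under the 65535 limit on gridDim.x.
void computeLaunchDims(unsigned int n, dim3 &dimGrid, dim3 &dimBlock) {
	dimBlock = dim3(THREAD_BLOCK_SIZE, THREAD_BLOCK_SIZE);
	unsigned int blocks = (n + THREAD_BLOCK_SIZE*THREAD_BLOCK_SIZE - 1)
						/ (THREAD_BLOCK_SIZE*THREAD_BLOCK_SIZE);
	unsigned int gx = blocks, gy = 1;
	while (gx > 65535) {	// fold the excess into gridDim.y
		gx = (gx + 1) / 2;
		gy *= 2;
	}
	dimGrid = dim3(gx, gy);	// overshoot is caught by the kernels' bounds checks
}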

...

void cudaMatVecMult(float *matrix, unsigned int size_matrix, uint2 *rowptr,
					unsigned int size_rowptr, unsigned int *colind, unsigned int size_colind,
					float *x, float *b, unsigned int size_vec) {
	MatVecMultKernel<<<dimGrid, dimBlock>>>(matrix, size_matrix, rowptr, size_rowptr,
		colind, size_colind, x, b, size_vec);
}

void cudaVecVecMult(unsigned int size, float *x, float *y, float *r) {
	VecVecMultKernel<<<dimGrid, dimBlock>>>(size, x, y, r);
}

__global__ void MatVecMultKernel(float *matrix, unsigned int size_matrix, uint2 *rowptr,
								 unsigned int size_rowptr, unsigned int *colind,
								 unsigned int size_colind, float *x, float *b,
								 unsigned int size_vec) {
	const unsigned int index = compute_thread_index();
	if (index < size_vec) {
		// Row bounds: x = first entry of this row, y = one past the last
		uint2 rowptr_bounds = rowptr[index];
		float res = 0.0f;
		// For each entry of the row, multiply and accumulate
		for (unsigned int i = rowptr_bounds.x; i < rowptr_bounds.y; i++) {
			res += matrix[i]*x[colind[i]];
		}
		b[index] = res;
	}//IF-CONDITIONAL
}//MatVecMultKernel()

__global__ void VecVecMultKernel(unsigned int size, float *x, float *y, float *r) {
	// Elementwise product r[i] = x[i]*y[i]; used above to apply the
	// diagonal preconditioner
	const unsigned int index = compute_thread_index();
	if (index < size)
		r[index] = x[index]*y[index];
}//VecVecMultKernel()
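The cudaVecVecMult call is how the diagonal (Jacobi) preconditioner gets applied: gAdiagpre holds the reciprocals of A's diagonal entries, so M^-1 * r reduces to an elementwise product. Filling that vector could also be done on the GPU with something like the following (a sketch, not part of the CNC code shown above; it falls back to 1.0 if a row stores no diagonal entry):

// Sketch of building the inverse-diagonal preconditioner vector on the GPU
// from the same sparse layout (hypothetical; not part of the code above).
__global__ void BuildDiagPrecondKernel(float *matrix, uint2 *rowptr,
									   unsigned int *colind, float *diagpre,
									   unsigned int size_vec) {
	const unsigned int index = compute_thread_index();
	if (index < size_vec) {
		uint2 bounds = rowptr[index];
		float d = 1.0f;		// identity fallback if no diagonal entry found
		for (unsigned int i = bounds.x; i < bounds.y; i++) {
			if (colind[i] == index) { d = matrix[i]; break; }
		}
		diagpre[index] = 1.0f / d;
	}
}//BuildDiagPrecondKernel()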

I decided to push the entire solver onto the GPU. The code (from CNC) makes separate calls to different GLOBAL kernels (gpuMult and the cublas routines) - does this mean that each call to a different kernel incurs a new communication across the PCIe bus?

As I said, I want to keep as much of the simulation on the GPU as possible, minimizing reads/writes to the host.

The general simulation algorithm is (a rough sketch of the loop I have in mind follows the list):

  1. Calculate pressures
  2. Create diagonal preconditioner for the solver
  3. Save the previous iteration's solution
  4. Solve the system of equations Ax = b (already on the GPU?)
  5. Update the system
  6. Has convergence occurred? Yes - EXIT; no - go to step 1
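In other words, I would like the outer loop to look roughly like this, with everything resident in GPU memory. This is just a sketch - the kernel names, parameters, and the gpu_flag convergence flag are placeholders, not my actual code:

#include <cuda_runtime.h>

// Placeholder kernels standing in for steps 1, 2, and 5 of the outline above.
__global__ void calcPressuresKernel(float *p, unsigned int n)           { /* step 1 */ }
__global__ void buildDiagPrecondKernel(float *m, unsigned int n)        { /* step 2 */ }
__global__ void updateSystemKernel(float *x, int *flag, unsigned int n) { /* steps 5+6 */ }

// Device-resident outer loop: only one int convergence flag is copied back
// to the host per iteration; everything else stays in GPU memory.
void massConvergenceLoop(float *gpu_p, float *gpu_m, float *gpu_x,
						 float *gpu_xprev, int *gpu_flag, unsigned int n,
						 dim3 dimGrid, dim3 dimBlock, unsigned int max_iters)
{
	int converged = 0;
	for (unsigned int it = 0; it < max_iters && !converged; ++it) {
		calcPressuresKernel<<<dimGrid, dimBlock>>>(gpu_p, n);
		buildDiagPrecondKernel<<<dimGrid, dimBlock>>>(gpu_m, n);
		cudaMemcpy(gpu_xprev, gpu_x, n*sizeof(float),
				   cudaMemcpyDeviceToDevice);		// step 3 stays on the GPU
		// step 4: run pccgSolver() here, minus its final GPU->CPU copy
		updateSystemKernel<<<dimGrid, dimBlock>>>(gpu_x, gpu_flag, n);
		cudaMemcpy(&converged, gpu_flag, sizeof(int),
				   cudaMemcpyDeviceToHost);			// the only host readback
	}
}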

Any Ideas?

Thank you.

Ah, iterative solvers in CUDA are very close to my heart. If you're doing this on Vista, you picked the case that stresses the performance limitations of WDDM the most. Basically, we have to batch kernel calls to amortize the cost of launching any work on Vista (which is an order of magnitude higher than on XP or Linux), so checking the results after every iteration is not necessarily conducive to performance.

What I would suggest is this (a minimal sketch follows the list):

  1. Modify all of your kernels so that every block first checks a location in constant memory to determine whether the kernel should run at all.
  2. Perform all of the calculations needed to decide whether convergence has occurred on the GPU, storing a single go/no-go result in global memory somewhere.
  3. cudaMemcpyToSymbolAsync the result of the computation from global memory to constant memory.
  4. Launch steps 1-3 N times before copying the result of the computation in step 2 back to the CPU to see if N more iterations must be launched. You'll have to determine N based on the runtime of your kernels and how often you're willing to pay the copy and submission overhead.
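Here is a minimal sketch of the scheme, purely to show the flag plumbing; the kernel names and contents are placeholders, not your solver:

#include <cuda_runtime.h>

__constant__ int c_converged;	// read cheaply by every block (step 1)
__device__   int d_converged;	// go/no-go result in global memory (step 2)

__global__ void cgIterationKernel() {
	if (c_converged) return;	// step 1: bail out once converged
	// ... one CG iteration's worth of work would go here ...
}

__global__ void checkConvergenceKernel(const float *cur_err, float tol) {
	// step 2: the convergence test runs on the GPU; the result never
	// leaves global memory
	if (threadIdx.x == 0 && blockIdx.x == 0)
		d_converged = (*cur_err <= tol) ? 1 : 0;
}

// Step 4: launch a batch of N iterations, then do ONE readback to decide
// whether another batch is needed.
bool runBatch(const float *gpu_err, float tol, int N) {
	void *flag_ptr = 0;
	cudaGetSymbolAddress(&flag_ptr, d_converged);
	for (int i = 0; i < N; ++i) {
		cgIterationKernel<<<1, 256>>>();
		checkConvergenceKernel<<<1, 1>>>(gpu_err, tol);
		// step 3: refresh the constant-memory copy device-to-device,
		// with no host roundtrip
		cudaMemcpyToSymbolAsync(c_converged, flag_ptr, sizeof(int), 0,
								cudaMemcpyDeviceToDevice, 0);
	}
	int h_converged = 0;
	cudaMemcpyFromSymbol(&h_converged, d_converged, sizeof(int));
	return h_converged != 0;
}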

Basically, if you do this, you minimize the cost of roundtrips to the GPU and the overhead of GPU submission, and the latency between when you've reached convergence and when your batch of kernels actually stops should be fairly minimal (since the constant cache only has to be initialized once per SM per kernel, and every block on that SM after the first will hit the cache).

I can’t take credit for this idea, somebody recently suggested it to me. (I once wrote some… interesting code to solve this in another way, but we shouldn’t get into that)

Thank you for the reply.

I have another machine that is running Windows XP with CUDA 2.3 installed; would this change the algorithm much? Also, can you point me in the direction of papers or code that have done work with simulations on the GPU using CUDA? It would be interesting and helpful to see how others have tackled similar problems.

Once again, thanks.