Thanks to both of you for the answers :)

Peter, things are pretty much as you said. A differential operator was discretized to yield a sparse matrix, and multiplication by this sparse matrix was implemented as a Cg shader. Indeed, the matrix operates only locally on a vector.
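To make the "operates locally" part concrete, here is a CPU sketch of what such a matrix-vector product amounts to. I'm assuming a 1D Laplacian stencil with zero Dirichlet boundaries purely for illustration; the actual operator, dimensionality, and boundary handling are problem-specific and not from the original post.

```c
#include <stddef.h>

/* CPU sketch: applying a discretized differential operator as a local
 * stencil. Assumed example: 1D Laplacian, grid spacing h, values outside
 * the domain treated as 0 (Dirichlet). Each output entry touches only
 * its immediate neighbors, so the "matrix" is never stored explicitly. */
void apply_laplacian_1d(const float *x, float *y, size_t n, float h)
{
    float inv_h2 = 1.0f / (h * h);
    for (size_t i = 0; i < n; ++i) {
        float left  = (i > 0)     ? x[i - 1] : 0.0f; /* boundary: 0 */
        float right = (i < n - 1) ? x[i + 1] : 0.0f;
        y[i] = (left - 2.0f * x[i] + right) * inv_h2;
    }
}
```

On the GPU, each thread (or fragment, in the Cg version) would compute one `y[i]` independently, which is why this kind of operator maps so well to a shader.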

The iterative algorithm is actually a conjugate gradient, but I can't write it as a single kernel call, for two reasons. First, as far as I know a kernel call may run for at most 5 seconds, which is way too short for the kind of matrices we have. Second, the stopping criterion requires computing the length of a vector, and summing over all its components has to be done in 64-bit precision.

I also have another problem with stuffing too many operations into a single CUDA kernel: if the computation is too complicated, the kernel needs many registers per thread, and occupancy becomes very low simply because the 8192 registers per multiprocessor are exhausted by only a few threads. I asked about this issue here: http://forums.nvidia.com/index.php?showtopic=31771 but so far there have been no answers :( Am I misunderstanding something here? Because of this (maybe nonexistent?) problem I tend to do only very basic things in each kernel call and prefer a series of calls to one big complicated kernel.
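The arithmetic behind this worry can be sketched in a few lines. The numbers below assume a G80-era multiprocessor (8192 registers, up to 768 resident threads); whichever limit is hit first caps the number of threads that can be resident at once:

```c
/* Back-of-the-envelope check of the register limit discussed above.
 * Assumed G80-era figures: 8192 registers per multiprocessor, at most
 * 768 resident threads. With a register-hungry kernel, the register
 * file, not the thread limit, determines how many threads fit. */
unsigned resident_threads(unsigned regs_per_thread)
{
    const unsigned reg_file    = 8192; /* registers per multiprocessor */
    const unsigned max_threads = 768;  /* hardware thread limit */
    unsigned by_regs = reg_file / regs_per_thread;
    return by_regs < max_threads ? by_regs : max_threads;
}
```

For example, at 10 registers per thread the full 768 threads fit, but a complicated kernel needing 40 registers per thread leaves room for only 204 resident threads, i.e. occupancy around 27%, which matches the effect described above.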

I’m just beginning with CUDA, but so far the OpenGL + Cg code runs something like 4-5 times faster than my best CUDA implementation.