Parallelization in block or thread?

Suppose I have n tasks, for example, solving n linear equations Ax=b. The size of A is m*m。
Should I create n blocks, in which there are m threads to perform Gauss elimination? Or n threads, each thread for each linear equation?
I think the answer should be problem dependent. However, is there any rule of thumb to follow, so that I don’t need to develop two versions of code to make comparison?

It is as you say, problem dependent.
Recently I had to answer this same question, which version is better, and needed to code 2 versions of a solution that involved 1 block with its threads running some 128 or 256 iterations each, and another version with as many blocks as there were iterations, with each thread running only its own unique iteration. It turned out to be the fastest version by quite a margin over the first.

Though this is far from trivial to guess which version is the best solution for a problem without coding it, there are a few assumptions that can be made with this programming model:

  • Implementations that have thread divergence (not all threads do the same work) impact performance, but it is sometimes unavoidable. Ex: the last line of a parallel reduction; a stencil thread checking if it is within the radius and within the array boundaries.
  • Code that makes threads go through deep iterations will certainly perform worse than if you spread the work over multiple blocks and each thread doing a smaller piece of work, like the example I gave. Again, there might be cases where it is unavoidable.
  • The list can be very extensive.

If n is large enough, that is n is approximately 2*# of SMs in your GPU, or larger, I would try the first method (solving one GJE elimination/matrix per block). The reason for this is that the GJE algorithms I have looked at all require global (i.e. matrix-wide) sync. This is easier to implement at the block level using __syncthreads() than at the grid level.

Thank you all. The n is around 6000. And inside each task I need to run Gauss Elimination several times. The algorithm for each task is very complicated. I think it is easier for me to create 6000 threads to run 6000 tasks. This version takes me less time to implement,because it is closer to the CPU version. Maybe it’s faster too.

It will be slower than a block-optimized version. But try it. It’s a good learning experience, and as you say, its easier to implement.