Massive matrix calculation issue

Hello, and thank you for reading my issue.
I want to check something before I commit to it, because my code is very large and would be difficult to convert to a different design.

My question is simple. I wrote a program that calculates…

  1. “One calculation” consists of a series of matrix operations, including several matrix multiplications and matrix inversions.
  2. Each matrix has size (N/M) x (N/M), where N is very large (1,000 to 10,000) and M is small (up to 16).
  3. All matrices are stored in the global memory of the graphics card.
  4. Each thread computes one whole “one calculation”.
    But as you might expect, it is SO slow…

[font=“Courier New”]+--+
|  |-+   --> Each calculation --> Each thread calculates whole matrices
+--+ |-+     assigned to
  +--+ |     one thread
    +--+[/font]
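
To make the current scheme concrete, here is a minimal CUDA sketch of just one multiplication step. It is not my real code: the names `matmulPerThread`, `runPerThread`, and `batch` are hypothetical, and I am assuming square row-major float matrices of dimension n = N/M stored back to back in global memory. Error checking is left out to keep it short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: thread t computes the full product C[t] = A[t] * B[t].
// All matrices are n x n, row-major, stored back to back in global memory.
__global__ void matmulPerThread(const float* A, const float* B, float* C,
                                int n, int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch) return;
    const float* a = A + (size_t)t * n * n;
    const float* b = B + (size_t)t * n * n;
    float* c = C + (size_t)t * n * n;
    // One thread walks the entire matrix by itself.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical host helper: copies the inputs to the device, launches one
// thread per matrix product, and returns all results (no error checks).
std::vector<float> runPerThread(const std::vector<float>& hA,
                                const std::vector<float>& hB,
                                int n, int batch)
{
    size_t bytes = (size_t)batch * n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 64;  // assumed block size, not tuned
    int blocks = (batch + threads - 1) / threads;
    matmulPerThread<<<blocks, threads>>>(dA, dB, dC, n, batch);

    std::vector<float> hC((size_t)batch * n * n);
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return hC;
}
```

With N = 10,000 and M = 16 each matrix is 625 x 625, so in this scheme a single thread performs about 625^3 (roughly 2.4 x 10^8) multiply-adds for every product.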
So I thought I could try another approach, like the one below.

  1. Each small step (though perhaps I should call it “slow”?) within “one calculation” will be computed cooperatively by a set of threads.
  2. One calculation will run as a single kernel execution, so only one “one calculation” can run on one GPU at a time.
  3. The small steps will be synchronized so that the serial order of the matrix operations is preserved.
    I have not implemented this method yet.

[font=“Courier New”]
+------+     +--+ +--+
|      |     |  | |  |
|      |     +--+ +--+
|      | --> A matrix divided into --> Threads calculate each submatrix
+------+     +--+ +--+ submatrices
             |  | |  |
             +--+ +--+[/font]
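
Since this second approach is not implemented, the following is only a rough sketch of what I have in mind for one multiplication step, under several assumptions: square row-major float matrices of dimension n = N/M, a tile edge `TILE` of 16, and hypothetical names (`matmulTiled`, `runTiledChain`). Each block of threads computes one submatrix of the result, using `__syncthreads()` inside the block. For simplicity the two dependent steps here are ordered by issuing two launches on the same stream, which is a simplification of the single-kernel plan in item 2.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile edge; one thread per element of a tile

// Hypothetical sketch: each block computes one TILE x TILE submatrix of
// C = A * B for n x n row-major matrices, staging tiles in shared memory.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Elements outside the matrix are padded with zeros.
        As[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // both tiles are fully loaded before any thread reads
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all threads finish reading before the next load
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Hypothetical host helper: computes D = (A * B) * B as two dependent steps.
// Launches issued to the same stream run in issue order, so step 2 sees the
// finished result of step 1 (no error checks).
std::vector<float> runTiledChain(const std::vector<float>& hA,
                                 const std::vector<float>& hB, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC, *dD;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dD, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);
    dim3 blocks((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<blocks, threads>>>(dA, dB, dC, n);  // step 1: C = A * B
    matmulTiled<<<blocks, threads>>>(dC, dB, dD, n);  // step 2: D = C * B

    std::vector<float> hD((size_t)n * n);
    cudaMemcpy(hD.data(), dD, bytes, cudaMemcpyDeviceToHost);  // waits for both steps
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dD);
    return hD;
}
```

For a 625 x 625 matrix this launches a 40 x 40 grid of blocks, so many threads share the work of each step instead of one thread doing all of it.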

Could someone tell me which method is faster?
Thank you very much.