Hello, thank you for reading my issue.
I want to check something before I commit to it, because my code is very large and converting it to the other approach would be difficult.
My issue is very simple. I made a program that calculates…
- “One calculation” consists of a number of serial matrix operations, including several matrix multiplications and matrix inversions.
- Each matrix has size (N/M) x (N/M), where N is very large (1,000 to 10,000) and M is very small (up to 16).
- All matrices are stored in the global memory of the graphics card.
- Each thread computes one entire “one calculation”.
But as you can expect, it is SO slow…
[font=“Courier New”]+--+
|  |-+    → Each calculation → Each thread calculates a whole matrix
+--+ |-+    assigned to
  +--+ |    one thread
    +--+[/font]
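A minimal sketch of this layout, not the actual code. It assumes a single product C = A * B stands in for the full chain of multiplications and inversions, and that the matrices of all calculations are packed back to back in global memory. The names `one_thread_per_calculation` and `run_one_thread_per_calculation` are hypothetical.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Current approach (sketch): thread t computes the whole n x n product
// C_t = A_t * B_t by itself. The real chain of multiplications and
// inversions is replaced by this single product for illustration.
__global__ void one_thread_per_calculation(const float* A, const float* B,
                                           float* C, int n, int batch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= batch) return;                     // guard threads past the batch
    const float* a = A + (size_t)t * n * n;     // this thread's own matrices
    const float* b = B + (size_t)t * n * n;
    float* c = C + (size_t)t * n * n;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical host helper: fills `batch` pairs of n x n matrices with 1.0f,
// runs the kernel, and returns element (0,0) of the first result.
// Error checking is omitted for brevity.
float run_one_thread_per_calculation(int n, int batch)
{
    size_t count = (size_t)batch * n * n;
    size_t bytes = count * sizeof(float);
    std::vector<float> host(count, 1.0f);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, host.data(), bytes, cudaMemcpyHostToDevice);
    int threads = 64;
    int blocks = (batch + threads - 1) / threads;
    one_thread_per_calculation<<<blocks, threads>>>(dA, dB, dC, n, batch);
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(host.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return host[0];
}
```

With all-ones inputs every element of the result should equal n, so `run_one_thread_per_calculation(8, 4)` should return 8. Note that neighboring threads read addresses n*n floats apart in this layout, so their global memory accesses cannot be coalesced, which is one likely contributor to the slowness.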
So I thought I could try another approach like below…
- Each small calculation (though maybe I should call it “slow”?) within “one calculation” will be computed by a set of threads.
- Each “one calculation” will be executed in a single kernel launch, so only one “one calculation” can run on a GPU at a time.
- The small calculations will be synchronized with each other so that the serial order of the matrix calculations is preserved.
This method is not implemented yet.
[font=“Courier New”]
+------+    +--+ +--+
|      |    |  | |  |
|      |    +--+ +--+
|      | → A matrix divided into → Threads calculate each submatrix
+------+    +--+ +--+ submatrices
            |  | |  |
            +--+ +--+[/font]
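A minimal sketch of the proposed layout, under some simplifying assumptions: a two-step chain D = (A * B) * C stands in for the real sequence, inversions are left out, each thread handles individual output elements rather than submatrices, and the kernel is launched with a single thread block so that `__syncthreads()` is enough to order the steps. `__syncthreads()` only synchronizes threads within one block, so spreading one calculation over several blocks would need either one kernel launch per step or a grid-wide barrier from cooperative groups. The names `cooperative_chain` and `run_cooperative_chain` are hypothetical.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Proposed approach (sketch): all threads of ONE block share a single
// chained calculation D = (A * B) * C on n x n matrices in global memory.
// T is scratch space for the intermediate product A * B.
__global__ void cooperative_chain(const float* A, const float* B,
                                  const float* C, float* T, float* D, int n)
{
    // Step 1: T = A * B. Each thread strides over the output elements.
    for (int idx = threadIdx.x; idx < n * n; idx += blockDim.x) {
        int i = idx / n, j = idx % n;
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += A[i * n + k] * B[k * n + j];
        T[idx] = sum;
    }
    // Every element of T must be written before any thread reads it.
    __syncthreads();
    // Step 2: D = T * C, using the same element-per-thread split.
    for (int idx = threadIdx.x; idx < n * n; idx += blockDim.x) {
        int i = idx / n, j = idx % n;
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += T[i * n + k] * C[k * n + j];
        D[idx] = sum;
    }
}

// Hypothetical host helper: runs the chain on all-ones n x n matrices and
// returns element (0,0) of D. Error checking is omitted for brevity.
float run_cooperative_chain(int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> host((size_t)n * n, 1.0f);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    float *dT = nullptr, *dD = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dT, bytes);
    cudaMalloc(&dD, bytes);
    cudaMemcpy(dA, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, host.data(), bytes, cudaMemcpyHostToDevice);
    cooperative_chain<<<1, 256>>>(dA, dB, dC, dT, dD, n);  // one block only
    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(host.data(), dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dT);
    cudaFree(dD);
    return host[0];
}
```

With all-ones inputs, every element of A * B equals n and every element of D equals n * n, so `run_cooperative_chain(8)` should return 64. In this split, neighboring threads write neighboring elements of T and D, which is friendlier to global memory than giving each thread its own matrix.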
Could someone tell me which method is faster?
Thank you very much.