Hello, thank you for reading my issue.

I want to check something before I change anything, because my code is very large and that makes it difficult to convert to the other approach.

My issue is very simple. I made a program that calculates…

- “One calculation” consists of a number of serial matrix operations, including several matrix multiplications and matrix inversions.
- Each matrix has size (N/M) x (N/M), where N is very large (1,000 to 10,000) and M is very small (up to 16).
- All matrices are stored in the global memory of the graphics card.
- Each thread calculates one whole “one calculation”.
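
As a rough illustration of this layout (not the actual code), a one-thread-per-calculation kernel could look like the sketch below. The names `perThreadMultiply` and `runPerThread` are hypothetical, `n` stands for N/M, and only a single multiplication is shown in place of the full chain of multiplications and inversions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch: thread t owns the t-th n x n problem and does all of
// its work serially. A, B and C each hold `count` row-major n x n matrices
// stored back to back in global memory.
__global__ void perThreadMultiply(const float* A, const float* B, float* C,
                                  int n, int count)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= count) return;

    const float* a = A + (size_t)t * n * n;
    const float* b = B + (size_t)t * n * n;
    float*       c = C + (size_t)t * n * n;

    // The whole matrix product is computed by this one thread.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical host helper: copies the inputs to the GPU, launches one
// thread per calculation, and copies the results back. Error checking is
// omitted to keep the sketch short.
void runPerThread(const float* hA, const float* hB, float* hC,
                  int n, int count)
{
    size_t bytes = (size_t)count * n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks  = (count + threads - 1) / threads;
    perThreadMultiply<<<blocks, threads>>>(dA, dB, dC, n, count);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

In this layout, neighbouring threads read addresses that are n*n elements apart, so their global memory accesses are not coalesced.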

But as you can expect, it is SO slow…

[font=“Courier New”]+-+
| |-+    --> Each calculation --> Each thread calculates whole matrix
+-+ |-+      divided into
  +-+ |      each thread
    +-+[/font]

So I thought I could try another approach like below…

- Each small (or should I call it “slow”?) calculation within “one calculation” will be computed by a set of threads.
- Each “one calculation” will be executed in a single kernel launch, so only one “one calculation” can run on one GPU at a time.
- The small calculations will be synchronized so that the serial order of the matrix operations is preserved.
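
A minimal sketch of this idea, under simplifying assumptions: a single thread block works on one chain of two multiplications (T = A * B, then C = T * A), with the threads splitting the output elements between them and a barrier between the two steps. The names `blockMatMul`, `cooperativeChain` and `runChain` are hypothetical, `n` stands for N/M, and the inversions are left out.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: all threads of the block share the work of Z = X * Y
// for row-major n x n matrices; each thread strides over output elements.
__device__ void blockMatMul(const float* X, const float* Y, float* Z, int n)
{
    for (int idx = threadIdx.x; idx < n * n; idx += blockDim.x) {
        int i = idx / n;
        int j = idx % n;
        float sum = 0.0f;
        for (int k = 0; k < n; ++k)
            sum += X[i * n + k] * Y[k * n + j];
        Z[idx] = sum;
    }
}

// One block computes one serial chain: T = A * B, then C = T * A.
__global__ void cooperativeChain(const float* A, const float* B,
                                 float* T, float* C, int n)
{
    blockMatMul(A, B, T, n);
    // Every element of T must be written before any thread reads it.
    __syncthreads();
    blockMatMul(T, A, C, n);
}

// Hypothetical host helper: runs one chain with a single block of 256
// threads. Error checking is omitted to keep the sketch short.
void runChain(const float* hA, const float* hB, float* hC, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dT, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dT, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cooperativeChain<<<1, 256>>>(dA, dB, dT, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dT);
    cudaFree(dC);
}
```

Note that `__syncthreads()` only synchronizes the threads of one block. Spreading a single matrix over several blocks would need a grid-wide barrier instead, for example one kernel launch per step.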

This method is not implemented yet.

[font=“Courier New”]
+-----+      +-+ +-+
|     |      | | | |
|     |      +-+ +-+
|     | --> A matrix divided into --> Threads calculate each submatrix
+-----+      +-+ +-+ submatrices
             | | | |
             +-+ +-+[/font]

Could someone tell me which method is faster?

Thank you very much.