Hello, thank you for reading my issue.
I want to check something, but my code is very large, which makes it difficult to convert to the other approach.
My issue is very simple. I made a program that calculates…
- “One calculation” consists of a number of serial matrix calculations, including several matrix multiplications and matrix inversions.
- Each matrix has size (N/M) x (N/M), where N is very large (1,000 to 10,000) and M is very small (up to 16).
- All matrices are stored in the global memory of the graphics card.
- Each thread performs “one calculation”.
But as you might expect, it is SO slow…
[Diagram: each calculation --> each thread calculates a whole matrix]
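To make the first approach concrete, here is a minimal sketch of "one thread per calculation". It is not my real code: the kernel name `one_thread_per_calculation`, the host helper `run_one_thread_per_calculation`, and the flat row-major layout are all assumptions for illustration, and the whole chain of multiplications and inversions is reduced to a single product C = A * B per thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: each thread owns one whole n x n product C = A * B.
// Matrices of calculation `calc` start at offset calc * n * n in global memory.
__global__ void one_thread_per_calculation(const float* A, const float* B,
                                           float* C, int n, int num_calcs)
{
    int calc = blockIdx.x * blockDim.x + threadIdx.x;
    if (calc >= num_calcs) return;

    const float* a = A + (size_t)calc * n * n;
    const float* b = B + (size_t)calc * n * n;
    float*       c = C + (size_t)calc * n * n;

    // A single thread walks the full triple loop: O(n^3) serial work per thread.
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

// Hypothetical host helper: copies inputs to the GPU, launches, copies back.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_one_thread_per_calculation(const std::vector<float>& A,
                                                  const std::vector<float>& B,
                                                  int n, int num_calcs)
{
    size_t bytes = A.size() * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (num_calcs + threads - 1) / threads;
    one_thread_per_calculation<<<blocks, threads>>>(dA, dB, dC, n, num_calcs);

    std::vector<float> C(A.size());
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

In this layout, neighbouring threads read addresses that are n * n elements apart, so global memory accesses are not coalesced, and the GPU is only kept busy if there are many thousands of independent calculations. I suspect both points contribute to the slowness.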
So I thought I could try another approach, as described below…
- Each small calculation (though perhaps I should call it “slow”?) within “one calculation” will be computed by a set of threads.
- One calculation will be executed in a single kernel execution, so only one “one calculation” can run on one GPU at a time.
- The small calculations will be synchronized so that the serial order of the matrix calculations is preserved.
This method is not implemented yet.
[Diagram: a matrix divided into submatrices --> threads calculate each submatrix]
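A rough sketch of what the second approach might look like, assuming a chained product D = (A * B) * C stands in for the serial matrix calculations. The names `tiled_matmul`, `run_chained_product`, and the tile size `TILE` are made up for illustration. Note one deliberate difference from the plan above: instead of a single kernel execution, the sketch uses one kernel launch per step, because launches on the same stream run in order and so act as the synchronization point between steps. Inside a block, `__syncthreads()` synchronizes the threads that share a tile.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 16  // assumed tile (submatrix) edge; each block has TILE x TILE threads

// Each block computes one TILE x TILE submatrix of C = A * B (row-major, n x n).
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Stage one tile of A and one tile of B in shared memory (zero-padded).
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < n && col < n) C[row * n + col] = sum;
}

// Hypothetical host helper computing D = (A * B) * C with two ordered launches.
// Error checking is omitted to keep the sketch short.
std::vector<float> run_chained_product(const std::vector<float>& A,
                                       const std::vector<float>& B,
                                       const std::vector<float>& C, int n)
{
    size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC, *dTmp, *dD;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMalloc(&dTmp, bytes);
    cudaMalloc(&dD, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiled_matmul<<<grid, block>>>(dA, dB, dTmp, n);  // step 1: Tmp = A * B
    tiled_matmul<<<grid, block>>>(dTmp, dC, dD, n);  // step 2: D = Tmp * C (runs after step 1)

    std::vector<float> D((size_t)n * n);
    cudaMemcpy(D.data(), dD, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cudaFree(dTmp);
    cudaFree(dD);
    return D;
}
```

Here the whole GPU works on one calculation at a time, and threads in a block read neighbouring addresses, which is the main difference from the one-thread-per-calculation layout. Matrix inversion is left out of the sketch, since it needs its own parallel algorithm.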
Could someone tell me which method is faster?
Thank you very much.