How to improve the performance of loops

My algorithm has run into a difficult performance problem. It contains three loops: a while loop and two for loops. The relevant code is as follows. I have two questions:
1. Why does it perform poorly when it runs inside three loops?
2. How do I resolve this issue?
Thank you for your answers.

A possible way to achieve a speedup is to fully unroll the loops. For this, however, your matrix size has to be known at compile time. One way to get a full unroll is to use a template function with the matrix size N as a template argument. This assumes, however, that you are not working with very large matrix sizes, since fully unrolling long loops inflates the generated code.
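As a minimal sketch of the template idea, assuming a row-major square matrix stored in a flat float array: the names `matmul_fixed` and `HOST_DEVICE` are hypothetical and not taken from the original code. The macro only lets the same function build with a host compiler or with nvcc.

```cpp
// Expands to CUDA qualifiers under nvcc and to nothing under a host compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical helper: C = A * B for row-major N x N matrices.
// N is a template argument, so every trip count below is a compile-time
// constant and the compiler is free to unroll all three loops.
// Under nvcc, a "#pragma unroll" line could additionally be placed
// directly before each for loop to request the unroll explicitly.
template <int N>
HOST_DEVICE inline void matmul_fixed(const float* a, const float* b, float* c) {
    static_assert(N > 0, "matrix size must be positive");
    for (int row = 0; row < N; ++row) {
        for (int col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (int k = 0; k < N; ++k) {
                acc += a[row * N + k] * b[k * N + col];
            }
            c[row * N + col] = acc;
        }
    }
}
```

A call such as `matmul_fixed<4>(a, b, c)` instantiates a version specialized for 4 x 4 matrices. Whether the unrolled version is actually faster depends on register pressure and code size, so it is worth measuring for your sizes.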

Earlier versions of CUDA did not accept a template argument as input to the #pragma unroll preprocessor directive, but beginning with CUDA 8 it is possible. See here:

Then, for the matrix sizes you typically expect, you could jump into the specialized implementation for that size, using e.g. a switch/case block. For uncommon sizes you could fall back to the generic but slower version.
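The dispatch could look roughly like the following sketch. It uses a matrix-vector product as a stand-in operation, and the specialized sizes (2, 3, 4, 8) are picked arbitrarily. All function names and the `HOST_DEVICE` macro are hypothetical.

```cpp
// Expands to CUDA qualifiers under nvcc and to nothing under a host compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Generic fallback: n is only known at run time, so the loops stay rolled.
HOST_DEVICE inline void mat_vec_generic(const float* a, const float* x,
                                        float* y, int n) {
    for (int row = 0; row < n; ++row) {
        float acc = 0.0f;
        for (int col = 0; col < n; ++col) {
            acc += a[row * n + col] * x[col];
        }
        y[row] = acc;
    }
}

// Specialized version: N is a compile-time constant, so the compiler can
// fully unroll both loops.
template <int N>
HOST_DEVICE inline void mat_vec_fixed(const float* a, const float* x, float* y) {
    for (int row = 0; row < N; ++row) {
        float acc = 0.0f;
        for (int col = 0; col < N; ++col) {
            acc += a[row * N + col] * x[col];
        }
        y[row] = acc;
    }
}

// Dispatcher: common sizes go to a specialized instantiation,
// everything else falls through to the generic version.
HOST_DEVICE inline void mat_vec(const float* a, const float* x, float* y, int n) {
    switch (n) {
        case 2: mat_vec_fixed<2>(a, x, y); break;
        case 3: mat_vec_fixed<3>(a, x, y); break;
        case 4: mat_vec_fixed<4>(a, x, y); break;
        case 8: mat_vec_fixed<8>(a, x, y); break;
        default: mat_vec_generic(a, x, y, n); break;
    }
}
```

Each case adds one more instantiation to the binary, so the list is best kept to the sizes that really occur often.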

Also run your code through a profiler to discover bad memory access patterns, such as shared memory bank conflicts and uncoalesced memory accesses.