I am seeing that the execution time of my code increases as I increase the dimension of the problem, but in a strange way. For example, for 10k the time is 0.8 s, for 20k it is 0.1 s, and for 30k it is 0.14 s, but for 40k it is about 3 times more, 0.45 s. After that, the time increases only slightly: for 50k it is 0.47 s, and 60k shows the same behavior.
To explain this behavior, should I check the usage of shared memory, the L1 cache, or the L2 cache? Or should I think about the number of CUDA cores that I need?
How can I detect the reason?