I have a question about why my kernel is slower at certain problem sizes. For instance, if I run my kernel at size 512x512 and then at 768x768, the first one is slower than the second. It is always the same when the problem size is (n*512)x(n*512). The timing graph is linear, with anomalies at the mentioned sizes. The anomalies are not huge, but they are not statistical errors, and I am wondering why this happens.
The kernel only reads a few values, multiplies them, and saves one value back to global memory.
Can anyone help me understand why this happens? Thanks.