I have implemented two versions of the same algorithm:
- Version 1 works on a 1-dimensional array arr[x*y]:
for (a = 0; a < rows; a++) for (i = 0; i < cols; i++) arr[i * cols + a] = 0;
- Version 2 works on a 2-dimensional array arr[x][y]:
for (a = 0; a < rows; a++) for (i = 0; i < cols; i++) arr[a][i] = 0;
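To make the comparison concrete, here is a minimal, self-contained C sketch of the two loops as posted. It rests on assumptions that are not in the post: the element type is int, the array is square (4 x 4), and arr[a][i] is an ordinary row-major C array. The names offset_v1, offset_v2, clear_flat and clear_2d are hypothetical helpers introduced only for illustration. With the index formulas as written, the inner loop of version 1 advances by cols elements per step, while the inner loop of version 2 advances by one element. Whether that accounts for the measured difference is not established here.

```c
#include <assert.h>

#define ROWS 4
#define COLS 4

/* Flat offset touched by version 1 for loop indices (a, i): formula as posted. */
static int offset_v1(int a, int i, int cols) { return i * cols + a; }

/* Flat offset touched by version 2, assuming arr[a][i] is a row-major C array. */
static int offset_v2(int a, int i, int cols) { return a * cols + i; }

/* Version 1: 1-dimensional array, indexed as in the post. */
static void clear_flat(int *arr, int rows, int cols)
{
    /* Assumption: with i * cols + a, every element is visited exactly once
       only when the array is square. */
    assert(rows == cols);
    for (int a = 0; a < rows; a++)
        for (int i = 0; i < cols; i++)
            arr[offset_v1(a, i, cols)] = 0;
}

/* Version 2: 2-dimensional array, indexed as in the post. */
static void clear_2d(int arr[][COLS], int rows)
{
    for (int a = 0; a < rows; a++)
        for (int i = 0; i < COLS; i++)
            arr[a][i] = 0;
}
```

For example, with cols = 4, offset_v1(0, 1, 4) is 4 and offset_v2(0, 1, 4) is 1, so consecutive inner-loop iterations touch addresses that are cols elements apart in version 1 and adjacent in version 2.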
According to the Nsight Analyser, version 2 runs 30% faster than version 1.
I looked at the PTX code, assuming it would show where the performance gain of version 2 comes from.
Surprisingly, the two PTX files differ only in a few very minor points (unsigned here, signed there…).
Does anybody have a hint where to look for the reason for this behavior?