Different Performance but same PTX Code

Hi there,

I have implemented two versions of the same algorithm:

Version 1:

  • works on a 1-dim array arr[x*y]
for (a = 0; a < rows; a++)
    for (i = 0; i < cols; i++)
        arr[i * cols + a] = 0;

Version 2:

  • works on a 2-dim array arr[y][x]
for (a = 0; a < rows; a++)
    for (i = 0; i < cols; i++)
        arr[a][i] = 0;

The Nsight Analyser reveals that version 2 performs 30% faster than version 1.

I looked into the PTX code, as I assumed it would show me where the performance boost of version 2 comes from.

Surprisingly, the two PTX files differ only in some very minor points (unsigned here, signed there…).

Does anybody have a hint where to look for the reason for this behavior?



Where is arr stored?

The equivalent to version 2 would be this version 3:

for (a = 0; a < rows; a++)
    for (i = 0; i < cols; i++)
        arr[a * cols + i] = 0;
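
To make the difference between the two flat indexings concrete, here is a minimal sketch in plain C. The helper names (index_v1, index_v3, count_mismatches, max_index_v1) are made up for illustration and are not part of the original code; the sketch only assumes that the 2-dim array is declared as arr[rows][cols], so that arr[a][i] corresponds to the row-major offset a * cols + i.

```c
#include <stddef.h>

/* Flat index as written in version 1: the inner loop variable i is scaled by cols. */
size_t index_v1(size_t a, size_t i, size_t cols)
{
    return i * cols + a;
}

/* Row-major flat index as in version 3: matches arr[a][i] for arr[rows][cols]. */
size_t index_v3(size_t a, size_t i, size_t cols)
{
    return a * cols + i;
}

/* Counts the (a, i) pairs for which version 1 touches a different
 * element than version 3 while running the same two loops. */
size_t count_mismatches(size_t rows, size_t cols)
{
    size_t n = 0;
    for (size_t a = 0; a < rows; a++)
        for (size_t i = 0; i < cols; i++)
            if (index_v1(a, i, cols) != index_v3(a, i, cols))
                n++;
    return n;
}

/* Largest flat index version 1 produces (assumes rows >= 1 and cols >= 1).
 * Compare it with rows * cols - 1, the last valid element. */
size_t max_index_v1(size_t rows, size_t cols)
{
    return (cols - 1) * cols + (rows - 1);
}
```

For example, with rows = 4 and cols = 8, max_index_v1 returns 59 although the array only has 32 elements, so version 1 would write past the end. With rows = 8 and cols = 4 it stays in bounds but only reaches indices 0 to 19, so part of the array is never zeroed. Either way the two versions do different work, so their timings are not directly comparable.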

Hi there,

sorry for the late reply.

arr is stored in shared memory.

Good point! ;-)
The code shown here is only an abstraction of my real code. I’m currently checking whether I made the same mistake in my real code…

Best regards and thanks for your help,

Hi sietsch,

version 3 from tera is what your version 1 should look like in order to be equivalent to your version 2. The indexing in your first code snippet is wrong. Please run your test again once you have corrected it.

Are the variables rows and cols static variables or #defines, or can they change during execution (across multiple calls of that loop)?


Hi there,

yes, I found the error pointed out by tera and mneubert in my real code as well.
I fixed it, and now both versions run at the same speed.

Thanks for your help!!!