A basic question about cuobjdump: performance of different LDS bit-width

Hi All,

I would appreciate your help in understanding the efficiency of the LDS instruction (load from shared memory).

I have a piece of code that uses a shared-memory buffer, float A[64].

When the base address of A is 128-bit (16-byte) aligned, the instruction sequence obtained through cuobjdump --dump-sass is:

LDS.128 R8, [R12+0x100];

LDS.128 R4, [R12+0x110];

When I add a 32-bit (4-byte) offset to the base address of A, the instruction sequence obtained through cuobjdump --dump-sass becomes:

LDS.64 R10, [R8+0x108];

LDS R4, [R8+0x104];

LDS.128 R4, [R8+0x110];

LDS R4, [R8+0x120];

(Assume there is no bank conflict in either way).

I wish to understand whether using different LDS bit-widths for the same problem has a different impact on performance. In other words, I wish to confirm whether, when I change the offset of an array definition, I should consider the performance side effects.

Thank you very much for reading my question.


In particular, I have this concern when I add padding to an array. For example, when A[8][8] becomes A[8][9], similar differences appear in the LDS instructions. Is it possible that, after adding padding, the bank conflict is gone but the increase in the number of LDS instructions makes performance even worse?