Unexpected LDG operations

@njuffa maybe?

I have a simple section of code which fills a block local shared array:

    __shared__ uint32_t sh_sout[THREADS * (32 + 1)];
    uint32_t *sout = &sh_sout[threadIdx.x * (32 + 1)];

I then copy the contents to a large global array after XORing with a constant:

    __device__ uint8_t Sout[0x1000000];

    for (i = 0; i < 32; i++) {
        Sout[(tid * 32) + i] = sout[i] ^ x[8];
    }
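
Roughly, the surrounding kernel looks like this (a simplified sketch only, not the real code; THREADS, the declaration of x, and tid stand in for what the real kernel uses):

    #include <cstdint>

    #define THREADS 128                          // placeholder block size

    __device__ uint8_t Sout[0x1000000];

    __global__ void fill(const uint32_t *x)      // placeholder; real declaration of x not shown
    {
        __shared__ uint32_t sh_sout[THREADS * (32 + 1)];
        uint32_t *sout = &sh_sout[threadIdx.x * (32 + 1)];
        unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

        // ... the real code fills sout[0..31] here ...

        for (int i = 0; i < 32; i++) {
            Sout[(tid * 32) + i] = sout[i] ^ x[8];
        }
    }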

Checking the resulting SASS, I’m perplexed by the LDG operations that accompany each STG:

[SASS screenshot: the LDG / XOR / STG sequence repeated 32 times]

By my interpretation, the address of x[8] is loaded into R80, the LDG puts its contents into R70, that value is XORed with R4 (which I assume holds sout[0]), and the result is then stored via STG to [R52]. This pattern repeats for the remaining 31 results.

I’d have thought it would have been more efficient to load the contents of x[8] into a register once and then XOR against it repeatedly.
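
At the source level, the equivalent of what I expected would be hoisting the load manually, something like:

    uint32_t x8 = x[8];                      // read the constant once
    for (i = 0; i < 32; i++) {
        Sout[(tid * 32) + i] = sout[i] ^ x8; // reuse the register copy
    }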

But what do I know… :)

Thanks to anyone for enlightenment.

__device__ is used to define a device global object. The value will be loaded via LD or LDG.

__constant__ is used to define a device constant. The compiler can load the value via LDG, LDC, or a constant reference in many instructions.
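
For example (names and sizes here are just placeholders):

    __device__   uint32_t x_global[16];   // global memory: reads go through LD/LDG
    __constant__ uint32_t x_const[16];    // constant memory: reads can use LDC, or the
                                          // value is referenced from the constant bank
                                          // directly as an operand of another instruction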

Thanks Greg. As I mentioned at the end, I’d have thought it would have been more efficient to do just the first LDG, placing the constant value in a register, thereby removing the need for the other 31, as it would then just be a case of XORing against that register.

I’m sure the practical outcome is the same due to caching, but it just struck me as odd.

It’s hard to diagnose such things without code to compile. Are you

(1) Looking at the output of a release build with full optimization?
(2) Using __restrict__ for pointer arguments?

C++ optimizers generally follow an “as-if” rule: generated code must behave as if it were executing on the abstract machine defined by the language semantics. In order to “cache” data from a memory location in a register, the compiler needs to be sure that the data cannot be overwritten through another pointer, i.e. that there is no aliasing.
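
As a sketch (the kernel name here is a placeholder), that promise is made on the pointer parameter:

    // With const and __restrict__ on x, the compiler may assume that the stores
    // to Sout cannot modify x[8], so hoisting the load out of the loop
    // (one LDG, 32 STGs) is a legal transformation under the as-if rule.
    __global__ void my_kernel(const uint32_t * __restrict__ x)
    {
        // ... loop body as shown above ...
    }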

[1] Yes.
[2] No.

But the result of changing [2] to Yes has been very instructive. For this section:

Without __restrict__ there were 122 active registers and 3,163,112 instructions executed; with __restrict__ added, that dropped to 110 registers and 1,179,648 instructions. The generated code then followed almost exactly what I expected: 1 LDG and 32 STGs.

Unfortunately, for reasons I’ve not yet worked out, the performance with __restrict__ dropped by 24%.

Edit: I removed a comment on the L1 cache usage of another kernel with and without __restrict__, as it was based on an error (too many kernels open at once in Nsight Compute)… :(

Very helpful, thank you.