Array initialization - Bug or feature

Consider the following 4 variants to intialize an array with a constant size defined via a #define statement.

#define FAC   4

extern "C" __global__ void mykernel1 (void)

{

  // This is not handled well by the compiler

  double C[FAC*FAC] = {0.0};

}

extern "C" __global__ void mykernel2 (void)

{

  double C[FAC*FAC];

#pragma unroll

  for(int i=0; i< FAC*FAC; ++i)

    C[i] = 0.0;

}

extern "C" __global__ void mykernel3 (void)

{

  // Not generic

  double C[FAC*FAC] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0};

}

extern "C" __global__ void mykernel4 (void)

{

  // Missing one value.

  double C[FAC*FAC] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0};

}

The assembler output from

nvcc --maxrregcount 63 -arch sm_20 --ptxas-options="-v" --cubin bug.cu

cuobjdump --dump-sass bug.cubin > bug.ass

is

code for sm_20

		Function : mykernel1

	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/     /*0x00105d034800c002*/ 	IADD R1, R1, -0x80;

	/*0010*/     /*0x001fdca5c8000000*/ 	STL.64 [R1], RZ;

	/*0018*/     /*0x201fdca5c8000000*/ 	STL.64 [R1+0x8], RZ;

	/*0020*/     /*0x401fdca5c8000000*/ 	STL.64 [R1+0x10], RZ;

	/*0028*/     /*0x601fdca5c8000000*/ 	STL.64 [R1+0x18], RZ;

	/*0030*/     /*0x801fdca5c8000000*/ 	STL.64 [R1+0x20], RZ;

	/*0038*/     /*0xa01fdca5c8000000*/ 	STL.64 [R1+0x28], RZ;

	/*0040*/     /*0xc01fdca5c8000000*/ 	STL.64 [R1+0x30], RZ;

	/*0048*/     /*0xe01fdca5c8000000*/ 	STL.64 [R1+0x38], RZ;

	/*0050*/     /*0x001fdca5c8000001*/ 	STL.64 [R1+0x40], RZ;

	/*0058*/     /*0x201fdca5c8000001*/ 	STL.64 [R1+0x48], RZ;

	/*0060*/     /*0x401fdca5c8000001*/ 	STL.64 [R1+0x50], RZ;

	/*0068*/     /*0x601fdca5c8000001*/ 	STL.64 [R1+0x58], RZ;

	/*0070*/     /*0x801fdca5c8000001*/ 	STL.64 [R1+0x60], RZ;

	/*0078*/     /*0xa01fdca5c8000001*/ 	STL.64 [R1+0x68], RZ;

	/*0080*/     /*0xc01fdca5c8000001*/ 	STL.64 [R1+0x70], RZ;

	/*0088*/     /*0xe01fdca5c8000001*/ 	STL.64 [R1+0x78], RZ;

	/*0090*/     /*0x00001de780000000*/ 	EXIT;

		..........................

		Function : mykernel2

	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/     /*0x00001de780000000*/ 	EXIT;

		..........................

		Function : mykernel3

	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/     /*0x00001de780000000*/ 	EXIT;

		..........................

		Function : mykernel4

	/*0000*/     /*0x00005de428004404*/ 	MOV R1, c [0x1] [0x100];

	/*0008*/     /*0x00105d034800c002*/ 	IADD R1, R1, -0x80;

	/*0010*/     /*0x001fdca5c8000000*/ 	STL.64 [R1], RZ;

	/*0018*/     /*0x201fdca5c8000000*/ 	STL.64 [R1+0x8], RZ;

	/*0020*/     /*0x401fdca5c8000000*/ 	STL.64 [R1+0x10], RZ;

	/*0028*/     /*0x601fdca5c8000000*/ 	STL.64 [R1+0x18], RZ;

	/*0030*/     /*0x801fdca5c8000000*/ 	STL.64 [R1+0x20], RZ;

	/*0038*/     /*0xa01fdca5c8000000*/ 	STL.64 [R1+0x28], RZ;

	/*0040*/     /*0xc01fdca5c8000000*/ 	STL.64 [R1+0x30], RZ;

	/*0048*/     /*0xe01fdca5c8000000*/ 	STL.64 [R1+0x38], RZ;

	/*0050*/     /*0x001fdca5c8000001*/ 	STL.64 [R1+0x40], RZ;

	/*0058*/     /*0x201fdca5c8000001*/ 	STL.64 [R1+0x48], RZ;

	/*0060*/     /*0x401fdca5c8000001*/ 	STL.64 [R1+0x50], RZ;

	/*0068*/     /*0x601fdca5c8000001*/ 	STL.64 [R1+0x58], RZ;

	/*0070*/     /*0x801fdca5c8000001*/ 	STL.64 [R1+0x60], RZ;

	/*0078*/     /*0xa01fdca5c8000001*/ 	STL.64 [R1+0x68], RZ;

	/*0080*/     /*0xc01fdca5c8000001*/ 	STL.64 [R1+0x70], RZ;

	/*0088*/     /*0xe01fdca5c8000001*/ 	STL.64 [R1+0x78], RZ;

	/*0090*/     /*0x00001de780000000*/ 	EXIT;

		..........................

The needless usage of local memory kills the performance of my kernel completely.

Is this a known feature or bug? Thanks.

I think a similar question was raised in

local memory array initialization

BUG: Array initialization does not conform to C standard

some time ago.

What do you expect to happen? You’ve asked for an array in local memory and you get one. Depending on the exact code of the individual kernels, the compiler is able to deduce in some cases that the array is not used at all.
Are you asking why the compiler doesn’t optimize away the array in all cases?

Mhmm. I was under the impression that the array would be placed in registers. At least that was what I intended to.
I simplified a rather huge problem that in the end led to register spills due to the initialization.

Registers cannot be indexed. So the compiler will only place the array in registers if all accesses use fixed indexes. Loop unrolling can help to achieve this.

Thanks Tera,

this was the piece of information I was missing. Now I understand the Programming Guide. :-)

So the loop unrolling saved me as it lead to constant indices (known at compile time).

So what did you solve the problem ? Did your original code already solve the problem ? There was a pragma roll directive there ?

Hi Skybuck,

I have a routine where data was accessed with a fixed tile size. I tried to make

it a constant parameter to change it more easily and did not want to

change the initialization of the register arrays all the time.

So I figured changing the initialization from

double C[16] = {0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0 };

...

to

#define FAC  4

...

 double C[FAC*FAC] = {0};

...

what should have worked. BUT now the array was not placed in registers due to the initialization.

Once I changed this to

#define FAC 4

  double C[FAC*FAC];

#pragma unroll

  for(int i=0; i< FAC*FAC; ++i)

   C[i] = 0.0;

...

everything was working normal again.

Ok, so you posted four kernels with their cuobjdump disassemblies.

Which of the kernels was correct ? Or where they all wrong ?

If they all wrong could you post a correct new one ? ;) =D

None of the above kernels make any sense. BUT my point was that changing

the valid initialization of the array from

to the equaly valid C-construct

killed the performance of my kernel because the array was now placed in local memory instead of

being placed in registers.

Hmm… I don’t quite understand the disassembly… what part of it shows the local ?

The STL is a store to local memory opcode.

Ok, kernel 2 and kernel 3 seem to be a bit weird… they do not have the “STL” instructions ? Why is that ? Where those illiminated by the compiler ? Only a single register seems to be initialized ? while your array seems to be of at least size 16 (4x4)…