Optimization question

Hi,

I’m having trouble understanding why the first code section uses 16 extra registers and takes twice as long to run as the second code section.

Of course I need the first code section to work and not the second one :)

qq = 0;

for ( int i = 0; i < 5; i++ )
{
  // The qq thing causes extra 16 registers for some reason and doubles the time the kernel runs !!!!!!
  sumCell = fVal1 * smWeights[qq++] + fVal2 * smWeights[qq++] + fVal3 * smWeights[qq++] + fVal4 * smWeights[qq++];
  smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;
}

for ( int i = 0; i < 5; i++ )
{
  sumCell = fVal1 * smWeights[0] + fVal2 * smWeights[1] + fVal3 * smWeights[2] + fVal4 * smWeights[3];
  smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;
}

It might be a good idea to break it down into small pieces to figure out where the register allocation is coming from. Start out with something like this and comment out pieces of code. Also, I don’t think it is a good idea to do qq++ multiple times in the same statement, because the order in which the increments happen is not guaranteed (in C and C++, modifying the same variable more than once in a single expression like this is undefined behavior).

[codebox]

qq = 0;

for ( int i = 0; i < 5; i++ )
{
  sumCell = fVal1 * smWeights[qq++];
  sumCell += fVal2 * smWeights[qq++];
  sumCell += fVal3 * smWeights[qq++];
  sumCell += fVal4 * smWeights[qq++];

  smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;
}

[/codebox]
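To make the ordering point concrete, here is a minimal host-side C++ sketch (not the original kernel, and it says nothing about GPU register usage). It compares the one-increment-per-statement form with a form that drops the counter and derives the index from i. The names sumsWithCounter, sumsWithExplicitIndex, kRows and kWeightsPerRow are made up for illustration, and plain arrays stand in for the shared memory buffers.

```cpp
#include <cassert>

// Hypothetical stand-ins for the sizes used in the thread.
constexpr int kRows = 5;           // loop trip count (i < 5)
constexpr int kWeightsPerRow = 4;  // one weight each for fVal1..fVal4

// One qq++ per statement: each increment is sequenced before the next,
// so the order of the reads is well defined.
inline void sumsWithCounter(const float (&w)[kRows * kWeightsPerRow],
                            float f1, float f2, float f3, float f4,
                            float (&out)[kRows]) {
    int qq = 0;
    for (int i = 0; i < kRows; i++) {
        float sumCell = f1 * w[qq++];
        sumCell += f2 * w[qq++];
        sumCell += f3 * w[qq++];
        sumCell += f4 * w[qq++];
        out[i] = sumCell;
    }
    assert(qq == kRows * kWeightsPerRow);  // every weight was read exactly once
}

// No counter at all: the index is computed from the loop variable.
inline void sumsWithExplicitIndex(const float (&w)[kRows * kWeightsPerRow],
                                  float f1, float f2, float f3, float f4,
                                  float (&out)[kRows]) {
    for (int i = 0; i < kRows; i++) {
        const int base = i * kWeightsPerRow;
        float sumCell = f1 * w[base];
        sumCell += f2 * w[base + 1];
        sumCell += f3 * w[base + 2];
        sumCell += f4 * w[base + 3];
        out[i] = sumCell;
    }
}
```

Both functions read the same weights in the same order, so they should produce identical results. Whether the explicit-index form changes register allocation in the real kernel is something only the compiler output can tell you.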

I’m fairly new to all of this, but I think I’ll take a stab at it.

The for loop runs five times, and in each iteration four values of smWeights get ‘loaded’ into registers when they are accessed. My guess is that the extra registers come from holding those loaded values, four per statement, which would fit the 4*4 = 16 you are seeing. In the second for loop, you’re accessing the same smWeights elements in every iteration, so there are no additional loads from memory.

These two loops are VERY different. In the first for loop, you are accessing the values from smWeights[0] to smWeights[19], inclusive (five iterations, four increments of qq each). In the second for loop, you are only ever accessing the values from smWeights[0] to smWeights[3], inclusive.
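As a quick check of which elements each loop reads, here is a small host-side C++ sketch that only records indices. With five iterations and four increments per iteration, the running counter covers 0 through 19, while the constant-index loop never leaves 0 through 3. The function names indicesWithCounter and indicesWithConstants are invented for this example, and the inner k loop stands in for the four terms of the sum.

```cpp
#include <cassert>
#include <set>

// Indices read by the first loop: a running counter that is bumped
// once per term, four terms per iteration, five iterations.
inline std::set<int> indicesWithCounter() {
    std::set<int> touched;
    int qq = 0;
    for (int i = 0; i < 5; i++) {
        for (int k = 0; k < 4; k++) {
            touched.insert(qq++);
        }
    }
    return touched;
}

// Indices read by the second loop: the same four constants every time.
inline std::set<int> indicesWithConstants() {
    std::set<int> touched;
    for (int i = 0; i < 5; i++) {
        for (int k = 0; k < 4; k++) {
            touched.insert(k);
        }
    }
    return touched;
}
```

So the second loop can reuse four loaded weights across all iterations, while the first needs twenty distinct ones. That difference alone may explain part of the gap, though it is only a guess without looking at the generated code.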

Hi,

OK, maybe I omitted some stuff :) smWeights is a shared memory variable declared as: __shared__ float smWeights[ 5 * 4 ]; and it was loaded with values before the loop.

smResults is also a shared memory variable.

qq is just a local temporary, so each thread gets its own instance.

I’ve already tried to break the code into lines and even unroll the loop and do something like this:

sumCell = fVal1 * smWeights[0] + fVal2 * smWeights[1] + fVal3 * smWeights[2] + fVal4 * smWeights[3];
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;

sumCell = fVal1 * smWeights[4] + fVal2 * smWeights[5] + fVal3 * smWeights[6] + fVal4 * smWeights[7];
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;

sumCell = fVal1 * smWeights[8] + fVal2 * smWeights[9] + fVal3 * smWeights[10] + fVal4 * smWeights[11];
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;

sumCell = fVal1 * smWeights[12] + fVal2 * smWeights[13] + fVal3 * smWeights[14] + fVal4 * smWeights[15];
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;

sumCell = fVal1 * smWeights[16] + fVal2 * smWeights[17] + fVal3 * smWeights[18] + fVal4 * smWeights[19];
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] += sumCell;

It seems each sumCell = … line uses 4 registers, and I don’t understand why. The data is already in shared memory, so why the extra registers and the 2x performance drop?

thanks

eyal

If all your threads go through the loop, you might want to try adding a __syncthreads() at the end of the loop; that can help reduce registers in some cases. A #pragma unroll 5 would also be useful, since you appear to go over the loop a fixed number of times.

Hi,

I’ve manually unrolled the loop; please see the previous post. It still adds the registers and loses performance for some reason.

eyal

Have you tried simplifying your statements yet? Do each computation in a separate statement. Maybe you will see a pattern.

Yes, I did, as can be seen in the above post. Each line contributes 4 registers * 4 lines == 16 registers.

Did you read my first post?

How do you debug code? Break it down into smaller pieces.

You talk about the “sumCell =” line taking 4 registers… what about the smResults line?

Example of what I am talking about:

[codebox]

sumCell = fVal1 * smWeights[0];
sumCell += fVal2 * smWeights[1];
sumCell += fVal3 * smWeights[2];
sumCell += fVal4 * smWeights[3];

// By the way, what is i ? Maybe should be 0 here
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] = sumCell;

sumCell = fVal1 * smWeights[4];
sumCell += fVal2 * smWeights[5];
sumCell += fVal3 * smWeights[6];
sumCell += fVal4 * smWeights[7];

// i should be 1
smResults[ threadIdx.x + i * MAX_THREADS_PER_BLOCK ] = sumCell;

[/codebox]