Register usage higher when using constant in loop

Hi there!

I have a kernel for which I wanted to reduce the number of registers, but I found some strange behavior. Have a look at the following code:

__global__ void multiplyDenseMatrixCenter(float* md, float* v, float* r, float alpha, int bx, int gx)
{
	int k = threadIdx.x + bx + 3;
	int l = blockIdx.x + gx + 3;
	float s = 0.0f;
	for (int jy = -3; jy <= 3; jy++)
		for (int jx = -3; jx <= min(d_numupX - 1 - k, 3); jx++)
		{
			int iV = k + jx + __mul24(l + jy, d_numupX);
			int iM = k + min(0, jx) + denseIndexJumps[jx + 3] + __mul24(l + min(0, jy) + denseIndexJumps[jy + 3], d_DMSize);
			s += md[iM] * v[iV];
		}
	r[k + __mul24(l, d_numupX)] = s * alpha;
}

This kernel uses 13 registers. Now, if I replace the “min(d_numupX - 1 - k, 3)” with just “3” (this gives the same results, because in my case 3 is never larger than d_numupX - 1 - k), like this:

__global__ void multiplyDenseMatrixCenter(float* md, float* v, float* r, float alpha, int bx, int gx)
{
	int k = threadIdx.x + bx + 3;
	int l = blockIdx.x + gx + 3;
	float s = 0.0f;
	for (int jy = -3; jy <= 3; jy++)
		for (int jx = -3; jx <= 3; jx++)
		{
			int iV = k + jx + __mul24(l + jy, d_numupX);
			int iM = k + min(0, jx) + denseIndexJumps[jx + 3] + __mul24(l + min(0, jy) + denseIndexJumps[jy + 3], d_DMSize);
			s += md[iM] * v[iV];
		}
	r[k + __mul24(l, d_numupX)] = s * alpha;
}

it suddenly uses 17 registers! How can that be? Can somebody explain to me why this happens? After all, I am trying to reduce the register pressure, not increase it.
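
For reference, per-kernel register counts like the ones above can be read from the verbose ptxas output; one way to get them (multiply.cu is just a placeholder file name here) is:

	nvcc --ptxas-options=-v -c multiply.cu

ptxas then prints a line like “ptxas info : Used N registers” for each kernel.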

nvcc --version output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2010 NVIDIA Corporation
Built on Thu_Sep__9_16:01:02_PDT_2010
Cuda compilation tools, release 3.2, V0.2.1221

My OS is Mac OS X 10.6.4.

Thanks in advance!

Just a guess, but with known trip counts for both loops, the compiler may well be unrolling them during optimization. The PTX code will give a better idea of what is happening.
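
For example, you can dump the PTX for both variants and compare them (kernel.cu is a placeholder for your source file):

	nvcc -ptx kernel.cu -o kernel.ptx

If the loop body appears repeated in the PTX instead of a branch back to the loop head, the compiler has unrolled the loop.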

Thanks, that was it. A #pragma unroll 1 “fixed” it. The strange thing is that this also happened when building in debug mode. I expected the compiler to perform optimizations like loop unrolling only with the debug flags off. Are there other optimizations the compiler performs even in debug builds? This is really non-intuitive.
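
For anyone finding this thread later, here is a minimal sketch of the fix described above. The __constant__ declarations are assumptions; the original post does not show how d_numupX, d_DMSize, and denseIndexJumps are declared:

// Assumed declarations of the constant-memory symbols used by the kernel.
__constant__ int d_numupX;
__constant__ int d_DMSize;
__constant__ int denseIndexJumps[7];

__global__ void multiplyDenseMatrixCenter(float* md, float* v, float* r, float alpha, int bx, int gx)
{
	int k = threadIdx.x + bx + 3;
	int l = blockIdx.x + gx + 3;
	float s = 0.0f;
	#pragma unroll 1	// keep the outer loop rolled
	for (int jy = -3; jy <= 3; jy++)
		#pragma unroll 1	// keep the inner loop rolled, avoiding the register growth
		for (int jx = -3; jx <= 3; jx++)
		{
			int iV = k + jx + __mul24(l + jy, d_numupX);
			int iM = k + min(0, jx) + denseIndexJumps[jx + 3] + __mul24(l + min(0, jy) + denseIndexJumps[jy + 3], d_DMSize);
			s += md[iM] * v[iV];
		}
	r[k + __mul24(l, d_numupX)] = s * alpha;
}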
