Register count varies? Compiling on different systems gives different register counts?

I am warming up for my thesis in the area of GPGPU and have implemented a gemv kernel. It outperforms CUBLAS by a factor of 2 on compute capability 1.3 devices, but this puzzles me: when I compile the code on two different workstations, I get different register counts (10 vs 12). I have no clue why.

Explaining this kind of weirdness is beyond me for now, but I would really like to know the reason. I mean, 16% fewer registers just by compiling somewhere else?!

Any suggestions?

The host compiler (cl.exe) product versions are:

System A: 9.00.30729.01 (30-07-2008)

System B: 9.00.21022.08 (08-11-2007)

CUDA Toolkit: 2.3, 32-bit on both systems

Kernel:

[codebox]
#define matIdx(a,b) ((a)+(b)*sizeN)

__global__ void gemv(float *m, float *v, float *result, int sizeN){
	extern __shared__ float smem[];
	const int i = threadIdx.x + blockIdx.x*blockDim.x;
	float sum = 0;

	for(int j = 0; j < sizeN; j += blockDim.x){
		// copy a blockDim.x-wide slice of the vector to shared memory
		smem[threadIdx.x] = v[j + threadIdx.x];
		__syncthreads();

		// accumulate the dot product of row i against the cached slice
		for(int b = 0; b < blockDim.x; ++b){
			sum += m[matIdx(i, j + b)]*smem[b];
		}
		__syncthreads();
	}

	result[i] = sum;
}
[/codebox]
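For reference, here is a minimal host-side launch sketch for the kernel above; the wrapper name and the block size of 128 are my own assumptions, not from the post. It mainly shows the two constraints the kernel relies on: sizeN must be a multiple of blockDim.x, and the dynamic shared memory allocation must hold blockDim.x floats.

[codebox]
// Hypothetical launch wrapper; names and block size are assumptions.
void launch_gemv(float *d_m, float *d_v, float *d_result, int sizeN)
{
	const int blockSize = 128;                     // sizeN must be a multiple of this
	dim3 grid(sizeN / blockSize);                  // one thread per result element
	dim3 block(blockSize);
	size_t smemBytes = blockSize * sizeof(float);  // dynamic shared memory for the vector slice
	gemv<<<grid, block, smemBytes>>>(d_m, d_v, d_result, sizeN);
}
[/codebox]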

cubin PC A:

[codebox]
architecture {sm_10}
abiversion {1}
modname {cubin}
code {
	name = _Z5gemvPfS_S_i
	lmem = 0
	smem = 32
	reg  = 12
	bar  = 1
	bincode {
		...
	}
}
[/codebox]

cubin PC B:

[codebox]
architecture {sm_10}
abiversion {1}
modname {cubin}
code {
	name = _Z5gemvPfS_S_i
	lmem = 0
	smem = 32
	reg  = 10
	bar  = 1
	bincode {
		...
	}
}
[/codebox]

Are you sure the nvcc/nvopencc versions are the same? The host C compiler doesn't play a role in determining device code register usage; that should come down to the PTX emitted by the device compiler (nvopencc) and the PTX-to-binary assembly done by ptxas.
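A quick way to check, for anyone following along (the file name gemv.cu is just a placeholder): compare the toolchain versions on both machines and let ptxas report the per-kernel resource usage directly.

[codebox]
nvcc --version                       # reports the CUDA compiler driver version on each machine
nvcc --ptxas-options=-v gemv.cu      # ptxas prints register/smem/lmem usage per kernel
nvcc -keep gemv.cu                   # keeps the intermediate .ptx and .cubin for comparison
[/codebox]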

If you are interested in gemv(), you might want to have a look at this thread. I did a little bit of work on optimizing a kernel for solving r=Ax-b, which also performed a lot better than cublas. I would be interested to see the relative performance of our two kernels.

Aha! On system B I have to 'cheat' a bit in order to compile CUDA code: the university administrators kindly installed the 64-bit toolkit on workstations that only have the 32-bit host compilers. To be able to compile at all, I copy (as I can't install anything) the 32-bit toolkit binaries into the bin folder of the 64-bit toolkit, so the nvopencc being used actually comes from the other toolkit.

It appears that the two toolkits optimize slightly differently:

32-bit only: 928143 instructions, 12 registers in kernel

hacky 32-bit injection: 929944 instructions, 10 registers in kernel, BUT 7% slower execution time

Nice to know! Thanks =)

(I wonder if kernels in 64-bit programs will gain from injecting 64-bit binaries into the bin64 folder of the 32-bit toolkit…?)

When you say solve, I bet you mean compute ;-) I’ll check it out!