I am warming up for my thesis in the area of GPGPU and have implemented a GEMV kernel. It outperforms CUBLAS by a factor of 2 on compute capability 1.3 devices, but one thing puzzles me: when I compile the code on two different workstations, I get different register counts (10 vs. 12). I have no clue why.
Explaining this kind of weirdness is beyond me at the moment, but I would really like to know the reason. I mean, a ~16% smaller register footprint just by compiling somewhere else?!
Any suggestions?
The compiler (cl.exe) product versions are:
system A: 9.00.30729.01 (30-07-2008)
system B: 9.00.21022.08 (08-11-2007)
CUDA Toolkit: 2.3, 32-bit on both systems
Kernel:
[codebox]
// column-major indexing: element (row a, column b) lives at a + b*sizeN
#define matIdx(a, b) ((a) + (b)*sizeN)

__global__ void gemv(float *m, float *v, float *result, int sizeN){
    extern __shared__ float smem[];
    const int i = threadIdx.x + blockIdx.x*blockDim.x;  // row handled by this thread
    float sum = 0;
    for(int j = 0; j < sizeN; j += blockDim.x){
        // copy a tile of the vector to shared memory
        smem[threadIdx.x] = v[j + threadIdx.x];
        __syncthreads();
        for(int b = 0; b < blockDim.x; ++b){
            sum += m[matIdx(i, j + b)]*smem[b];
        }
        __syncthreads();
    }
    result[i] = sum;
}
[/codebox]
cubin PC A:
[codebox]
architecture {sm_10}
abiversion {1}
modname {cubin}
code {
name = _Z5gemvPfS_S_i
lmem = 0
smem = 32
reg = 12
bar = 1
bincode {
...
}
}
[/codebox]
cubin PC B:
[codebox]
architecture {sm_10}
abiversion {1}
modname {cubin}
code {
name = _Z5gemvPfS_S_i
lmem = 0
smem = 32
reg = 10
bar = 1
bincode {
...
}
}
[/codebox]