During my previous experience with CUDA… I found that with -arch=sm_13 option on your register usage goes up by 2 to 5 registers… without me ever using double precision in the kernel… that time I was not sure… nor I had the time to investigate it…
but I recently started working on a new project and I thought of settling this matter… hence I wrote a small dumb kernel below:
__global__ void test(float *A, float *B, float *C, int sz)
{
unsigned int i = (blockIdx.x*blockDim.x+threadIdx.x);
float val;
if(i<sz){
val = __logf(__sinf(rsqrtf(((float4*)A)[i].x )));
((float4*)C)[i].x = __expf(val);
((float4*)C)[i].y = rsqrtf(((float4*)A)[i].y );
((float4*)C)[i].z = rsqrtf(((float4*)A)[i].z );
((float4*)C)[i].w = rsqrtf(((float4*)A)[i].w );
}
I am able to achieve a bandwidth of ~78 GB/sec on Tesla c1060 (Th max bandwidth of 102)… but my register/smem usage is different when I compile it with different arch=sm_xx options… as below…
architecture {sm_13},{sm_12},{sm_11}
smem = 44
reg = 6
architecture {sm_10}
smem = 40
reg = 4
I am quite confused why do we have this difference in register and smem usage even for a very basic kernel like that ?
I maybe missing some trivial point here… I guess… but this makes a big difference in one of my previous codes… where the sm_10 architecture gives me 10-15% performance improvement… due to less register usage , there by increasing my occupancy…
Anyone has clues on why is this happening ? and anyway to correct this ? (one way maybe to use -maxxregcount option…)
I have no idea whether it is the case or not, but it wouldn’t be beyond the realms of possibility that the inline expansion of those math library functions you are using is different for different architectures, would it?
Are you 100% sure of that? Which CUDA version do you use, and can you reproduce the sm_10 result?
The register count difference is one thing, but I can’t see how the smem size can be anything other than 44 (with 64-bit pointers) or 32 (32-bit pointers) in this case…
Strange… I can’t reproduce your results. I always get 44 smem and 6 regs in your example, even with sm_10 (nvcc -cubin -arch=sm_10)…
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Mon_Jul_13_12:35:34_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
If anything, I would expect that a register count difference occurs between SM 1.1 and SM 1.2 (slightly different register allocation policy), or between SM 1.2 and 1.3 (if there are double-precision litterals)… And definitely no smem difference.