Register/SMEM Usage with different -arch=sm_xx not consistent..

Hi all…

During my previous experience with CUDA… I found that with -arch=sm_13 option on your register usage goes up by 2 to 5 registers… without me ever using double precision in the kernel… that time I was not sure… nor I had the time to investigate it…

but I recently started working on a new project and I thought of settling this matter… hence I wrote a small dumb kernel below:

__global__ void test(float *A, float *B, float *C, int sz)

	 {

		 unsigned int i = (blockIdx.x*blockDim.x+threadIdx.x);

		 float val;

	 if(i<sz){

	 val = __logf(__sinf(rsqrtf(((float4*)A)[i].x )));

		 ((float4*)C)[i].x = __expf(val);

		 ((float4*)C)[i].y = rsqrtf(((float4*)A)[i].y );

		 ((float4*)C)[i].z = rsqrtf(((float4*)A)[i].z );

		 ((float4*)C)[i].w = rsqrtf(((float4*)A)[i].w );

		 }

I am able to achieve a bandwidth of ~78 GB/sec on Tesla c1060 (Th max bandwidth of 102)… but my register/smem usage is different when I compile it with different arch=sm_xx options… as below…

architecture {sm_13},{sm_12},{sm_11}

smem = 44

reg  = 6

architecture {sm_10}

smem = 40

reg  = 4

I am quite confused why do we have this difference in register and smem usage even for a very basic kernel like that ?

I maybe missing some trivial point here… I guess… but this makes a big difference in one of my previous codes… where the sm_10 architecture gives me 10-15% performance improvement… due to less register usage , there by increasing my occupancy…

Anyone has clues on why is this happening ? and anyway to correct this ? (one way maybe to use -maxxregcount option…)

Thanks

I have no idea whether it is the case or not, but it wouldn’t be beyond the realms of possibility that the inline expansion of those math library functions you are using is different for different architectures, would it?

Good Call…

It is indeed the math functions… I just checked the reg usage by removing the math functions, they all are consistent now.

So this means… math function of sm_10 use less registers… – hmm something to keep in mind. External Media

Are you 100% sure of that? Which CUDA version do you use, and can you reproduce the sm_10 result?

The register count difference is one thing, but I can’t see how the smem size can be anything other than 44 (with 64-bit pointers) or 32 (32-bit pointers) in this case…

I agree… but it was different for some unknown reason the semes is off by 4 bytes…

I recompiled on of my other kernels… that has same smem and same register usage across all architectures even though it uses… these math functions…

This hints me towards the fact that the optimization which the compiler performs is slightly different which mite account to this change…

I am using nvcc 2.3 on tesla c1060.

Strange… I can’t reproduce your results. I always get 44 smem and 6 regs in your example, even with sm_10 (nvcc -cubin -arch=sm_10)…

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Mon_Jul_13_12:35:34_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

If anything, I would expect that a register count difference occurs between SM 1.1 and SM 1.2 (slightly different register allocation policy), or between SM 1.2 and 1.3 (if there are double-precision litterals)… And definitely no smem difference.