How to determine register usage

Assume a GTX 460, compute capability 2.1. I have lots of questions relating to registers.

How do I determine the register usage of the automatic variables in my kernel?
What is the best way to calculate or view register usage?
Regarding keeping the register count low, are there any best practices to follow?

Given the following:
__global__ void myKernel(int* Id, int* Jd, int* Kd, char c1, char c2, int i1)
{
    // Id, Jd, and Kd point to integer arrays allocated in global memory

    ...

    int bx = blockIdx.x;
    int by = blockIdx.y;
    int bn = bx + (gridDim.x * by);  // linear block number
    int tx = threadIdx.x;

Would it be better, from a register usage perspective, to allocate the integer arrays Id, Jd, and Kd as one big array instead of three (see the sketch after these questions)?
Would using an unsigned char for i1 instead of an int save 3 bytes of register space per thread (assuming a char is sufficient for the intended use)?
bx, by, and tx are just convenient short aliases for blockIdx.x/y and threadIdx.x. Will the compiler optimize these away, or am I actually increasing my register usage by introducing them?
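
For concreteness, the "one big array" variant I have in mind would look something like this (the n parameter is made up; everything else is from the kernel above):

__global__ void myKernelPacked(int* base, int n, char c1, char c2, int i1)
{
    // Carve one allocation of 3 * n ints into the three logical arrays.
    int* Id = base;          // first n ints
    int* Jd = base + n;      // next n ints
    int* Kd = base + 2 * n;  // last n ints
    // ... same body as before ...
}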

The kernel argument list is passed in constant memory on compute 2.x devices and uses no registers. Automatic variables live in virtual PTX registers; the actual mapping to hardware registers (and any spilling to local memory) is done by the assembler, over which the compiler front end has no influence. The actual register usage of kernel code is available by passing -Xptxas="-v" to nvcc during compilation, which will make the compiler report something like this:

~/fimcode$ nvcc --cubin -Xptxas="-v" -gencode arch=compute_20,code=sm_20 fim.cu -o fim.cubin
ptxas info    : Compiling entry function '_Z10fimIterateIdLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_20'
ptxas info    : Used 35 registers, 1956+0 bytes smem, 104 bytes cmem[0], 4 bytes cmem[16]
ptxas info    : Compiling entry function '_Z6fimTagIdLj8EEvPKT_S2_PiS3_jjj' for 'sm_20'
ptxas info    : Used 8 registers, 512+0 bytes smem, 76 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7fimTag2IdLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_20'
ptxas info    : Used 35 registers, 2464+0 bytes smem, 92 bytes cmem[0]
ptxas info    : Compiling entry function '_Z10fimIterateIfLj8EEvPKT_PKiPS0_PiS0_S0_S0_jjji' for 'sm_20'
ptxas info    : Used 23 registers, 1236+0 bytes smem, 92 bytes cmem[0], 8 bytes cmem[16]
ptxas info    : Compiling entry function '_Z6fimTagIfLj8EEvPKT_S2_PiS3_jjj' for 'sm_20'
ptxas info    : Used 8 registers, 512+0 bytes smem, 76 bytes cmem[0]
ptxas info    : Compiling entry function '_Z7fimTag2IfLj8EEvPT_PiS2_S0_S0_S0_jjj' for 'sm_20'
ptxas info    : Used 21 registers, 1744+0 bytes smem, 80 bytes cmem[0], 8 bytes cmem[16]
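
If you would rather check at runtime than parse compiler output, cudaFuncGetAttributes reports the same per-kernel numbers; a minimal sketch (myKernel is the kernel from your question):

#include <cstdio>
#include <cuda_runtime.h>

void reportUsage()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, myKernel);   // kernel from the question
    printf("registers per thread   : %d\n",   attr.numRegs);
    printf("static smem per block  : %zu bytes\n", attr.sharedSizeBytes);
    printf("constant memory        : %zu bytes\n", attr.constSizeBytes);
    printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
}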

So in what you posted, is each of these a kernel function? And can I assume that where it says "used X registers", this will correlate with what is reported for shared memory use and with what I know to be the number of threads per block?

So kernel parameters are kept in constant memory, even pointers? And is constant memory access as fast as access to shared memory?

Yes.

I don’t quite get the question. What you see there is the assembler’s final number of registers allocated per thread (also local memory if any is used), and the final tallies of statically assigned shared memory and constant memory, per block. You still need to account for any dynamically allocated shared memory yourself.
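
For completeness, dynamically allocated shared memory looks like the following made-up sketch; ptxas cannot count it because its size is only fixed at launch time, via the third <<<>>> launch parameter:

__global__ void scaleKernel(float* data)
{
    extern __shared__ float tile[];   // size supplied at launch, not at compile time
    tile[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = 2.0f * tile[threadIdx.x];
}

void launchScale(float* d_data)
{
    // Third launch parameter = bytes of dynamic shared memory per block.
    scaleKernel<<<1, 256, 256 * sizeof(float)>>>(d_data);
}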

Yes, even pointers.

Pretty close. Constant memory has an on-die cache which I assume has roughly the same ~1000 GB/s bandwidth and 20-30 cycle latency that the global memory caches and shared memory have. Constant memory also has a broadcast mechanism which can serve every thread in a warp in a single transaction, and the argument list is constant for every block in the grid, so you would expect constant cache hit ratios to be very high and the overall performance good.
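
To make the broadcast idea concrete, here is a made-up sketch: every thread in the warp reads the same __constant__ address, so one constant-cache transaction serves the whole warp, much as happens with the argument list.

__constant__ float coeffs[16];        // hypothetical coefficient table

__global__ void applyCoeff(float* data, int which)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // Uniform address across the warp -> a single broadcast transaction
    // from the constant cache rather than one load per thread.
    data[idx] *= coeffs[which];
}

// Host side: cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(coeffs));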