I run cudaGetDeviceProperties on the device I initialize, but it gives me really bogus limits on the maximum thread and block dimensions. The limits are something like 512, 512, 64 for maxThreadsDim and 65536, 65536, 1 for maxGridSize, but of course the kernel invocations stop working much sooner, with an invalidConfiguration error returned.
Those values look right. These are the maximums for each dimension of a block and each dimension of a grid.
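Each block may only contain maxThreadsPerBlock threads in total, counted across all three dimensions (x * y * z), so a block can stay within every per-dimension maximum and still be too large. Each thread also uses a certain number of registers, and there are only regsPerBlock registers available per block. Either of these limits might be the source of your error.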
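To make that concrete, here is a minimal sketch of checking a block shape against the reported limits before launching. The helper name blockFits, the use of device 0, and the two example block shapes are assumptions for illustration only; they are not from your code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if `block` respects both the per-dimension
// limits and the separate total-threads-per-block limit.
static bool blockFits(const cudaDeviceProp &prop, dim3 block)
{
    bool dimsOk = block.x <= (unsigned)prop.maxThreadsDim[0] &&
                  block.y <= (unsigned)prop.maxThreadsDim[1] &&
                  block.z <= (unsigned)prop.maxThreadsDim[2];

    // The product of the three dimensions is limited independently.
    unsigned long long total =
        (unsigned long long)block.x * block.y * block.z;

    return dimsOk && total <= (unsigned long long)prop.maxThreadsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("maxThreadsDim      = %d, %d, %d\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);
    printf("regsPerBlock       = %d\n", prop.regsPerBlock);

    dim3 small(32, 16, 1);    // 512 threads in total
    dim3 large(512, 512, 1);  // 262144 threads in total

    printf("32x16x1   fits: %s\n", blockFits(prop, small) ? "yes" : "no");
    printf("512x512x1 fits: %s\n", blockFits(prop, large) ? "yes" : "no");
    return 0;
}
```

This only covers the dimension and thread-count limits. A launch can still fail if the kernel needs more registers or shared memory than a block of that size can get.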
I don’t know of any additional limits on the number of blocks in a grid, other than the dimension limits.
So should I instead take the register count and divide it by the number of registers used by my kernel? If so, how can I tell how many registers a given kernel uses?
The CUDA Occupancy Calculator can help in this regard, although I see the spreadsheet hasn’t been updated for Fermi cards yet. That isn’t hard to work around, though: you just need to enter the physical limits for Fermi yourself.
You can pass --ptxas-options=-v to nvcc, which makes the PTX assembler (ptxas) print verbose output, including the final register count and shared memory usage for each of your kernels.
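If you would rather get the numbers at runtime, the runtime API also has cudaFuncGetAttributes, which reports the register count for a compiled kernel. The sketch below is only an illustration: the kernel scale is a made-up placeholder, device 0 is assumed, and the simple division ignores register allocation granularity, so treat the result as a rough upper bound rather than an exact limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, for illustration only.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= factor;
    }
}

int main()
{
    // Build with verbose ptxas output to see the same register count
    // at compile time:  nvcc --ptxas-options=-v example.cu
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 assumed
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    cudaFuncAttributes attr;
    err = cudaFuncGetAttributes(&attr, scale);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread    : %d\n", attr.numRegs);
    printf("static shared mem (B)   : %zu\n", attr.sharedSizeBytes);

    // Rough estimate: registers available per block divided by registers
    // per thread. Real hardware allocates registers in chunks, so the
    // true limit can be lower than this.
    if (attr.numRegs > 0) {
        printf("threads/block by regs   : %d (estimate)\n",
               prop.regsPerBlock / attr.numRegs);
    }

    // The runtime's own per-kernel limit, which already accounts for
    // the kernel's resource usage on this device.
    printf("kernel maxThreadsPerBlock: %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

The attr.maxThreadsPerBlock value is the one to trust when sizing blocks for that kernel. The manual division is mainly useful for seeing why the limit is what it is.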