CUDA driver props cudaGetDeviceProperties

I run cudaGetDeviceProperties on the device I initialize, but it gives me really bogus limits on the maximum thread and block dimensions: something like 512, 512, 64 for maxThreadsDim and 65536, 65536, 1 for maxGridSize. Of course, kernel invocations stop working much sooner than that, returning an invalidConfiguration error.

What am I doing wrong?

Those values look right. These are the maximums for each dimension of a block and each dimension of a grid.

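Each block may contain at most maxThreadsPerBlock threads in total, so the product of the three block dimensions must not exceed that value even when each dimension is within its own limit. Each thread also uses a certain number of registers, and there are only regsPerBlock registers available per block. Either of these might be the source of your error.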

I don’t know of any additional limits on the number of blocks in a grid, other than the dimension limits.

So should I instead take the register count and divide it by the number of registers used by my kernel? If so, how can I tell how many registers a given kernel uses?

The CUDA Occupancy Calculator can help in this regard, although I see the spreadsheet hasn’t been updated for Fermi cards yet. That isn’t hard to work around, though: you just need to enter the physical limits for Fermi.

You can pass --ptxas-options=-v to nvcc, which makes the PTX assembler print verbose resource usage, including the final register count and shared memory for each of your kernels.