cudaErrorLaunchOutOfResources aka "too many resources requested for launch"

I am getting error 7 (cudaErrorLaunchOutOfResources or “too many resources requested for launch”) for the following config:
GeForce GT 540M with 1GB.
Compute Capability 2.1 (so 1024 threads per block should be possible; I have other kernels working just fine)
block {1024, 1, 1}
grid {1, 1, 1}
No shared memory
No texture
Arguments to the kernel: (uint32_t, uint32_t, uint32_t, float2*, float2*)
Locals: 2 uint32_t, 4 float2, 1 float
How large is the register file anyway?
It runs fine when I drop block.x to 512 and double grid.x.
Many thanks in advance.

Add the command line flag -Xptxas -v to the nvcc invocation to check how many registers the kernel is using.

Note that simply multiplying the number of registers reported by the thread count can underestimate the total register usage, since architecture-specific granularity applies to register allocation. The occupancy calculator that ships with CUDA incorporates this granularity.
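To make the granularity point concrete, here is a rough host-side sketch in plain C++ (no CUDA runtime calls). It assumes registers are handed out per warp and rounded up to a fixed allocation unit. The function names and the unit of 64 registers used in the example are my assumptions for illustration, not figures taken from this thread; check the occupancy calculator for the real values on your architecture.

```cpp
#include <cstdint>

// Round `value` up to the next multiple of `unit` (unit must be > 0).
constexpr std::uint32_t round_up(std::uint32_t value, std::uint32_t unit) {
    return ((value + unit - 1) / unit) * unit;
}

// Estimate the registers reserved for one block, ASSUMING allocation happens
// per warp and each warp's allocation is rounded up to `alloc_unit` registers.
// `alloc_unit` is architecture-specific; the caller has to supply it.
constexpr std::uint32_t regs_per_block(std::uint32_t regs_per_thread,
                                       std::uint32_t threads_per_block,
                                       std::uint32_t warp_size,
                                       std::uint32_t alloc_unit) {
    return (round_up(threads_per_block, warp_size) / warp_size)   // warps in the block
         * round_up(regs_per_thread * warp_size, alloc_unit);     // registers per warp
}
```

With the 49 registers per thread reported further down in this thread, a warp size of 32 and an assumed unit of 64, regs_per_block(49, 1024, 32, 64) gives 51200, somewhat more than the plain product 49 x 1024 = 50176.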

From compilation:
ptxas info : Used 49 registers, 64 bytes cmem[0], 8 bytes cmem[14]

From deviceQuery:
Total number of registers available per block: 32768

49 x 1024 = 50176 > 32768, so a 1024-thread block cannot fit in the register file and I have to resize the block :-)
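The same check can be written as a small host-side helper in plain C++. This is only a sketch: the function names are made up for illustration, it uses the simple product of registers and threads (so it ignores allocation granularity and may be slightly optimistic), and it does not apply the hardware cap on threads per block.

```cpp
#include <cstdint>

// True if `threads` threads at `regs_per_thread` registers each stay within
// the per-block register limit (simple product, no allocation granularity).
constexpr bool fits_register_budget(std::uint32_t regs_per_thread,
                                    std::uint32_t threads,
                                    std::uint32_t regs_per_block_limit) {
    return regs_per_thread * threads <= regs_per_block_limit;
}

// Largest block size that fits the register limit, rounded down to a whole
// number of warps. regs_per_thread and warp_size must be > 0.
constexpr std::uint32_t max_block_size(std::uint32_t regs_per_thread,
                                       std::uint32_t regs_per_block_limit,
                                       std::uint32_t warp_size) {
    return (regs_per_block_limit / regs_per_thread) / warp_size * warp_size;
}
```

For the numbers above, fits_register_budget(49, 1024, 32768) is false, fits_register_budget(49, 512, 32768) is true, and max_block_size(49, 32768, 32) comes out at 640 threads. If a 1024-thread block is really needed, the kernel would have to get down to 32 registers per thread (32768 / 1024); nvcc's -maxrregcount option or the __launch_bounds__ qualifier can be used to ask the compiler for that, possibly at the cost of register spills.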

Thank you very much njuffa!
BTW, which manual describes the compiler options?

The nvcc options are documented in CUDA_Compiler_Driver_NVCC.pdf (in the doc/ directory of the CUDA toolkit).