Dumb question on threads per block

I have a system with C1060s and I’m trying to convert the “template” project into my own routine. The problem I’m having is that the runtime gives me an error if I request more than 512 threads per block, but the data sheets say I should have 1024 threads per block. I have forced the compiler to use sm_13 with the line

CUFILES_sm_13 := Lebesgue_genpower.cu

which does what I expect with nvcc:

/usr/local/cuda/bin/nvcc -o obj/x86_64/release/Lebesgue_genpower.cu_sm_13.o -c Lebesgue_genpower.cu --compiler-options -fno-strict-aliasing -I. -I/usr/local/cuda/include -I…/…/common/inc -I…/…/…/shared//inc -DUNIX -O2 -arch sm_13 …

How do I tell the runtime which compute level to use? I thought cudaSetDevice() would set that up automatically.

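The 1024 number is the maximum number of threads per multiprocessor, not threads per block. The threads per block limit is 512, so a launch that requests more than that fails no matter which architecture you compile for. A full list of the hardware resource limits is given in Appendix A of the programming guide.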
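If you would rather check the limits at run time than look them up, something like the sketch below should work. It is only an illustration: reportThreadLimits and blocksFor are hypothetical helper names, not part of the template project, and it assumes a toolkit recent enough that cudaDeviceProp has the maxThreadsPerMultiProcessor field.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements
// when each block holds threadsPerBlock threads (ceiling division).
static int blocksFor(int n, int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical helper: print the two limits that are easy to confuse.
// Returns the per-block limit, or -1 if the device query fails.
static int reportThreadLimits(int device) {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                cudaGetErrorString(err));
        return -1;
    }
    printf("%s (compute capability %d.%d)\n",
           prop.name, prop.major, prop.minor);
    printf("  max threads per block:          %d\n",
           prop.maxThreadsPerBlock);
    printf("  max threads per multiprocessor: %d\n",
           prop.maxThreadsPerMultiProcessor);
    return prop.maxThreadsPerBlock;
}
```

Calling reportThreadLimits(0) from main() prints both numbers for device 0. The returned per-block limit can then be used as the block size, with blocksFor(n, limit) giving the grid size, instead of hard-coding 512 or 1024.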

Thanks, I knew it was a dumb question :-)