Kernel Execution problem

When I set my threads-per-block with a dim3 variable larger than 16x16, the kernel will not launch. I have a GTX 570, and deviceQuery reports a maximum of 1024 threads per block, yet I can only use 256. Why is this?

How many registers does your kernel require per thread? You can add --ptxas-options=-v to your nvcc command line to get this information printed. It is possible that there are not enough registers to run more than 256 threads per block.
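If changing the build options is awkward, the same information can be read at run time. The following is only a rough sketch: `myKernel`, `threadsInBlock`, and `tryLaunch` are placeholder names and not from the original code. It queries the kernel's register count with `cudaFuncGetAttributes` and checks the launch with `cudaGetLastError`, which is where an over-sized block typically shows up as a launch error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real one.
__global__ void myKernel(float *out)
{
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    out[tid] = 1.0f;
}

// Total number of threads in a block of the given shape.
unsigned int threadsInBlock(dim3 b) { return b.x * b.y * b.z; }

// Prints the kernel's resource usage, then tries a single-block launch.
// Returns true if the launch and execution both succeeded.
bool tryLaunch(dim3 block)
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, myKernel);
    if (err != cudaSuccess) {
        printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n", attr.maxThreadsPerBlock);

    float *d_out = NULL;
    err = cudaMalloc(&d_out, threadsInBlock(block) * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    myKernel<<<1, block>>>(d_out);
    err = cudaGetLastError();              // reports launch-configuration errors
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();     // reports errors during execution
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess;
}
```

Calling `tryLaunch(dim3(16, 16))` and then `tryLaunch(dim3(32, 32))` should make it visible which block shapes the kernel accepts on a given card.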

I added that option to the command-line properties in VS 2010, but then it will not compile.

I was able to get it to compile, and it outputs:

ptxas : info : Used 52 registers, 184 bytes cumulative stack size, 52 bytes cmem[0]

Compute capability 2.0 has 32768 registers per multiprocessor, so at 52 registers per thread the maximum number of threads you could execute this kernel with is somewhat less than 630 (32768 / 52 ≈ 630). (Registers are allocated in chunks rather than one at a time, so you can't reach exactly 630 in this case.)

A block dimension of 16x32 should work, for example.
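The estimate above can be written out as a few lines of host code. This is a simplified sketch: the function names are made up for illustration, and it only rounds down to whole warps, so it ignores the finer register allocation rules and the real limit may be lower still.

```cuda
// Upper bound on threads from the register file alone:
// registers available divided by registers needed per thread.
inline int maxThreadsByRegisters(int regsPerSM, int regsPerThread)
{
    return regsPerSM / regsPerThread;
}

// Blocks run in whole warps, so round the bound down to a warp multiple.
inline int roundDownToWarps(int threads, int warpSize = 32)
{
    return (threads / warpSize) * warpSize;
}

// True if a bx-by-by block stays within the warp-rounded register bound.
inline bool blockFits(int bx, int by, int regsPerSM, int regsPerThread)
{
    int limit = roundDownToWarps(maxThreadsByRegisters(regsPerSM, regsPerThread));
    return bx * by <= limit;
}
```

With 32768 registers and 52 registers per thread, the raw bound is 630 threads and the warp-rounded bound is 608, so 16x32 (512 threads) fits and 32x32 (1024 threads) does not.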

Check out the occupancy calculator spreadsheet, which incorporates the register allocation granularity rules. There are typically several such rules per architecture, and they differ between architectures. You would not want to track that manually, which is why the spreadsheet is provided.
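As an alternative to the spreadsheet, later CUDA toolkits (6.5 and up, as far as I know) expose an occupancy query in the runtime API. A minimal sketch, assuming such a toolkit is installed; `myKernel` and `suggestedBlockSize` are placeholder names:

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; substitute the real one.
__global__ void myKernel(float *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

// Asks the runtime for a block size that maximizes occupancy for myKernel,
// taking its actual register and shared memory usage into account.
// Returns the suggested number of threads per block, or -1 on error.
int suggestedBlockSize()
{
    int minGridSize = 0;  // smallest grid that can reach full occupancy
    int blockSize = 0;    // suggested threads per block
    cudaError_t err = cudaOccupancyMaxPotentialBlockSize(
        &minGridSize, &blockSize, myKernel,
        0,   // no dynamic shared memory
        0);  // no upper limit on the block size
    return (err == cudaSuccess) ? blockSize : -1;
}
```

The result is a flat thread count, so for a 2D launch it still has to be factored into a dim3 shape such as 16x32.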