Compute capability and the number-of-threads limit

I have a 64-bit app running on a GTX 580 under 64-bit Windows 7. I use CUDA toolkit 3.2.
When the thread count is set to 600, I get the "too many resources requested" error; when it is set to 400, the app runs fine.
I believe the GTX 580 has compute capability 2.0, which allows 1024 threads per block, right?
So do I need to set some parameter to get the 1024 limit? How can I work around this problem?
Thanks in advance.

Compute capability 2.0 allows up to 1024 threads per block, but that is only an upper bound. A kernel's register and shared memory usage can push the actual limit for that kernel below 1024. This is discussed in several places in the documentation. NVIDIA provides an occupancy calculator spreadsheet that you can use to analyse the limits for a given kernel.
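To make the register limit concrete, here is a rough sketch in plain C++ (host-side arithmetic only, no CUDA calls). It assumes the compute capability 2.0 figures of 32768 32-bit registers per multiprocessor and a warp size of 32. It ignores shared memory and the hardware's register allocation granularity, so the real limit can be lower. The function name `approxMaxThreadsPerBlock` and the register counts used with it are hypothetical; the actual per-thread register count of a kernel is reported when compiling with `nvcc --ptxas-options=-v`.

```cpp
#include <algorithm>
#include <cassert>

// Assumed compute capability 2.0 limits (see the lead-in above).
const int kRegistersPerSM     = 32768; // 32-bit registers per multiprocessor
const int kMaxThreadsPerBlock = 1024;  // architectural upper bound
const int kWarpSize           = 32;

// Rough estimate of the largest block size that fits in the register file.
// Simplification: ignores shared memory and register allocation granularity.
int approxMaxThreadsPerBlock(int regsPerThread) {
    assert(regsPerThread > 0);
    int threadsByRegisters = kRegistersPerSM / regsPerThread;
    // Threads are scheduled in whole warps, so round down to a warp multiple.
    int wholeWarps = threadsByRegisters / kWarpSize;
    return std::min(kMaxThreadsPerBlock, wholeWarps * kWarpSize);
}
```

Under these assumptions, a kernel using 63 registers per thread would top out at 512 threads per block. That would be consistent with 400 threads working and 600 failing, though the actual register count of the kernel in question is not known here.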

The only way to increase the threads per block is to reduce the register and/or shared memory footprint of your code. But you might want to do some benchmarking and see whether there is any benefit from running such large blocks anyway. You might be surprised by the results.

You could even find that your application runs better with 128 or 160 threads per block. The occupancy calculator will help you choose good block sizes.

NB: try to keep the threads per block a multiple of 32, and if you can, avoid sizes just over a multiple of 32. By that I mean 31 (days in a month) isn't bad, but 33 (one warp of 32 threads and another warp of just 1 thread) is best avoided.
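The 31 versus 33 comparison can be checked with a little arithmetic. The sketch below, again in plain C++ with hypothetical helper names (`warpsPerBlock`, `idleLanes`), counts how many warps a block occupies and how many thread slots in its last warp go unused, assuming a warp size of 32.

```cpp
#include <cassert>

const int kWarpSize = 32; // threads per warp

// Number of warps needed to hold a block (rounded up).
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Thread slots in the block's last warp that do no work.
int idleLanes(int threadsPerBlock) {
    return warpsPerBlock(threadsPerBlock) * kWarpSize - threadsPerBlock;
}
```

With these helpers, a block of 31 threads occupies 1 warp with 1 idle slot, while a block of 33 threads occupies 2 warps with 31 idle slots. Block sizes such as 128 or 160 leave no idle slots.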