Old fashioned profiling and benchmarking. There is no panacea, and there are no universally correct numbers. Your matrix add code and mine might be different and require different execution parameters for optimal performance. NVIDIA supply an occupancy calculator spreadsheet you can play around with to get a feel for how the GPU will schedule your kernel for different execution parameters.
You heard wrong. The per block limit is 512 on all current hardware. The thread scheduling hardware attached to each multiprocessor in CUDA capable GPUs can track a maximum of either 768 or 1024 threads, depending on which hardware you have. Those 768/1024 threads can come from multiple blocks, if other resources like registers and shared memory permit. Only 32 threads are executed simultaneously on any given multiprocessor (groups of 32 are called “warps” in CUDA terminology, and the warp is the basic scheduling unit). The other active threads are either queued for execution, or stalled waiting on pending memory transactions or instruction fetches. All of this is discussed in some detail in chapters 4 and 5 of the programming guide.
That quotation doesn’t contradict a word I wrote. On compute capability 1.1 hardware, the limit is 768 active/concurrent threads per multiprocessor (not per block, as you wrote), which are executed in SIMD fashion in warps of 32 threads. Up to 512 threads per block, a maximum of 8 active/concurrent blocks per multiprocessor, with a total of 8192 registers and 16 KB of shared memory per multiprocessor. Blocks are scheduled at the multiprocessor level, threads at the streaming processor level, but in groups (“warps”).
This all comes straight from appendix A of the programming guide. I invite you to read it for yourself if you are still skeptical.