Is it true that threads in a block execute on a single processor — so that one should use a maximum number of blocks (and minimize the number of threads per block) to achieve maximum parallelism?
Or are these pure abstractions that bear no relation to what goes on at the hardware level?
Yes, if by "processor" you mean a multiprocessor, which is a SIMT unit of 8 scalar cores.
No. There are fixed overheads and latencies in compute capability 1.0/1.1/1.2/1.3 hardware which require a minimum of 192 active threads per multiprocessor to be amortized. Also, the warp size on all hardware is 32 threads, so you should have at least 32 threads per block and at least 6 active warps (6 warps of 32 threads is the same 192 threads) per multiprocessor to get anything like peak performance. The number of blocks launched should also be a whole multiple of the number of multiprocessors on a given device, so that no multiprocessor sits idle while others finish a partial last round of blocks.
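To make the arithmetic concrete, here is a rough host-side sketch of those rules of thumb in plain C++. The names `LaunchConfig`, `round_up`, `pick_launch`, `active_warps_per_mp` and `meets_minimum` are made up for illustration and are not part of the CUDA API. The sketch also assumes you already know how many blocks are resident on one multiprocessor at a time, which in practice depends on the kernel's register and shared memory usage.

```cpp
#include <cassert>

// Hypothetical helpers, not part of the CUDA API. On a real device the
// multiprocessor count and warp size would come from cudaGetDeviceProperties
// (fields multiProcessorCount and warpSize).
struct LaunchConfig {
    int threads_per_block;
    int blocks;
};

// Round n up to the next multiple of m (requires m > 0).
inline int round_up(int n, int m) {
    assert(m > 0);
    return ((n + m - 1) / m) * m;
}

// Choose a block size that is a whole number of warps and a block count
// that is a whole multiple of the multiprocessor count.
// Because both values are rounded up, the grid may contain more threads
// than total_threads, so the kernel must bounds-check its index.
inline LaunchConfig pick_launch(int total_threads, int requested_block_size,
                                int mp_count, int warp_size = 32) {
    int tpb = round_up(requested_block_size, warp_size);
    int blocks_needed = (total_threads + tpb - 1) / tpb;  // ceiling division
    int blocks = round_up(blocks_needed, mp_count);
    return LaunchConfig{tpb, blocks};
}

// Warps active on one multiprocessor, given how many blocks are resident
// on it at once (an input here; assumed known for this sketch).
inline int active_warps_per_mp(int threads_per_block, int resident_blocks_per_mp,
                               int warp_size = 32) {
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    return warps_per_block * resident_blocks_per_mp;
}

// The rule of thumb from the answer: at least 6 active warps
// (192 threads) per multiprocessor on compute capability 1.x hardware.
inline bool meets_minimum(int threads_per_block, int resident_blocks_per_mp) {
    return active_warps_per_mp(threads_per_block, resident_blocks_per_mp) >= 6;
}
```

For example, `pick_launch(100000, 200, 30)` rounds the block size up to 224 threads (7 warps) and the grid up from 447 to 450 blocks for a device with 30 multiprocessors. One block of 32 threads per multiprocessor gives a single active warp, well short of the 6 suggested above, which is why minimizing threads per block works against you.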