The names “CUDA core” and the older “streaming processor (SP)” are somewhat misleading. A CUDA core is not like a CPU core at all: it has no registers or instruction scheduler of its own. A CUDA core is better thought of as a fancy pipelined arithmetic unit.
The multiprocessor (MP) is the entity that schedules instructions and manages the shared memory, cache, and register file. Think of the multiprocessor as a CPU core that executes vector instructions of length 32 (i.e. the warp size). The number of CUDA cores per multiprocessor tells you the rate at which the multiprocessor can execute these warp instructions.

With only 8 CUDA cores per MP, compute capability 1.x devices could finish one warp instruction every 4 clocks. With compute capability 2.0, the 32 CUDA cores per MP were split into two groups of 16. Each group can complete a warp instruction every 2 clock cycles, and two instruction scheduler units feed the two groups. Double precision instructions required all 32 cores to work together on one warp. Compute capability 2.1 gives each MP three groups of 16 CUDA cores, but no additional instruction schedulers. Instead, a scheduler can issue a second, independent instruction from one warp into the third group of 16 CUDA cores. This is not always possible, so the compute throughput per MP of compute capability 2.1 usually lands somewhere between 1.0 and 1.5 times that of capability 2.0.
“Threads” in CUDA are nothing like CPU threads, so it is not useful (in my opinion) to talk about thread concurrency beyond the definition used for occupancy: how many warps are actively available to the instruction schedulers on the MP. More warps means more opportunities to hide global memory latency, but changing the block size can also change your memory access patterns, so you have to benchmark several configurations if you really want to be sure you are getting maximum throughput. (I usually stick with 256 threads per block unless I get to the point where squeezing out every last bit of speed matters.)