Despite the unfortunate name “CUDA core”, you should not think of a CUDA core as a standard processor core. Think of it as a fancy ALU: CUDA cores have no registers or instruction decoders of their own; those resources are managed at the multiprocessor level.
As a result, a CUDA core does not “own” a thread in the way that a CPU core “owns” a host thread until the operating system switches it out at the next time slice. On a CUDA device, the cores switch between threads instruction by instruction, in a coordinated fashion, to execute whatever warps are available. There is no switching overhead (all registers live outside the CUDA cores and are statically allocated to each thread for the duration of the kernel), so the architecture encourages running far more threads than there are CUDA cores.
Given that, you should think of scheduling as a two-stage process. When you launch a kernel, you create a set of blocks (each containing a fixed number of threads) that are thrown into a bucket. As multiprocessor resources become available, blocks are sent to multiprocessors for execution, where they stay until completion.

At the multiprocessor level, there is a bucket of available warps drawn from all the blocks currently resident on that multiprocessor. The multiprocessor's scheduler (or schedulers, depending on the hardware) repeatedly pulls a warp from the bucket, sends it to a group of CUDA cores (8 or 16, depending on compute capability) to execute the next instruction for the entire warp of 32 threads, then returns the warp to the bucket. The CUDA cores are pipelined, so in any given clock cycle many warps are at various stages of executing an instruction.
In light of that, neither scenario #1 nor scenario #2 is an accurate description of what is going on.
mykernel<<<1, 1024>>> creates a single block with 1024 threads. That block will be assigned to a single multiprocessor, which will execute its 1024/32 = 32 warps using all the available CUDA cores in that one multiprocessor. The rest of the GPU's multiprocessors will be unused. On a GTX 580, for example, 15/16 of the GPU will sit idle.
mykernel<<<2048, 1>>> creates 2048 blocks with 1 thread each. Depending on the resource usage of each block, one or more blocks will be sent to each multiprocessor; in the most favorable case, 8 blocks per multiprocessor. Again taking the GTX 580, that means 128 blocks are dispatched in the first wave, with the remaining blocks left waiting until slots open up on multiprocessors that finish a block.
On each multiprocessor, the 1 thread per block is scheduled as 1 warp with 31 empty slots. The warps are sent to groups of 16 CUDA cores (on Fermi-class GPUs like the GTX 580), with CUDA cores sitting idle during the clock cycles where they would normally be processing the other 31 threads of a full warp. So unlike scenario #1, in scenario #2 every multiprocessor has something to do, but the CUDA cores within each multiprocessor are doing nothing 31/32 of the time.