Using <<<...>>>

I have several questions on the <<<…>>> format:

  1. Would something like <<<1,1024>>> be a good launch configuration to use?
  2. How does CUDA handle GPUs of different numbers of cores?
  3. Please explain to me what thread blocks are, how to implement them, and why they are useful.
  1. No, you want both thread-level and block-level parallelism to fully utilize the GPU (see the sketch after this list).
  2. Blocks get serialized if there are more blocks than can run concurrently. So a kernel just takes longer to run on a GPU with fewer multiprocessors / cores.
  3. This question essentially boils down to explaining most of CUDA. The Programming Guide does a much better job at this than I ever could. It is essential reading for anyone interested in CUDA.
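For concreteness, here is a minimal sketch of what a launch with both block-level and thread-level parallelism can look like. The kernel name, data, and sizes are made-up placeholders, not anything from this thread:

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread scales one element of the array.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n)                                       // guard the last, partial block
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1 << 20;                           // arbitrary example size
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));

        // Thread-level parallelism: 256 threads (8 full warps) per block.
        // Block-level parallelism: enough blocks to cover n, spread over all
        // of the GPU's multiprocessors.
        const int threadsPerBlock = 256;
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

        cudaDeviceSynchronize();
        cudaFree(d_data);
        return 0;
    }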

I’ve read through the beginning of the guide. Please correct me if I’m wrong:

  1. <<<1, 1024>>> would basically make the function run on one core 1024 times.
  2. <<<2048, 1>>> means if there were 2048 cores on the GPU, each would run the function once, but if there were only 256 cores, each core would run it 8 times.

Also, does each CUDA core execute a single thread, or does each core contain even smaller things inside that execute the threads?

Edit: Oops, clicked wrong button. Time to write an actual post. :)

Edit #2: Despite the unfortunate name “CUDA core”, you should not try to think of a CUDA core as a standard processor core. Think of them as fancy ALUs. CUDA cores do not have exclusive registers or instruction decoders. Those resources are managed at the multiprocessor level.

As a result, a CUDA core does not “own” a thread in the same way that a CPU core “owns” a host thread until the operating system switches it out for another one at the next time slice. On a CUDA device, the cores switch between threads instruction-by-instruction in a coordinated fashion to execute whatever warps are available. There is no overhead in switching (since all the registers are outside the CUDA cores and allocated statically to each thread for the duration of execution), so the architecture encourages the use of far more threads than CUDA cores.

Given that, you should think of the scheduling as a two-stage process. When you launch a kernel, you create a bunch of blocks (each containing a fixed number of threads) that are thrown into a bucket. As multiprocessor resources become available, blocks are sent to them for execution, where they will stay until completion. At the multiprocessor level, there is a bucket of available warps drawn from all the blocks currently running on that multiprocessor. The multiprocessor scheduler (or more than one scheduler, depending on hardware) repeatedly pulls warps out of the bucket and sends them off to a group of CUDA cores (8 or 16, depending on compute capability) to execute the next instruction for the entire warp of 32 threads, before returning the warp to the bucket. The CUDA cores are pipelined, so many warps are at various stages of executing an instruction in any given clock cycle.
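None of this scheduling is visible in source code, but you can at least query the per-device numbers involved. A small sketch (not from the original posts) using the standard cudaGetDeviceProperties call:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0

        printf("Device:              %s\n", prop.name);
        printf("Multiprocessors:     %d\n", prop.multiProcessorCount);
        printf("Warp size:           %d\n", prop.warpSize);
        printf("Max threads / block: %d\n", prop.maxThreadsPerBlock);
        return 0;
    }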

In light of that two-stage scheduling, neither scenario #1 nor scenario #2 is an accurate description of what is going on.

  1. mykernel<<<1, 1024>>> creates a single block with 1024 threads. That block will be assigned to a single multiprocessor, which will execute the 1024/32 = 32 warps in the block using all the available CUDA cores in that one multiprocessor. The rest of the GPU's multiprocessors will be unused. On a GTX 580, for example, 15/16 of the GPU will be idle.

  2. mykernel<<<2048, 1>>> creates 2048 blocks with 1 thread each. Depending on the resource usage of each block, one or more blocks will be sent to each multiprocessor. In the most favorable situation, 8 blocks will be sent to each multiprocessor. Again, in the case of the GTX 580, that would mean 128 blocks sent out in the first wave, with the remaining blocks left unstarted until slots open up on multiprocessors that finish a block.

On each multiprocessor, the 1 thread per block would be scheduled as 1 warp with 31 empty slots. The warps would be sent to groups of 16 CUDA cores (in the case of Fermi-class GPUs, like the GTX 580), with CUDA cores sitting idle during clock cycles where they would normally be processing the other 31 threads in a full warp. So unlike situation #1, in situation #2 all multiprocessors will have something to do, but the CUDA cores within the multiprocessor will be doing nothing 31/32 of the time.
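One common pattern that avoids both pitfalls (it is not something from this thread, just a standard idiom) is a grid-stride loop: launch full blocks, use enough blocks to occupy every multiprocessor, and let each thread step through the data by the total number of threads in the grid. A hedged sketch, with placeholder names and sizes:

    #include <cuda_runtime.h>

    // Hypothetical kernel using a grid-stride loop: every block is made of full
    // 32-thread warps, and each thread processes many elements.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int stride = blockDim.x * gridDim.x;                 // total threads in the grid
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
             i < n;
             i += stride)                                    // jump ahead by the whole grid
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 22;                               // arbitrary example size
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));

        // 256 threads per block (8 full warps), and enough blocks to keep every
        // multiprocessor busy. The exact numbers are illustrative, not prescriptive.
        saxpy<<<128, 256>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }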

Oh, now I understand what tera was saying about thread-level and block-level parallelism. I have another question: what benefits do two- and three-dimensional thread blocks bring?

They are just an organizational tool. Many data-parallel problems naturally organize themselves on a 2D or 3D grid, so 2D or 3D thread and block IDs map more naturally to a multidimensional problem domain. You can do everything with 1D indexing and get the same speed.
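As an illustration of that organizational point, here is a hedged sketch (names and sizes are placeholders) of a 2D block and grid mapped onto a 2D image with dim3; the flat 1D indexing into memory underneath is unchanged:

    #include <cuda_runtime.h>

    // Hypothetical kernel: each thread brightens one pixel of a width x height image.
    __global__ void brighten(float *img, int width, int height, float amount)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (x < width && y < height)
            img[y * width + x] += amount;                // same flat memory as 1D indexing
    }

    int main()
    {
        const int width = 1920, height = 1080;           // arbitrary example image size
        float *d_img;
        cudaMalloc(&d_img, width * height * sizeof(float));

        dim3 block(16, 16);                              // 256 threads arranged as a 16x16 tile
        dim3 grid((width  + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);     // enough tiles to cover the image
        brighten<<<grid, block>>>(d_img, width, height, 0.1f);

        cudaDeviceSynchronize();
        cudaFree(d_img);
        return 0;
    }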

Thanks so much for answering my questions! This has really helped me grasp how CUDA functions and how to program for it.