Significance of Multiprocessor Cores


The number of GPU cores is often quoted as multiprocessors times cores, but I am confused about its relevance. The CUDA documentation seems to talk about multiprocessors, blocks and threads, but never cores. Where do they fit in?

For example, the throughput (section 5.4.1 in the CUDA programming guide) is given per multiprocessor, so having more cores doesn’t seem to speed things up. Or is this actually the throughput per core? Should I start at least as many blocks as there are multiprocessors, or as there are cores?


Unlike conventional CPU cores, Nvidia uses the word “core” more in the sense of what would conventionally be called an ALU (arithmetic-logic unit) or FPU (floating-point unit). The main significance is that (so far) the maximum arithmetic throughput is easily calculated as the number of cores times cycles per second times 2 (as each core can do one multiply-add per cycle, which counts as two arithmetic operations).

Having more cores leads to higher throughput, as the same instruction is executed for more threads at once (or, in some cases, more instructions from the same number of threads are executed in parallel).

You should start a lot more threads than there are either multiprocessors or cores. This allows the hardware to hide latencies. As a rule of thumb, using 24 times as many threads as there are cores completely hides instruction latencies (the latency of each instruction is about 20…24 cycles, and each core executes one instruction per cycle).

Makes sense, thanks!