What is the difference between SP and CUDA core?

Hi all,

As we know, the GTX 1070 contains 1920 CUDA cores and 15 streaming multiprocessors, so each SM has 128 CUDA cores. However, according to NVIDIA's ‘CUDA_C_Programming_Guide’, the maximum number of resident threads per multiprocessor should be 2048.
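These figures can be checked at runtime with cudaGetDeviceProperties (the CUDA-core count per SM is not reported directly by the runtime API and has to be looked up per architecture); a minimal sketch, assuming a single CUDA-capable device:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // query device 0
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device:                      %s\n", prop.name);
    std::printf("Multiprocessors (SMs):       %d\n", prop.multiProcessorCount);
    std::printf("Max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Warp size:                   %d\n", prop.warpSize);
    return 0;
}
```

On a GTX 1070 this should report 15 SMs and 2048 resident threads per SM (i.e. 64 resident warps).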

Does it mean that one CUDA core contains 16 resident threads, so a CUDA core is like 16 SPs combined?

If so, is communication between threads on different CUDA cores different from communication between threads on the same CUDA core?

best regards

The most commonly used meaning of “core” is identical to the most commonly used meaning of SP (streaming processor) - they both refer to the functional units that support the single precision floating point add, multiply, and multiply-add instructions.

It’s not correct to associate a thread of execution with a particular CUDA core. That’s not how the GPU works. A GPU SM includes a collection of functional units that each support different types of instructions. For example, the LD/ST unit (load-store unit) supports LD and ST instructions. If a particular thread of execution has an LD instruction in it, that LD instruction will be issued to an LD/ST unit, not a CUDA core, and not an SP given the above commonly used definitions. Therefore threads are not uniquely associated with cores or SPs. In this sense the usage of the word “core” in typical GPU terminology is quite different from the typical usage in CPU terminology. Therefore understanding GPU thread-level execution requires that you divorce any notion of a thread of execution being associated with a particular core.
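As an illustration (a minimal sketch, not tied to any particular GPU; the kernel and file names are just placeholders), a trivial kernel whose threads load operands, perform a single-precision multiply-add, and store the result will have all of those instruction types in a single thread’s instruction stream, issued to different functional units:

```cpp
#include <cuda_runtime.h>

// Each thread loads its operands (LD/ST units), performs a single-precision
// multiply-add (FP32 units, i.e. "CUDA cores"/SPs), and stores the result
// (LD/ST units again). One thread, several kinds of functional units.
__global__ void fma_kernel(const float* __restrict__ a,
                           const float* __restrict__ b,
                           float* __restrict__ c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] * b[i] + c[i];  // typically compiles to LDG + FFMA + STG
    }
}
```

Compiling with something like `nvcc -arch=sm_61 -c kernel.cu` and dumping the machine code with `cuobjdump -sass kernel.o` should show the LDG, FFMA, and STG instructions in that one kernel.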

3 Likes

Thank you so much! @Robert_Crovella

So the number of CUDA cores and other functional units only determines the maximum number of ACTIVE warps; it is the number of registers and the amount of shared memory (and maybe other resources) that determines the actual number of resident warps for one multiprocessor, right?

And if so, what is it that determines the maximum number (64 for the GTX 1070) of RESIDENT warps per multiprocessor?

best regards

The maximum number of resident warps is a hardware limit. That’s why it is presented that way in the table. If you’re asking for some unpublished detail of the design of the SM that gives rise to that limit, I don’t have that info to share. The number of resident warps for a particular code will be a function of that code design against various hardware limits (such as registers per thread vs. maximum number of registers per SM). If none of the other limiting factors come into play, then the code should be able to achieve the maximum stated limit. Active is a new term you’ve brought into the discussion just now, so we’d have to carefully define that first.
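For a concrete kernel, the runtime’s occupancy API reports how many blocks (and hence warps) can actually be resident per SM once register and shared-memory usage are taken into account; a minimal sketch, where dummy_kernel is only a placeholder:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; its register/shared-memory usage is what the
// occupancy query below is based on.
__global__ void dummy_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSize = 256;  // 8 warps per block
    int maxBlocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxBlocksPerSM, dummy_kernel, blockSize, /*dynamicSMemSize=*/0);

    int residentWarps = maxBlocksPerSM * blockSize / prop.warpSize;
    int maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    std::printf("Resident warps per SM for this kernel: %d (hardware limit: %d)\n",
                residentWarps, maxWarps);
    return 0;
}
```

If the kernel’s register and shared-memory usage are modest, the reported value should reach the hardware limit (64 warps per SM on a GTX 1070); otherwise it will be lower.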

Thank you! @Robert_Crovella

Yes I was asking for the details of the design.

By ‘active warps’ I meant the warps that are executing. Because a multiprocessor only has 4 warp schedulers, at most 4 warps are executing in any clock cycle.

best regards

I won’t be able to share non-public details of GPU design.

All execution units are pipelined, which means that in any clock cycle many warps (more than 4) may be in various stages of execution. I think you are talking about “issued” warps. Even with “issued” warps, some GPUs have dual-issue warp schedulers, so some GPUs (Kepler comes to mind) can issue more than 4 warps in a clock cycle.

1 Like

Now I understand. Thank you for your help! @Robert_Crovella