max number of threads per core

Hello all,

A couple of points are not clear to me; can anyone give me a hand?

I am missing the relation between the warp scheduler and the instruction dispatch unit (their roles seem to overlap). In which document can I find the role of each of the two units?

To hide latency, ideally each core should have many threads resident: when one stalls waiting on a memory fetch, another can be activated. Which part of the device is responsible for deciding which thread gets activated (the instruction dispatch unit, the warp scheduler, or something else)?

What is the max number of concurrent threads running on a single core? (I can’t find this number in the compute capability tables, but I can find the “max number of warps per streaming multiprocessor”, which is not exactly the same thing.)
The only resource where I found this quantity stated explicitly is the following (it was 48 for the card described there):
https://www3.nd.edu/~zxu2/acms60212-40212/Lec-11-GPU.pdf

thank you very much
P

I don’t think there is a document that defines these units in terms of a specification. The closest would probably be the whitepapers that are written to describe each architecture. Google is your friend. “fermi whitepaper” “kepler whitepaper” “pascal whitepaper” etc. There were some naming convention changes along the way but they are easy to find.

The warp scheduler makes that determination.

I think this is a bad way to think about it. A GPU core is really not conceptually the same as a CPU core. A CPU core contains all the resources to sustain a C or C++ thread of execution, including things like instruction fetch, dispatch, register files, execution units for various instruction types (e.g. ALU, etc.) and other resources.

The closest thing on a GPU to that description, in my opinion, is the SM, which is why you can readily find that number discussed in relation to an SM (and even specified that way by NVIDIA).
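For example (a minimal sketch using the CUDA runtime API; the exact numbers printed depend entirely on your device), you can query the per-SM limits directly rather than hunting for them in slides:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // Max resident threads per SM; dividing by the warp size (32)
    // recovers the "max warps per SM" figure from the docs.
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max warps per SM:   %d\n",
           prop.maxThreadsPerMultiProcessor / prop.warpSize);
    return 0;
}
```

On a cc 2.x (Fermi) device this reports 1536 threads per SM, i.e. 48 warps per SM, which matches the tables.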

A GPU core, on the other hand, is more closely related to an ALU. Its precise definition is a 32-bit floating-point multiply-add unit. It doesn’t support other instruction types, it does not have a register file of its own, etc.
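To make that concrete, the one operation such a core does execute is the FP32 fused multiply-add. A trivial kernel whose inner operation maps to that unit might look like this (purely illustrative, not from any particular document):

```cuda
__global__ void fma_kernel(const float *a, const float *b,
                           const float *c, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = fmaf(a[i], b[i], c[i]);  // compiles to a single FFMA,
                                          // the operation a "CUDA core" performs
}
```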

So I think your question is not really applicable to GPU cores. If I were forced, I would try to deduce the pipeline depth of the core, and report that. But again, I think this line of reasoning or comparison is illogical.

GPUs and CPUs really are different, and I advise those who are learning to be careful about applying concepts from one domain rigidly to the other domain.