Questions about task scheduling on NVIDIA GPUs

I have some questions about task scheduling on NVIDIA GPUs.
(1) If one warp in a block (CTA) has finished while other warps in the same block are still running, does that warp wait for the others to finish? In other words, are the resources of a block (CTA) released only when all of its threads have finished? I think this should be the case, since threads in a block share the shared memory and other resources, and these resources are allocated and managed at CTA granularity.
(2) If all threads in a block (CTA) are stalled on a long-latency operation such as a global memory access, will the threads of a new CTA take over the resources, the way a CPU would switch tasks? In other words, once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it hold its resources until it finishes?
I would appreciate it if someone could recommend books or articles about the architecture of modern GPUs. Thanks!

  1. Some resources, such as shared memory, are allocated per CTA, and they will definitely be freed when all warps have finished. Other resources (such as registers) are allocated per warp, so in theory they can be freed earlier. I don’t know whether that is actually implemented.

  2. What you are asking about is thread preemption, i.e. “swapping” running threads out of an SM in order to release resources for other threads. AFAIK, it was advertised for the latest NVIDIA GPUs, but I am not sure whether it was really implemented.

  3. I would be glad to see that too. You can read through the links I collected, and especially the books, but I doubt that you will find a single comprehensive description.

The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, …). Thread block level resources such as shared memory are allocated. The allocation creates enough warps for all threads in the thread block. The resource manager allocates warps round robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context switch restore the warp will be restored to the same SM with the same warp-id.
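The admission check and the round-robin placement described above can be sketched as a small host-side model. This is only an illustration under stated assumptions: the names `SMResources`, `CTA`, `fits` and `place_warps` are hypothetical, the resource limits are made-up round numbers rather than figures for any particular GPU, and real hardware allocates registers with a granularity that this sketch ignores.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass
class SMResources:
    """Free resources on one SM. All default values are illustrative assumptions."""
    shared_mem_bytes: int = 64 * 1024
    registers: int = 64 * 1024
    warp_slots: int = 64
    sub_partitions: int = 4


@dataclass
class CTA:
    threads: int
    regs_per_thread: int
    shared_mem_bytes: int

    @property
    def warps(self) -> int:
        # Round up: a partially filled warp still occupies a whole warp slot.
        return (self.threads + WARP_SIZE - 1) // WARP_SIZE


def fits(sm: SMResources, cta: CTA) -> bool:
    """Admission check: the CTA is scheduled only if every resource suffices."""
    regs_needed = cta.warps * WARP_SIZE * cta.regs_per_thread
    return (cta.shared_mem_bytes <= sm.shared_mem_bytes
            and cta.warps <= sm.warp_slots
            and regs_needed <= sm.registers)


def place_warps(sm: SMResources, cta: CTA, next_partition: int = 0) -> list:
    """Assign each warp of the CTA to a sub-partition in round-robin order."""
    return [(next_partition + w) % sm.sub_partitions for w in range(cta.warps)]


sm = SMResources()
cta = CTA(threads=320, regs_per_thread=32, shared_mem_bytes=16 * 1024)
print(fits(sm, cta))         # True
print(place_warps(sm, cta))  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```

With 320 threads the CTA needs 10 warps, so in this model two of the four sub-partitions receive three warps and the other two receive two.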

When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp-level resources, which include the warp-id and register file.

When all warps in a thread block complete then block level resources are released and the SM notifies the Compute Work Distributor that the block has completed.
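The two release points (warp level first, block level only after the last warp) can be modelled in a few lines. This is a toy bookkeeping sketch, not a description of the actual hardware resource manager; the class `BlockState` and its methods are hypothetical names chosen for illustration.

```python
class BlockState:
    """Tracks which resources a resident thread block still holds (toy model)."""

    def __init__(self, num_warps, regs_per_warp, shared_mem_bytes):
        self.live_warps = set(range(num_warps))
        self.regs_per_warp = regs_per_warp
        self.shared_mem_bytes = shared_mem_bytes  # block-level resource
        self.block_released = False

    def registers_held(self):
        # Warp-level resource: only warps that are still live hold registers.
        return len(self.live_warps) * self.regs_per_warp

    def shared_mem_held(self):
        # Block-level resource: held until the whole block has completed.
        return 0 if self.block_released else self.shared_mem_bytes

    def warp_finished(self, warp_id):
        """Free the warp's resources; free block resources after the last warp."""
        self.live_warps.discard(warp_id)
        if not self.live_warps:
            # At this point the SM would notify the Compute Work Distributor.
            self.block_released = True
        return self.block_released


block = BlockState(num_warps=3, regs_per_warp=1024, shared_mem_bytes=16384)
block.warp_finished(0)
print(block.registers_held(), block.shared_mem_held())  # 2048 16384
block.warp_finished(1)
block.warp_finished(2)
print(block.registers_held(), block.shared_mem_held())  # 0 0
```

After the first warp finishes, its registers are back in the pool while the shared memory stays allocated; only the completion of the last warp frees the block-level allocation.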

Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.
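A very small cycle-level simulation can illustrate how switching between warps hides latency. It is a sketch under strong simplifying assumptions that are not taken from any NVIDIA documentation: a single scheduler that issues at most one instruction per cycle (no dual-issue), every instruction depending on the previous one from the same warp, a fixed `latency` for every instruction, and a fixed lowest-id-first priority. The function name `simulate` is hypothetical.

```python
def simulate(num_warps, instrs_per_warp, latency):
    """Return (cycles, issue-slot utilization) for one single-issue warp scheduler.

    Toy model: each instruction depends on the previous one of the same warp,
    so a warp is stalled for `latency` cycles after each issue.
    """
    ready_at = [0] * num_warps                 # cycle at which a warp is eligible again
    remaining = [instrs_per_warp] * num_warps  # instructions left per warp
    cycle = issued = 0
    while any(remaining):
        # Eligible warps: still have work and are not stalled this cycle.
        eligible = [w for w in range(num_warps)
                    if remaining[w] and ready_at[w] <= cycle]
        if eligible:
            w = eligible[0]                    # assumed priority: lowest warp id
            remaining[w] -= 1
            ready_at[w] = cycle + latency
            issued += 1
        cycle += 1                             # idle cycle if nothing was eligible
    return cycle, issued / cycle


for n in (1, 2, 4):
    cycles, util = simulate(n, instrs_per_warp=4, latency=4)
    print(n, cycles, round(util, 2))
# 1 13 0.31
# 2 14 0.57
# 4 16 1.0
```

In this model four warps are enough to cover a latency of four cycles, so the scheduler issues on every cycle. With fewer active warps the same latency leaves idle issue slots, which is the effect to expect when the warps of a block finish one by one.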

This answer does not use the term CUDA core, as this introduces an incorrect mental model. CUDA cores are pipelined single precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM has other execution units including load/store units, double precision floating point units, half precision floating point units, branch units, etc.

Thanks for your reply.
From your answer, I understand that an SM has two levels of resources, warp level and thread block (CTA) level, and that warp-level resources can be released once all threads in the warp have finished.
But I still do not understand the benefit of this approach, since the SM schedules work in units of CTAs. Even if a warp’s resources have been released, it means nothing, because a warp from a different CTA still cannot issue instructions while CTA-level resources are lacking.
I am also not sure whether the following is right. For example, if a CTA has 10 warps, then once one warp has finished only 9 warps remain to be scheduled. As warps finish one by one, fewer and fewer warps of the CTA are left to schedule. Since warp switching is used to hide latency, the fewer warps there are to schedule, the higher the probability that the SM sub-partition stalls when threads hit a long-latency operation.
And for a CTA: once a block (CTA) has been dispatched to an SM (streaming multiprocessor), does it hold its resources until it finishes? Or can a CTA be replaced by another CTA when all of its threads are waiting on long-latency operations
(this situation may not arise frequently, but I think it will be common when there are only one or two active warps left in the CTA)?

Keep in mind that one design goal for GPUs is to keep the control structures small and simple, and then re-dedicate the freed up silicon real estate to additional execution units. One way of achieving that is to use coarse-grained control (one set of controls for multiple threads, or multiple warps), rather than the fine-grained control typically used in CPUs.

The end result is that, compared to CPUs, execution and resource controls in GPUs lack some flexibility, which can lead to low efficiency and poor resource utilization in some cases. However, for many common use cases these coarser controls are sufficient, and the massive increase in raw computational horsepower from the plethora of additional execution units more than makes up for occasionally lower efficiency.

Oh, thanks for your reply. I take what you said to mean that my assumptions are right, though you did not say so directly.
Since I need to write a detailed description of GPU architecture for an article, not just program in CUDA, it is really important for me to verify these assumptions. I have tried to find the answer in the “CUDA C PROGRAMMING GUIDE”, but it gives only a brief introduction to GPU architecture.

Depending on what you mean by “detailed”, that may not be possible because NVIDIA does not make detailed descriptions of their GPU architectures publicly available. This is presumably motivated by the highly competitive nature of the GPU business: companies don’t want to give away their “secret sauce”.

What is made public in terms of architectural descriptions is what is needed to program in CUDA, at the level of Greg’s description above.

Various people have reverse engineered details of the GPU instruction set, and devised microbenchmarks to determine the properties of various structures in the GPU.