I’m a newbie to CUDA. The relationship between warps and cores confuses me. From the hardware angle, the C1060, for example, has 30 MPs (multiprocessors) and each MP has 8 cores. However, the 240 cores do not seem able to work at the same pace. At runtime, threads are divided into groups, and each group (warp) includes 32 threads which run together. Each MP (only 8 cores) can have as many as 32 warps, i.e., 1024 threads (!). There seems to be no way that 1024 threads can run on only 8 cores at the same time. As mentioned in the programming guide, “To execute an instruction for all threads of a warp, the warp scheduler must therefore issue the instruction over: 4 clock cycles for an integer or single-precision floating-point arithmetic instruction …” So even for the simplest integer case, a warp (only 32 threads) is split into 4 quarter-warps to run.

To check how threads run together, I wrote the following code on a C1060 under 64-bit Ubuntu:
#include <stdio.h>

// Programming guide: when clock() is executed in device code, it returns the
// value of a per-multiprocessor counter that is incremented every clock cycle.
__global__ void clocktest(int* start, int* end)
{
    int i = threadIdx.x;
    start[i] = clock();
    end[i] = 1;          // dummy write between the two clock() reads
    end[i] = clock();
}

int main()
{
    int N = 128, i;
    size_t size = N * sizeof(int);

    // device buffers for the per-thread start/end timestamps
    int *dStart, *dEnd;
    cudaMalloc(&dStart, size);
    cudaMalloc(&dEnd, size);

    clocktest<<<1, N>>>(dStart, dEnd);

    // copy the timestamps back to the host
    int* hStart = (int*)malloc(size);
    int* hEnd = (int*)malloc(size);
    cudaMemcpy(hStart, dStart, size, cudaMemcpyDeviceToHost);
    cudaMemcpy(hEnd, dEnd, size, cudaMemcpyDeviceToHost);
    cudaFree(dStart);
    cudaFree(dEnd);

    // print lane-in-warp, start, end, and elapsed cycles for each thread
    for (i = 0; i < N; i++)
        printf("%d %d %d %d\n", i % 32, hStart[i], hEnd[i], hEnd[i] - hStart[i]);

    free(hStart);
    free(hEnd);
    return 0;
}
and got the following result:
So obviously the 32 threads in each warp did start and end at the same time, although the 4 warps ran at different times. But how was this done? How are threads spread across the cores? Do many threads run on one core concurrently? What about the quarter-warps (the result didn’t show any quarter-warp)? Is anything wrong in my understanding of CUDA? Thanks for any comments.
clock() probably reads the cycle counter only once for all threads of a warp, similarly to how values from shared or global memory are multicast to all threads of a half-warp.
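One way to check this hypothesis is to compare the start stamps within each warp on the host side. A minimal, untested sketch; it assumes it is dropped into the main() of the program above, right after the cudaMemcpy calls, with hStart and N as defined there:

    // hypothetical addition to main() above: if clock() is read once per warp,
    // all 32 start stamps within a warp should be identical
    for (i = 0; i < N; i += 32) {
        int same = 1, j;
        for (j = 1; j < 32 && i + j < N; j++)
            if (hStart[i + j] != hStart[i])
                same = 0;
        printf("warp %d: start stamps %s\n", i / 32, same ? "identical" : "differ");
    }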
Yes, as the programming guide hinted, clock() may not be the proper tool for tracing the time of each thread. But is there any tool that could do this job or be useful for the questions above?
What exactly is the question you want to answer? As you see from the output of your little program, the warps take turns at being executed on the cores. On 1.x devices usually a new instruction from one warp starts execution every four cycles. As the execution is heavily pipelined, several instructions from multiple warps will be in different stages of execution at any time.
Sorry, let me clarify my question. On, say, a dual-core CPU system, one arithmetic-intensive program uses 50% CPU and two use 100%, as shown in the system monitor. One could run more than 2 computation-intensive programs, but they would not run as efficiently as 2 programs do (each program would be slower). That is to say, 2 computation-intensive programs fully exploit the computation power of a dual-core system. So this is my simple mental model of parallel computation: one thread uses one core, at least for arithmetic-intensive programs.
If GPUs obey the same rule, then the 240 cores of a C1060 could run at most 240 threads CONCURRENTLY. However, the documents say 1024 threads can be resident on one MP, i.e., 1024*30 (=30720) threads can be resident on one C1060. But I don’t think 30720 threads run concurrently, or that 30720 instructions (the same or not) execute every cycle. So, similar to the CPU case, how many threads could run on the 240 cores while fully using the GPU’s power? Or how many instructions can run in one cycle on a C1060? 240? Or something else? Why? Or, due to the limitations of the warp schedulers on 1.x devices, can the computation power never be fully used (less than 240)?
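To put my guess into numbers (just my own arithmetic from the figures above, not something taken from the documents; it ignores the SFUs and any dual-issue, so the real peak arithmetic rate could be somewhat higher):

    #include <stdio.h>

    // rough peak instruction throughput of a C1060 (GT200), using only the
    // numbers mentioned in this thread; SFUs and dual-issue are ignored
    int main(void)
    {
        int mps          = 30;   // multiprocessors on a C1060
        int warp_size    = 32;   // threads per warp
        int issue_cycles = 4;    // cycles to issue one warp instruction on 1.x hardware

        int per_mp = warp_size / issue_cycles;   // = 8, i.e. one per SP core per cycle
        printf("%d thread-instructions per cycle per MP, %d for the whole GPU\n",
               per_mp, per_mp * mps);
        return 0;
    }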
Actually, when I modified the number of threads from 128 to 256 or 512 in the small program, the average time elapsed per thread (in cycles; averaging the 4th column of the output) increased a lot (1 : 1.75 : 5.19). So the threads are running slower. Results for 32/64/128 threads showed no significant difference. Does this hint that 128 threads is (or is close to) the upper limit?
I thought the 32 threads in a warp would actually take 32 cores to run concurrently (using my mental model of parallelism). The documents seem to say something totally different, however. So I titled the post “warp and core”. The answer to this question may be very helpful for designing programs that maximize usage of the GPU. Thank you.
I think by ‘concurrency’ you mean in parallel. Multiple threads can’t run on the same core in parallel, but they can run concurrently (by context switching every time slice or so). In a GPU, each MP has a warp scheduler that context-switches between warps at every instruction (one every 4 cycles, so the 8 cores can process all 32 threads in a warp). This is to hide memory access latencies: if one warp issues a load and must wait hundreds of cycles, the MP can avoid idling by processing the other warps first. With 1024 threads per MP, that’s 32 warps, which gives the scheduler plenty of room to find TLP (thread-level parallelism) and avoid memory or data dependency stalls.
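If you want to see that latency hiding at work, here is a rough experiment you could try (an untested sketch; copy_kernel and the launch configurations are made up for illustration, and the exact timings will depend on your card): launch one block per MP and increase the number of warps per block; the more resident warps, the more of the global-memory latency gets hidden.

    #include <stdio.h>

    // grid-stride copy: the total amount of work is fixed, independent of the
    // launch configuration, so only the number of resident warps per MP changes
    __global__ void copy_kernel(const float* in, float* out, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            out[i] = in[i];
    }

    int main(void)
    {
        const int n = 1 << 22;
        size_t bytes = n * sizeof(float);
        float *dIn, *dOut;
        cudaMalloc(&dIn, bytes);
        cudaMalloc(&dOut, bytes);

        int sizes[] = {32, 64, 128, 192, 256, 512};   // 1 to 16 warps per block
        for (int s = 0; s < 6; s++) {
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);

            cudaEventRecord(t0, 0);
            copy_kernel<<<30, sizes[s]>>>(dIn, dOut, n);   // one block per MP on a C1060
            cudaEventRecord(t1, 0);
            cudaEventSynchronize(t1);

            float ms;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("%3d threads/block: %.3f ms\n", sizes[s], ms);

            cudaEventDestroy(t0);
            cudaEventDestroy(t1);
        }

        cudaFree(dIn);
        cudaFree(dOut);
        return 0;
    }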
So (1) at the low level, the 32 threads in a warp run on the same MP, and the warp scheduler cycles through them over 4 cycles to run 32 threads on 8 cores. Even if an idle MP exists, the GPU won’t split one warp and put, say, a half-warp onto the idle MP to increase running speed.
(2) At a higher level, the MP switches warps to avoid waiting for warps stalled on slow memory accesses. But if, say, I have 2 warps and 2 idle MPs, the GPU won’t (and has no need to) switch warps, i.e., the 2 warps run in parallel on the 2 MPs.
(3) Again from the low-level view, one core in any cycle still runs one instruction, as a CPU core does. So for arithmetic-intensive programs, e.g., those that mostly use registers (not slow global memory) to calculate, running more than 8 threads on one MP will not be significantly more efficient than running only 8 threads on one MP. That is to say, 8 arithmetic-intensive threads saturate the computation power of an MP. Or, if there were a usage monitor for an MP, it would show the MP is 100% occupied.
Almost. It might be easier to think of CUDA as a 32-wide SIMD architecture. Each instruction gets repeated 4 times to process the 32 SIMD registers using 8 physical cores (on a GT200). On Fermi, there are 32 cores per MP and instruction issue and retirement work differently. Of course, a CUDA thread has more state and can do things a simple op on a SIMD register cannot. But all of this happens at the instruction decoder level, not in the MP scheduler. The latter only decides how warps should be pipelined for instruction issue, rather than anything lower level about how the instruction is executed and retired.
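As a purely illustrative toy model of that idea (this is plain CPU code just to picture the issue pattern, not how the hardware is actually implemented):

    #include <stdio.h>

    #define WARP_SIZE    32
    #define CORES_PER_MP 8

    // toy model: one "warp instruction" on a GT200-style MP, executed as
    // 4 passes of 8 lanes over the 8 physical cores
    int main(void)
    {
        int regs[WARP_SIZE];
        for (int lane = 0; lane < WARP_SIZE; lane++)
            regs[lane] = lane;

        // the decoder repeats the same operation 4 times,
        // each pass covering 8 of the 32 lanes
        for (int pass = 0; pass < WARP_SIZE / CORES_PER_MP; pass++) {
            for (int core = 0; core < CORES_PER_MP; core++) {
                int lane = pass * CORES_PER_MP + core;
                regs[lane] = regs[lane] * 2 + 1;   // the "SIMD" operation
            }
            printf("pass %d: lanes %d-%d done\n", pass,
                   pass * CORES_PER_MP, pass * CORES_PER_MP + CORES_PER_MP - 1);
        }
        return 0;
    }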
The first part is correct, the second part is not. MPs are issued blocks, not warps. The scheduler inside each MP breaks the blocks into warps and schedules them to run. A block only ever runs on one MP, and its warps likewise only run on the MP the block was assigned to. The level of GPU scheduling above an individual MP only deals with blocks, not warps.
Again, the first sentence is right, the rest is not. Each MP issues an instruction from one warp to its cores at a time. A warp is always 32 threads. If you have fewer than 32 threads on an MP (i.e., fewer than 32 threads per block), the warp is padded with dummy threads, and you waste cycles. 8 threads per MP would waste about 95% of the cycles of a given MP. The reason more than 75% goes idle is that there is an instruction pipeline (about 21 cycles deep on the GT200, IIRC), so there is pipeline latency to worry about. For the GT200, you need to have 192 active threads per block to completely hide the instruction pipeline latency.
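To put rough numbers on that 95% figure (just back-of-the-envelope arithmetic; the ~24-cycle re-issue latency below is an assumption based on the 21-24 cycle pipeline depths quoted in this thread):

    #include <stdio.h>

    // rough estimate of how much of one GT200 MP a single block of 8 threads
    // can use, under the assumptions stated above
    int main(void)
    {
        double lane_use  = 8.0 / 32.0;   // 24 of the 32 warp lanes are dummy threads
        double issue_use = 4.0 / 24.0;   // a lone warp with dependent instructions
                                         // can only issue once per ~24 cycles
        double total = lane_use * issue_use;
        printf("approx. MP utilization: %.1f%% -> about %.0f%% of cycles wasted\n",
               100.0 * total, 100.0 * (1.0 - total));
        return 0;
    }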
(1) is correct; for compute capability 1.x devices the warp scheduler selects a new warp to run every four cycles.
(2) is correct; (so far) the GPU never moves running warps to a different MP. Note that warps of the same block have to run on the same MP, as they share the shared memory of the MP.
(3) is not correct; 8 threads never saturate an MP.
Firstly, on 1.x devices a warp always takes 4 cycles, even if some of its threads are unused so that fewer cycles would theoretically suffice.
Secondly, execution is highly pipelined, which means that an instruction actually takes 22 to 24 cycles until its results are available, even though every 4 cycles a new instruction of a warp starts execution (8 threads every cycle for 32 threads per warp).
So if every instruction of a thread depends on the result of the previous instruction, 5 other warps can start execution of an instruction before the next instruction of the original warp is ready to run. If no other warps are available, these 5 slots are wasted. For this reason, you want at least 6 warps (192 threads) running per MP, unless you have specific code where no instruction depends on a result from the preceding instruction.
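An experiment in the spirit of your clocktest program can make this visible (an untested sketch; dep_chain and the numbers are only illustrative, and in practice loop overhead and compiler unrolling will blur the picture somewhat): each thread runs a long chain of dependent multiply-adds, and you time the kernel for increasing block sizes. Up to roughly 6 warps (192 threads) the time should stay almost flat, because the extra warps only fill issue slots that would otherwise be wasted; beyond that it grows with the thread count.

    #include <stdio.h>

    // every operation in the loop depends on the result of the previous one,
    // so a single warp leaves most issue slots of the MP empty
    __global__ void dep_chain(float* out, int iters)
    {
        float x = threadIdx.x * 0.5f + 1.0f;
        for (int i = 0; i < iters; i++)
            x = x * 1.000001f + 0.000001f;   // dependent multiply-add chain
        out[threadIdx.x] = x;                // keep the compiler from dropping the loop
    }

    int main(void)
    {
        const int iters = 1 << 20;
        float* dOut;
        cudaMalloc(&dOut, 512 * sizeof(float));

        int sizes[] = {32, 64, 96, 128, 160, 192, 256, 384, 512};
        for (int s = 0; s < 9; s++) {
            cudaEvent_t t0, t1;
            cudaEventCreate(&t0);
            cudaEventCreate(&t1);

            cudaEventRecord(t0, 0);
            dep_chain<<<1, sizes[s]>>>(dOut, iters);   // a single block on one MP
            cudaEventRecord(t1, 0);
            cudaEventSynchronize(t1);

            float ms;
            cudaEventElapsedTime(&ms, t0, t1);
            printf("%3d threads: %.3f ms\n", sizes[s], ms);

            cudaEventDestroy(t0);
            cudaEventDestroy(t1);
        }

        cudaFree(dOut);
        return 0;
    }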
EDIT: avidday has already given a more detailed answer - I should have reloaded before typing instead of wasting my time.
tera, you did not waste your time. You explained it from a different angle, which makes it easier to understand. Thank you all for the detailed explanations.
(1) For instruction pipeline latency hiding, below is my current understanding. Is it correct?
Assume that 6 warps (192 threads) each run one instruction on an MP of a Tesla C1060, and this instruction takes 24 cycles to finish.
Cycle 1: 8 threads from, say, warp 1 start this instruction on 8 cores.
Cycle 2: Another 8 threads from warp 1 start the same instruction (the first 8 threads stand by).
Cycle 3/4: The third/fourth 8 threads from warp 1 start this instruction. Now all 32 threads in warp 1 have started this instruction.
Cycle 5-8: Similarly, 32 threads from warp 2 start this instruction (warp 1 stands by).
…
Cycle 21-24: 32 threads from warp 6 start this instruction (warps 1-5 stand by). Now the first cycle of this instruction has finished in all 6 warps.
Cycle 25: 8 threads from warp 1 continue this instruction (process its second cycle) (the other warps stand by).
Cycle 26-28: The remaining 24 threads (evenly split into 3 groups) from warp 1 also process (in 3 cycles) the second cycle of this instruction.
…
Cycle 45-48: Similarly, 32 threads from warp 6 process the second cycle of this instruction. Now the second cycle of this instruction has finished in all 6 warps.
…
(This 24-cycle process repeats until)
Cycle 573-576: 32 threads from warp 6 process the 24th cycle of this instruction. Now this instruction has completely finished in all 6 warps.
Then the next instruction starts…
(2) It seems the details of these CUDA mechanisms don’t appear in the Programming Guide or the Best Practices Guide from the developer download page (CUDA Toolkit 11.7 Update 1 Downloads | NVIDIA Developer). Is there any literature available for these behind-the-scenes topics?
There is dedicated hardware (stages of the pipeline) to perform each step of the instruction execution in parallel. So in the second cycle, when the second set of 8 threads from the first warp starts execution, the first set of 8 threads is already executing the second step in the second pipeline stage.
So during cycle 5, when the first 8 threads from the second warp start execution, you also have 8 threads from the first warp in each of the first to fourth stage of the pipeline.
In cycle 25, the first instruction is finished for the first 8 threads of the first warp, and the second instruction can start execution.
After 48(+3) cycles the first instruction is executed for all six warps.
Actually I think the pipeline is only 22 stages, so after 48+1 cycles the first instruction from each of the 6 warps has finished. But since this is not a multiple of 4 (and since we don’t know the exact internal implementation), that is not visible from the outside.
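If it helps, the timeline above can be written down as a tiny simulation (a sketch under the stated assumptions: one quarter-warp of 8 threads is issued per cycle, warps are issued round-robin, and a result is available PIPELINE_DEPTH cycles after issue; the real depth and retirement rules are not publicly documented):

    #include <stdio.h>

    #define WARPS           6
    #define GROUPS_PER_WARP 4    /* 32 threads issued as 4 groups of 8 */
    #define PIPELINE_DEPTH  24   /* assumed; it may well be closer to 22 */

    int main(void)
    {
        int cycle = 0, last_done = 0;
        for (int w = 1; w <= WARPS; w++) {
            for (int g = 0; g < GROUPS_PER_WARP; g++) {
                cycle++;                             /* one quarter-warp issued per cycle */
                int done = cycle + PIPELINE_DEPTH;   /* result available after the pipe   */
                if (done > last_done)
                    last_done = done;
            }
            printf("warp %d fully issued by cycle %d\n", w, cycle);
        }
        printf("first instruction done for all %d warps by cycle %d\n",
               WARPS, last_done);
        return 0;
    }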
The dedicated hardware (pipeline stages) is very interesting since it greatly increases the computation power (in the above example, reducing the cycle count from 576 to 48+1). I have some questions about it: (1) Is this hardware integrated into the cores or separate? (2) What is the maximal number of threads that can use the hardware in parallel on one MP? (3) Local memory is used within each thread; shared memory is used in the scope of a block. Is there any scope or limitation for this hardware?
The answer to (1) is that the scheduler is at the MP level, not the individual cores. The answers to (2) and (3) are in Appendix G of the current programming guide for all hardware revisions.
Seeing as you are asking about the GT200, you might also want to read this article from a couple of years ago.