I’ve been reading through the Programming Guide for CUDA, and I’m a bit confused about warps. On my system, the warp size is 32, and I have 8 cores per multiprocessor. It says that a whole warp is always executed together in parallel on a single multiprocessor. How can 32 threads execute in true parallel fashion on only 8 cores? Or are 4 threads assigned to each core and there’s actually timeslicing and all that involved?
Correct. This is one of the reasons the programming guide encourages you to have many warps active per multiprocessor. It helps to avoid pipeline hazards if you can fill the pipeline with many threads operating on independent register files.
(Incidentally, this is also why hyperthreading is a win on Intel chips. Two threads per core makes it easier to keep the instruction pipelines full by providing the scheduler with independent instructions. Although the analogy is extremely strained for other reasons, the CUDA situation is kind of like 4-way hyperthreading from a pipeline scheduling point-of-view.)
Oh, thanks Sarnath.
I was not clear about warps from the start.
Let me state my doubt, and with all your help I will clear it up.
Why did they choose 4 clock cycles? It could have been 1 cycle, with only 8 threads executed.
Is it something like: while one thread is waiting, another thread that is ready gets scheduled, to hide the latency?
In that case, why is a warp referred to as "threads executing at the same time"?
Maybe because they can access 32 register files simultaneously…
Actually, if you look at the CUDA occupancy calculator, the register count for a 32-wide block is actually 64 registers instead of 32.
There sure is something there.
Also, each multiprocessor has 8 SP arithmetic units. When multiple warps are issuing arithmetic instructions (which usually take 4 to 16 clock cycles to complete), what about the pressure on these arithmetic units? Does anyone know about it?
Thinking about it, now I understand how this pipelining works and why people are talking about it.
If you look at the programming guide, floating-point division etc. requires almost 36 (32/0.88) clock cycles to complete…
Now, consider the warp-processing cores issuing 8 instructions every clock cycle (32 instructions in 4 clock cycles). So, what if there are floating-point divisions in the instruction stream? Will the warp scheduler wait 36 clock cycles before scheduling the next warp? The answer is no. And that is because of pipelining.
The hardware has a floating-point pipeline and can deliver one-instruction-per-cycle throughput if regularly fed. Now, if this were a 4-stage pipeline, you could easily saturate it with 4 threads executing the same instruction (with different operands, anyway). Since all 4 floating-point instructions have separate operands (register operands are per thread), there won't be any pipeline hazard at all.
Now, I am just guessing that the pipeline is 4-staged. It could have more stages.
The reason the programming guide asks for 192 threads per multiprocessor is that if the warp scheduler is scheduling fewer warps (say only 64 threads), then read-write hazards happen between instructions, because the scheduler has to schedule the same warp again and again. Since the instructions are related to each other (one instruction's output serving as input to another), they have to stall waiting for the previous instruction to move out of the pipeline and commit its results.
Maybe this 192-thread figure is an indication of pipeline depth. Probably the pipeline has a maximum of 24 stages. Thus, when 6 warps are scheduled one after another, probably in round-robin fashion, each warp stuffs 4 instructions onto the pipeline. The same warp is scheduled again after 24 clocks. Thus, if the 4 instructions belonging to those threads have already committed, the next instructions can enter the pipeline without any hazard.
Maybe someone else could validate this or add more facts.
In many other architectures (Power, Intel) division is not pipelined, so for this example I suspect the answer is yes. Because of the lack of documentation on the Nvidia GPUs we need to test for it, but division is a hellish operation from the electronic point of view… since GPUs are architecturally simpler than CPUs, I expect this operation is not pipelined either.
Avoid divisions in your algorithm cores (the optimizer will remove them if possible).
Could some Nvidia guy comment on this?
On the other hand, it is true for MULADD and many of the basic operations. In all processors, the math instructions (floating point for sure; I guess integer too) are executed pipelined. That is, new instructions start being fed, out of order, into execution whenever some ALU is free, where "free" means the first step of the previous instruction has been executed. Typical pipeline lengths on CPUs are 3 to 6 stages.
With the SIMD approach (SSE and Nvidia), you pre-fill your operands into the SIMD registers so that there is no scheduling logic needed: just take what comes next and push it into the first stage at the next clock cycle. The bottleneck becomes filling these registers.
Now, what happens is that, because of architectural limitations, processors with higher clocks usually have longer pipelines (the single steps are electronically simpler and can be executed faster), while lower clocks mean shorter pipelines.
A longer pipeline means you need more logic to guarantee that all the steps are filled (imagine a grid: number of ALUs x number of steps; the more of it is filled, the closer you are to peak performance), and typically this won't happen because of other bottlenecks (memory in first place). So you have processors at different clock speeds, with different peak performance, that perform the same… in the past AMDs always had shorter pipelines, and in fact they performed better. Intel too is now doing a great job in this respect; in the past they worked a lot just to get "high frequency numbers :D", which means nothing from the performance point of view in real apps.
Intel and IBM have invented the hyperthreading concept (IBM… HTX…? I don't remember), but it was not worthwhile in my experience with computing apps: the scheduler was unable to optimize around the memory bottleneck. I have heard rumors that the newer i7/Nehalem architectures are much better (especially because of the newer memory controllers/channels, MUCH faster), but those gains are based on the out-of-order execution core. So I suspect that faster SSE operations won't benefit from it, for FP-intensive algorithms.
In the Nvidia architecture things are much more deterministic: you do not have the out-of-order unit (right, Nvidia guys?) but MANY actual execution units that execute the same instruction on many independent cores (the threads in a half-warp), each one on its own registers, in a pipelined way (if the algorithm allows for it).