single GPU core vs. single CPU core

Hi.

What is the difference between a single CPU core and a single GPU core?

I am a newbie in this area. Can you recommend some materials to study or watch to learn more?

I only know a bit about each one being optimized for throughput or latency, which is always discussed in tutorials.

Is the difference in the instruction set? What does the triangle mean in the GPU context?

The answer greatly depends on how much you know about CPU cores.

First, the “thousands of GPU cores” in modern GPUs are a marketing lie - they call each ALU a “GPU core”. In CPU terminology, an SM/CU is more like a module (as in Ryzen or Bulldozer) combining 2-4 real cores plus some shared resources. Only Intel's EUs are real cores.
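To see the distinction in practice, here is a minimal CUDA host-side sketch (assuming a CUDA toolkit is installed) that prints what the runtime itself reports - the number of SMs and the warp width - rather than the marketing “core” count:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // The marketing "core" count is roughly SMs × ALU lanes per SM;
    // the runtime only exposes the real hardware blocks:
    printf("SMs (the real 'modules'):    %d\n", prop.multiProcessorCount);
    printf("warp size (lanes per issue): %d\n", prop.warpSize);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```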

Next, as you know, a CPU is latency-optimized and a GPU is throughput-optimized. So,

  1. Almost all modern CPUs are superscalar. GPUs have limited superscalarity - they may be able to simultaneously issue commands to different EUs, e.g. the ALU and the load/store unit, in order to increase EU utilization, but they have never had multiple EUs of the same kind in a single core

  2. Most modern CPUs are out-of-order (the most notable exception being the A7/A53 “little cores” in smartphones). GPUs are never out-of-order, except for their ability to not stall on a load operation until the loaded register is actually used in further computations

  3. GPUs tend to share less frequently used EUs. In particular, AMD GCN has a module (CU) combining 4 cores that share all EUs except for the most important one - the vector ALU. OTOH, Intel iGPUs share nothing. NVIDIA is in the middle, sharing LD/ST engines between 2 or 4 cores, depending on the SM version. Among CPUs, only Bulldozer and Niagara are well-known examples of sharing some EUs

  4. While modern CPUs have multiple scalar ALUs, GPUs have few or none. Only AMD GCN has a scalar ALU, with a limited command set useful only for index and predicate computations, and even this single poor ALU is shared among the 4 cores in a CU

  5. While CPUs have limited SIMD command sets that support only the most frequent computations (but a full scalar command set), GPUs support a full vector command set and few or no scalar commands. BTW, AVX-512 is halfway there - its command set is much larger than e.g. SSE2's and supports full operation masking (see the first sketch after this list)

  6. GPUs are much more advanced in memory operations - they support efficient memory coalescing, banked shared memory, and atomics on global/shared memory. For comparison, CPUs support only the simplest atomic operations such as scalar LOCK XADD, gathered vector loads were added only in AVX2, and they are still much less efficient than the equivalent GPU operations (or scalar loads on the same CPU); see the second sketch after this list

  7. GPUs run at lower frequencies and may tolerate higher operation latencies. As a result, their operations tend to be much more complex. E.g. AMD has a command that atomically adds two pairs of memory operands (per SIMD lane!) and writes both results back to memory. Intel and AMD allow indexed access to registers. NVIDIA has the simplest architecture of those 3, but it is still comparable to the most complex CISC CPUs of the past

  8. NVIDIA's architecture is a moving target - they make significant changes with each major SM revision, breaking binary compatibility. AMD made big changes only in 2011, and since then both AMD and Intel have only incrementally improved their architectures

  9. Modern CPUs tend to utilize SMT (hyper-threading), while GPUs employ barrel threading, where on each cycle a GPU core executes commands from a single thread, switching threads from cycle to cycle. Among well-known CPUs, only the revolutionary Niagara did the same. Each Pascal/Volta core may have up to 32 threads resident simultaneously, so it always has a “backup thread” to start executing if the current thread stalls (up to dozens of cycles for an ALU command, up to thousands of cycles for a memory load)

  10. GPU memory and caches are also throughput-oriented, as opposed to CPU ones. The idea is to support thousands of requests in flight from the thousands of threads that a single GPU can execute simultaneously. So, e.g., each Pascal core at any given moment may have 16 of those 32 threads stalled waiting for memory loads, with the remaining 16 threads sharing the execution resources. Each individual thread may execute 10-100 times slower than on a CPU, but resource utilization (and hence throughput) is higher than on a CPU (see the third sketch after this list)
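Regarding point 5: a minimal CUDA sketch of per-lane masking (the kernel is a made-up illustration). The per-element conditional needs no scalar fallback path - lanes of the 32-wide warp that fail the test are simply masked off, which is roughly what AVX-512 opmasks retrofit onto the CPU side:

```cpp
// Clamp negative elements to zero: each lane of the warp evaluates the
// condition independently, and failing lanes are masked for the store.
__global__ void clamp_negatives(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] < 0.0f)
        v[i] = 0.0f;
}
```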
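Regarding point 6: a hypothetical 256-bin histogram kernel (my own sketch, not from the post) that touches all three features at once - coalesced global loads, atomics on banked shared memory, and atomics on global memory:

```cpp
__global__ void hist256(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[256];          // banked shared memory
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop: consecutive threads read consecutive bytes,
    // so the loads coalesce into wide memory transactions.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          // shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);           // global-memory atomic
}
```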
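Regarding points 9-10: the practical consequence is that you launch far more threads than there are ALU lanes, so the scheduler always has ready threads to swap in while others wait on memory. A hedged sketch (kernel and sizes are made up for illustration):

```cpp
__global__ void scale(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = 2.0f * src[i];   // each thread is mostly waiting on its load
}

// Host side: oversubscribe on purpose - tens of thousands of threads for a
// GPU with only a few thousand ALU lanes, so stalled threads can be hidden.
void launch_scale(float* dst, const float* src, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dst, src, n);
}
```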

As an example, I will describe the Maxwell (SM 5.x) and so-called Paswell (SM 6.1) architectures:

Each module (SM) includes 4 cores. Each core has the following set of 32-wide EUs:

  • ALU (fp32/int32/fp64)
  • SFU
  • LD/ST unit
  • branch unit

On each cycle, a core can start 2 operations if they are the next two operations in the same thread, are executed by different EUs, and neither EU is busy starting a previous command. Remember that execution is always in-order, so the core can't reorder commands in the same thread to better fill the EUs.

Operation delay/throughput varies. E.g. load/store commands have only 1/4 throughput, i.e. once the LD/ST unit has issued a command, it is busy for 4 cycles starting it and no other LD/ST command can be started. FP64 throughput is 1/16. The fastest command, FMAD, has a delay of 6 cycles and a throughput of 1 per cycle.
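To make the delay-6/throughput-1 point concrete, here is a hedged micro-benchmark-style sketch (names and constants are my own): since the core is in-order, a single dependent FMA chain can only issue once every 6 cycles, so keeping several independent chains per thread (or simply more resident threads) is what fills the pipeline:

```cpp
__global__ void fma_chains(float* out, float a, float b, int iters) {
    // Four independent accumulators: while x0's FMA result is still
    // 6 cycles away, the in-order core can keep issuing x1..x3.
    float x0 = 1.0f, x1 = 2.0f, x2 = 3.0f, x3 = 4.0f;
    for (int i = 0; i < iters; ++i) {
        x0 = fmaf(x0, a, b);
        x1 = fmaf(x1, a, b);
        x2 = fmaf(x2, a, b);
        x3 = fmaf(x3, a, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}
```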

Further reading: “Computer Architecture: A Quantitative Approach” for a general understanding of CPU architectures.

My own list of GPU architecture-related links (including AMD/NVIDIA/Intel): bulat-gpgpu-links.txt · GitHub

I found this video very helpful:
CppCon 2016: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler” - YouTube