single GPU core vs. single CPU core

Hi.

What is the difference between a single CPU core and a single GPU core?

I am a newbie in this area. Can you recommend some materials to study or watch to learn more?

I only know a bit about each one being optimized for throughput or latency, which is always discussed in tutorials.

Is the difference in the instruction set? What does the triangle mean in the GPU context?

The answer greatly depends on how much you know about CPU cores.

First, the “thousands of GPU cores” in modern GPUs are a marketing lie - they call each ALU a “GPU core”. In CPU terminology, an SM/CU is more like a module (as in Ryzen or Bulldozer) combining 2-4 real cores plus some shared resources. Only Intel's EUs are real cores.
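To see the distinction in practice, here is a minimal CUDA host-side sketch (assuming a CUDA toolkit is installed) that prints what the runtime itself reports - the number of SMs and the warp width - rather than the marketing “core” count:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // The marketing "core" count is roughly SMs × ALU lanes per SM;
    // the runtime only exposes the real hardware blocks:
    printf("SMs (the real 'modules'):    %d\n", prop.multiProcessorCount);
    printf("warp size (lanes per issue): %d\n", prop.warpSize);
    printf("max resident threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}
```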

Next, as you know, a CPU is latency-optimized and a GPU is throughput-optimized. So,

  1. Almost all modern CPUs are superscalar. GPUs have limited superscalarity - they may be able to simultaneously issue commands to different EUs, e.g. the ALU and the load/store unit, in order to increase EU utilization, but they have never had multiple EUs of the same kind in a single core

  2. Most modern CPUs are out-of-order (the most notable exception being the A7/A53 “little cores” in smartphones). GPUs are never out-of-order, except for their ability to not stall on a load operation until the loaded register is actually used in further computations

  3. GPUs tend to share less frequently used EUs. In particular, AMD GCN has a module (CU) combining 4 cores that share all EUs except for the most important one - the vector ALU. OTOH, Intel iGPUs share nothing. NVIDIA is in the middle, sharing LD/ST engines between 2 or 4 cores, depending on the SM version. Among CPUs, only Bulldozer and Niagara are well-known examples of sharing some EUs

  4. While modern CPUs have multiple scalar ALUs, GPUs have few or none. Only AMD GCN has a scalar ALU, with a limited command set useful only for index and predicate computations, and even this single poor ALU is shared among the 4 cores in a CU

  5. While CPUs have limited SIMD command sets that support only the most frequent computations (but a full scalar command set), GPUs support a full vector command set and few or no scalar commands. BTW, AVX-512 is halfway there - its command set is much larger than e.g. SSE2's and supports full operation masking (see the first sketch after this list)

  6. GPUs are much more advanced in memory operations - they support efficient memory coalescing, banked shared memory, and atomics on global/shared memory. For comparison, CPUs support only the simplest atomic operations such as scalar LOCK XADD, gathered vector loads were added only in AVX2, and they are still much less efficient than the equivalent GPU operations (or scalar loads on the same CPU); see the second sketch after this list

  7. GPUs run at lower frequencies and may tolerate higher operation latencies. As a result, their operations tend to be much more complex. E.g. AMD has a command that atomically adds two pairs of memory operands (per SIMD lane!) and writes both results back to memory. Intel and AMD allow indexed access to registers. NVIDIA has the simplest architecture of those 3, but it is still comparable to the most complex CISC CPUs of the past

  8. NVIDIA's architecture is a moving target - they make significant changes with each major SM revision, breaking binary compatibility. AMD made big changes only in 2011, and since then both AMD and Intel have only incrementally improved their architectures

  9. Modern CPUs tend to utilize SMT (hyper-threading), while GPUs employ barrel threading, where on each cycle a GPU core executes commands from a single thread, switching threads from cycle to cycle. Among well-known CPUs, only the revolutionary Niagara did the same. Each Pascal/Volta core may have up to 32 threads resident simultaneously, so it always has a “backup thread” to start executing if the current thread stalls (up to dozens of cycles for an ALU command, up to thousands of cycles for a memory load)

  10. GPU memory and caches are also throughput-oriented, as opposed to CPU ones. The idea is to support thousands of requests in flight from the thousands of threads that a single GPU can execute simultaneously. So, e.g., each Pascal core at any given moment may have 16 of those 32 threads stalled waiting for memory loads, with the remaining 16 threads sharing the execution resources. Each individual thread may execute 10-100 times slower than on a CPU, but resource utilization (and hence throughput) is higher than on a CPU (see the third sketch after this list)
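Regarding point 5: a minimal CUDA sketch of per-lane masking (the kernel is a made-up illustration). The per-element conditional needs no scalar fallback path - lanes of the 32-wide warp that fail the test are simply masked off, which is roughly what AVX-512 opmasks retrofit onto the CPU side:

```cpp
// Clamp negative elements to zero: each lane of the warp evaluates the
// condition independently, and failing lanes are masked for the store.
__global__ void clamp_negatives(float* v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] < 0.0f)
        v[i] = 0.0f;
}
```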
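Regarding point 6: a hypothetical 256-bin histogram kernel (my own sketch, not from the post) that touches all three features at once - coalesced global loads, atomics on banked shared memory, and atomics on global memory:

```cpp
__global__ void hist256(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[256];          // banked shared memory
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    // Grid-stride loop: consecutive threads read consecutive bytes,
    // so the loads coalesce into wide memory transactions.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);          // shared-memory atomic
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);           // global-memory atomic
}
```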
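Regarding points 9-10: the practical consequence is that you launch far more threads than there are ALU lanes, so the scheduler always has ready threads to swap in while others wait on memory. A hedged sketch (kernel and sizes are made up for illustration):

```cpp
__global__ void scale(float* dst, const float* src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = 2.0f * src[i];   // each thread is mostly waiting on its load
}

// Host side: oversubscribe on purpose - tens of thousands of threads for a
// GPU with only a few thousand ALU lanes, so stalled threads can be hidden.
void launch_scale(float* dst, const float* src, int n) {
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dst, src, n);
}
```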

As an example, I will describe the Maxwell (SM 5.x) and so-called Paswell (SM 6.1) architectures:

Each module (SM) includes 4 cores. Each core has the following set of 32-wide EUs:

  • ALU (fp32/int32/fp64)
  • SFU
  • LD/ST unit
  • branch unit

On each cycle, a core can start 2 operations if they are the next two operations in the same thread, are executed by different EUs, and neither EU is busy starting a previous command. Remember that execution is always in-order, so the core can't reorder commands in the same thread to better fill the EUs.

Operation delay/throughput varies. E.g. load/store commands have only 1/4 throughput, i.e. once the LD/ST unit has issued a command, it is busy for 4 cycles starting it and no other LD/ST command can be started. FP64 throughput is 1/16. The fastest command, FMAD, has a delay of 6 cycles and a throughput of 1 per cycle.
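To make the delay-6/throughput-1 point concrete, here is a hedged micro-benchmark-style sketch (names and constants are my own): since the core is in-order, a single dependent FMA chain can only issue once every 6 cycles, so keeping several independent chains per thread (or simply more resident threads) is what fills the pipeline:

```cpp
__global__ void fma_chains(float* out, float a, float b, int iters) {
    // Four independent accumulators: while x0's FMA result is still
    // 6 cycles away, the in-order core can keep issuing x1..x3.
    float x0 = 1.0f, x1 = 2.0f, x2 = 3.0f, x3 = 4.0f;
    for (int i = 0; i < iters; ++i) {
        x0 = fmaf(x0, a, b);
        x1 = fmaf(x1, a, b);
        x2 = fmaf(x2, a, b);
        x3 = fmaf(x3, a, b);
    }
    out[blockIdx.x * blockDim.x + threadIdx.x] = x0 + x1 + x2 + x3;
}
```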

Further reading: “Computer Architecture: A Quantitative Approach” for a general understanding of CPU architectures.

My own list of GPU architecture-related links (including AMD/NVIDIA/Intel): bulat-gpgpu-links.txt · GitHub

I found this video very helpful:
CppCon 2016: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler” - YouTube