The difference between Graphics Processing and General-Purpose Processing

Hi. I have some questions about how a GPU's graphics processing and general-purpose processing work.
I'd really appreciate your response.

  1. How does a GPU's graphics processing pipeline work?
    1-1) How does the Texture Unit work?
    1-2) How does the Texture Cache (L1 cache) work?
    1-3) What functions does the ROP unit have, and how does it work?
    1-4) When is the RT Core used, and how does it work?

  2. How does a GPU's general-purpose processing pipeline work?
    2-1) LD/ST Operation case
    2-2) Arithmetic/Logical/Bitwise Operation case
    2-3) Atomic/Reduce Operation case

  3. Do GPUs share any hardware units between the two workflows? If some hardware units are not shared, why not?

Those are a lot of questions, each of which could fill a long article. AFAIK all the listed units are implemented as separate hardware in current NVIDIA GPUs.

  1. How does a GPU's graphics processing pipeline work?

There are multiple sources for understanding the 3D graphics pipeline. NVIDIA GPUs support a 3D graphics engine and multiple compute engines, among other engines.

The 3D engine is a hardware pipeline built from many generic, re-usable components such as the streaming multiprocessors (SMs) and the memory subsystem. However, the pipeline is also accelerated by custom 3D units such as graphics-specific work distributors, the primitive distributor, and the pre-ROP, raster, and ROP units.

1-1) How does the Texture Unit work?

Documentation can be found in numerous whitepapers, the DirectX/Vulkan programming guides, and the CUDA programming guide. NVIDIA does not disclose architecture-specific details.

1-2) How does the Texture Cache (L1 cache) work?

Several GTC presentations describe the L1 data cache in detail.

1-3) What functions does the ROP unit have, and how does it work?

The feature set of the ROP unit can be found in the Vulkan, DirectX, and OpenGL specifications. NVIDIA does not disclose architecture-specific details.

1-4) When is the RT Core used, and how does it work?

The RT Cores are responsible for BVH traversal and ray intersection tests. Each RT Core is a co-processor attached to an SM/L1 pair. NVIDIA exposes ray tracing via DirectX Raytracing, Vulkan ray tracing, and OptiX.
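As a hedged illustration (a minimal sketch following common OptiX conventions, not an official sample; `Params`, `__raygen__rg`, and the payload layout are assumptions), an OptiX ray-generation program is ordinary CUDA C++ in which `optixTrace()` is the point where the RT Core takes over traversal and intersection, while shading programs run back on the SM:

```cuda
#include <optix.h>

// Hypothetical launch parameters; the layout is up to the application.
struct Params {
    OptixTraversableHandle handle;   // BVH built on the host via optixAccelBuild
    float3*                origins;
    float3*                directions;
    float*                 hit_t;    // output: hit distance per ray
};
extern "C" __constant__ Params params;

extern "C" __global__ void __raygen__rg()
{
    const unsigned int i = optixGetLaunchIndex().x;

    // optixTrace() hands BVH traversal and ray intersection to the RT Core;
    // any-hit/closest-hit programs execute on the SM when a candidate is found.
    unsigned int p0 = __float_as_uint(0.0f);   // payload slot for the hit t
    optixTrace(params.handle,
               params.origins[i], params.directions[i],
               0.0f, 1e16f, 0.0f,              // tmin, tmax, ray time
               OptixVisibilityMask(255),
               OPTIX_RAY_FLAG_NONE,
               0, 1, 0,                        // SBT offset, SBT stride, miss index
               p0);
    params.hit_t[i] = __uint_as_float(p0);
}
```

A matching closest-hit program would write the hit distance into payload register 0 (e.g. via `optixSetPayload_0`), which this ray-generation program then stores.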

  2. How does a GPU's general-purpose processing pipeline work?

GPGPU compute was initially implemented on top of the 3D graphics pipeline. The CUDA architecture was invented to provide an architecture for using GPUs for data-parallel compute. In modern NVIDIA GPUs, general-purpose processing is handled by the compute engine.

2-1) LD/ST Operation case

See the material above on the L1 data cache. The SMs schedule and execute warps; a warp is a fixed group of 32 threads. Load operations can be dispatched to the constant caches, the L1 data cache, shared memory, distributed shared memory, or the texture unit. In all cases the instruction is converted into wavefronts/packets containing the load size, modifiers, and addresses for each thread. The memory unit performs the load operation and returns the data. Store operations are handled in a similar fashion. The material above provides more details on the stages in the L1 data cache.
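A minimal CUDA kernel can show three of those load paths side by side (a sketch for illustration; the names `scale`, `tile`, and `ld_st_paths` are mine). Which path a load takes is determined by the memory space the address refers to:

```cuda
__constant__ float scale;            // reads are served by the constant caches

__global__ void ld_st_paths(const float* __restrict__ in, float* out, int n)
{
    __shared__ float tile[256];      // reads/writes served by the shared memory unit

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Global load: the warp's 32 addresses are coalesced into one or more
    // wavefronts/packets that the L1 data cache services.
    tile[threadIdx.x] = in[i];
    __syncthreads();

    // Shared-memory load plus a constant-cache load, then a global store.
    out[i] = tile[threadIdx.x] * scale;
}
```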

The Hopper architecture introduced a new unit, the Tensor Memory Accelerator (TMA), for efficiently transferring large blocks of data between global and shared memory.
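From CUDA C++, bulk global-to-shared copies can be expressed with the cooperative-groups asynchronous copy API; whether the hardware uses `cp.async` (Ampere) or bulk/TMA paths (Hopper) underneath is an implementation detail of the architecture and toolkit, so treat this as a hedged sketch rather than a guaranteed TMA path:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void stage_tile(const float* global_src, float* out, int n)
{
    __shared__ float tile[256];
    auto block = cg::this_thread_block();

    // The whole block issues one bulk copy from global to shared memory;
    // the data moves without being staged through per-thread registers.
    cg::memcpy_async(block, tile, global_src + blockIdx.x * 256,
                     sizeof(float) * 256);
    cg::wait(block);                 // wait for the copy to land in shared memory

    int i = blockIdx.x * 256 + block.thread_rank();
    if (i < n) out[i] = tile[block.thread_rank()] * 2.0f;
}
```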

2-2) Arithmetic/Logical/Bitwise Operation case

The SMs support hardware ALU data paths that can execute common arithmetic, logical, and bitwise operations.
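Each of those operation classes maps to the SM's integer/logic data paths; a small kernel makes the mapping concrete (an illustrative sketch, with names of my choosing):

```cuda
__global__ void alu_ops(const unsigned int* a, const unsigned int* b,
                        unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    unsigned int x = a[i], y = b[i];
    unsigned int sum  = x + y;             // arithmetic: integer add
    unsigned int flag = (x > y) ? 1u : 0u; // logical: compare + predicated select
    unsigned int bits = (x & y) | (x ^ y); // bitwise: AND, XOR, OR
    out[i] = sum + flag + __popc(bits);    // __popc: hardware population count
}
```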

2-3) Atomic/Reduce Operation case

The SM's shared memory unit and the L2 slices contain ALUs for common atomic/reduction operations.
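A classic histogram kernel exercises both of those atomic paths (a sketch; the 256-bin layout and names are mine). Atomics on shared-memory addresses are resolved inside the SM, while atomics on global addresses are resolved near the data in the L2 slices, so the contended cache line does not bounce between SMs:

```cuda
__global__ void histogram(const unsigned char* data, int n,
                          unsigned int* global_hist /* 256 bins */)
{
    __shared__ unsigned int local_hist[256];

    // Zero the per-block shared-memory histogram.
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Shared-memory atomics: resolved by the SM's shared memory unit.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Global-memory atomics: resolved by ALUs in the L2 slices.
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```

When the return value of `atomicAdd` is unused, as here, the compiler can emit a fire-and-forget reduction instead of a full atomic read-modify-write.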