- How does the GPU’s graphics processing pipeline work?
There are multiple public sources for understanding the 3D graphics pipeline. NVIDIA GPUs expose a 3D graphics engine and multiple compute engines, among other engines.
The 3D engine has a hardware pipeline with many generic, reusable components such as the Streaming Multiprocessors (SMs) and the memory subsystem. In addition, the fixed-function portion of the pipeline is accelerated by custom 3D units such as graphics-specific work distributors, the primitive distributor, and the pre-ROP, raster, and ROP units.
1-1) How the Texture Unit works
Documentation can be found in numerous whitepapers, in the DirectX and Vulkan programming guides, and in the CUDA Programming Guide. NVIDIA does not disclose architecture-specific details.
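From the CUDA side, the texture unit is exercised through texture objects: the hardware performs addressing, format conversion, and filtering on each fetch. The sketch below is only an illustration (kernel and variable names are made up); it samples a single-channel 2D texture with bilinear filtering and clamped addressing:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Illustrative kernel: each thread samples the texture at a normalized
// coordinate; the texture unit does the addressing and bilinear filtering.
__global__ void sampleKernel(cudaTextureObject_t tex, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float u = (i + 0.5f) / n;            // normalized x coordinate
        out[i] = tex2D<float>(tex, u, 0.5f); // filtered fetch through the texture unit
    }
}

int main() {
    const int W = 64, H = 64, N = 256;

    // Fill a host image and copy it into a cudaArray (the texture's backing store).
    float hImg[W * H];
    for (int i = 0; i < W * H; ++i) hImg[i] = (float)(i % W);

    cudaChannelFormatDesc ch = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &ch, W, H);
    cudaMemcpy2DToArray(arr, 0, 0, hImg, W * sizeof(float),
                        W * sizeof(float), H, cudaMemcpyHostToDevice);

    // Describe the resource and the sampling behavior (addressing, filtering).
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;    // bilinear filtering in the texture unit
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 1;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float* dOut; cudaMalloc(&dOut, N * sizeof(float));
    sampleKernel<<<(N + 127) / 128, 128>>>(tex, dOut, N);

    float hOut[N];
    cudaMemcpy(hOut, dOut, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sample[0] = %f\n", hOut[0]);

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(dOut);
    return 0;
}
```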
1-2) How the Texture Cache (L1 cache) works
There are several GTC presentations describing the L1 data cache in detail.
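As a small illustration of how the unified L1/texture data path is reached from CUDA (not of its internal stages), a read-only global load can be issued with the `__ldg()` intrinsic; the kernel below is a hypothetical example, and on recent architectures `const __restrict__` pointers usually let the compiler choose the same path on its own:

```cuda
#include <cuda_runtime.h>

// Minimal sketch: __ldg() issues the load through the read-only data path,
// which is serviced by the unified L1/texture cache (requires compute
// capability 3.5 or later).
__global__ void scale(const float* __restrict__ in, float* __restrict__ out,
                      float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);  // read-only load, cached in L1
}
```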
1-3) What functions the ROP unit has and how it works
The feature set of the ROP unit (for example, blending, depth/stencil testing, and multisample resolve) is defined by the Vulkan, DirectX, and OpenGL specifications. NVIDIA does not disclose architecture-specific details.
1-4) When the RT Core is used and how it works
The RT Cores accelerate bounding volume hierarchy (BVH) traversal and ray-triangle intersection testing. Each RT Core is a co-processor attached to an SM and its L1. NVIDIA exposes ray tracing through DirectX Raytracing (DXR), Vulkan Ray Tracing, and OptiX.
- How does the GPU’s general-purpose processing pipeline work?
GPGPU compute was initially implemented through the 3D graphics pipeline. The CUDA architecture was introduced to provide a programming model for data-parallel compute on GPUs. On modern NVIDIA GPUs, general-purpose processing is carried out through the compute engine.
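A minimal sketch of such data-parallel compute: the SAXPY kernel below is launched as a grid of thread blocks and submitted to the GPU's compute engine rather than the 3D engine (kernel name and sizes are arbitrary):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Minimal data-parallel kernel: each thread handles one element.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int N = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, N * sizeof(float));
    cudaMallocManaged(&y, N * sizeof(float));
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch a grid of 256-thread blocks on the compute engine.
    saxpy<<<(N + 255) / 256, 256>>>(2.0f, x, y, N);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}
```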
2-1) LD/ST Operation case
See the GTC presentations referenced above for the L1 data cache. The SMs schedule and execute warps; a warp is a fixed group of 32 threads. Load operations can be dispatched to the constant caches, the L1 data cache, shared memory, distributed shared memory, or the texture unit. In all cases the instruction is converted into wavefronts/packets containing the load size, modifiers, and the addresses for each thread. The memory unit performs the load and returns the data. Store operations are handled in a similar fashion; the presentations above describe the stages inside the L1 data cache in more detail.
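The sketch below shows the typical LD/ST pattern from a kernel's point of view: coalesced global loads staged into shared memory, a barrier, then global stores (it assumes a 256-thread block and an element count that is a multiple of 256; names are illustrative):

```cuda
#include <cuda_runtime.h>

// Each warp's global loads are turned into memory requests by the LSU,
// staged in shared memory, and written back with global stores.
// Reverses each 256-element tile as a simple illustration.
__global__ void reverseTiles(const float* in, float* out, int n) {
    __shared__ float tile[256];                  // one tile per 256-thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];               // global load -> shared store
    __syncthreads();                             // wait until the tile is populated
    if (i < n)
        out[i] = tile[blockDim.x - 1 - threadIdx.x]; // shared load -> global store
}
```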
The Hopper architecture introduced a new unit, the Tensor Memory Accelerator (TMA), for efficiently transferring large blocks of data between global and shared memory.
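Whether a given copy is serviced by the TMA depends on the architecture and on how the copy is expressed; Hopper's bulk tensor copies are exposed through lower-level tensor-map interfaces not shown here. As an illustration of the same pattern, the cooperative groups asynchronous copy API below stages a tile from global into shared memory as one block-wide operation (tile size and kernel name are arbitrary; it assumes the element count is a multiple of 256):

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Asynchronous bulk copy of one 256-element tile from global to shared memory.
// The copy is issued for the whole thread block and completes at cg::wait();
// on Ampere and later it can be hardware-accelerated.
__global__ void asyncTileCopy(const float* src, float* dst, int n) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    // Issue the block-wide asynchronous copy (global -> shared).
    cg::memcpy_async(block, tile, src + blockIdx.x * 256, sizeof(float) * 256);

    cg::wait(block);   // wait for the copy to land in shared memory

    int i = blockIdx.x * 256 + threadIdx.x;
    if (i < n) dst[i] = tile[threadIdx.x] * 2.0f;  // consume the staged data
}
```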
2-2) Arithmetic/Logical/Bitwise Operation case
Each SM has hardware ALU data paths that execute common arithmetic, logical, and bitwise operations.
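For illustration, each statement in the hypothetical kernel below maps onto those data paths: integer add, bitwise AND and shift, comparison/select, and a floating-point fused multiply-add:

```cuda
#include <cuda_runtime.h>

// Each line compiles to instructions dispatched to the SM's ALU/FMA data paths.
__global__ void aluDemo(const int* a, const int* b, int* iout,
                        const float* x, float* fout, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int s   = a[i] + b[i];            // integer arithmetic
        int m   = (a[i] & 0xFF) << 4;     // bitwise AND and shift
        bool gt = a[i] > b[i];            // comparison / predicate
        iout[i] = gt ? s : m;             // select
        fout[i] = fmaf(x[i], 2.0f, 1.0f); // floating-point fused multiply-add
    }
}
```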
2-3) Atomic/Reduce Operation case
The SM shared memory unit and the L2 slices contain ALUs for common atomic and reduction operations.
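A common pattern that exercises both is a two-level reduction: shared-memory atomics within a block, followed by one global atomic per block that is resolved at the L2. The kernel below is a minimal sketch (names arbitrary):

```cuda
#include <cuda_runtime.h>

// Block-level partial sum via shared-memory atomics, then a single global
// atomicAdd per block; the shared-memory atomics are handled inside the SM,
// the global atomic at the L2 slices.
__global__ void twoLevelSum(const float* in, float* total, int n) {
    __shared__ float blockSum;
    if (threadIdx.x == 0) blockSum = 0.0f;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&blockSum, in[i]);            // shared-memory atomic

    __syncthreads();
    if (threadIdx.x == 0) atomicAdd(total, blockSum);  // global (L2) atomic
}
```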