I’m working on optimizing GPU inference for various deep learning models (e.g., Transformer-based LLMs, diffusion models, and other general networks) using PyTorch on NVIDIA GPUs. I’m trying to identify the primary performance bottleneck: is my workload compute-bound (limited by the GPU’s floating-point throughput) or memory-bound (limited by the memory bandwidth available to fetch data)?
I know this depends heavily on the model architecture, batch size, and data type (e.g., FP16 vs. FP32). While the peak hardware specs are known, they’re often not achieved in practice. I’m looking for a robust methodology that combines both theoretical analysis and empirical profiling.
Specifically, I’d appreciate insights on:
- Theoretical Analysis: What is a practical, step-by-step method to calculate the Arithmetic Intensity (AI = Total FLOPs / Total Bytes Accessed) for a given model layer (e.g., attention or convolution)? How can I use this to apply the Roofline Model to predict the bottleneck before running any code? Doing this derivation by hand for every model takes quite a bit of time and may not be accurate (a rough sketch of my current approach is at the end of this post).
- Empirical Analysis (Practical Tools): What are the key metrics and workflows for diagnosing the bottleneck with NVIDIA profiling tools like Nsight Systems and Nsight Compute? For a deep learning kernel, which specific metrics should I focus on, and what values indicate a compute-bound vs. a memory-bound kernel? (My current workflow is also sketched below.)
- Model-Specific Nuances: How do the bottlenecks differ across model types and inference phases? For instance, why is the prefill phase of LLM inference typically compute-bound while the decode phase is almost always memory-bound? I have only seen this claim in papers and online discussions; it seems intuitively right, but I have no supporting evidence of my own, either from calculation or from profiling, and I don’t know how far my setup is from the crossover point. If I scale a single variable, say context length, will I hit the compute limit or the memory limit first? (I attempted a back-of-envelope version of this at the end of the post.)
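To make the theoretical part concrete, here is the back-of-envelope calculation I currently do for a single GEMM. The A100 peak numbers and the layer shapes are just assumptions I picked for illustration, not measured values:

```python
# Rough roofline check for a single GEMM (e.g., an attention output projection
# or one FFN matmul), done purely on paper -- no GPU needed.
# Assumed hardware peaks (A100 80GB SXM, FP16 Tensor Core); real kernels
# rarely reach these datasheet numbers.
PEAK_FLOPS = 312e12      # FP16 Tensor Core peak, FLOP/s (assumption)
PEAK_BW = 2.0e12         # HBM bandwidth, bytes/s (assumption)
RIDGE_POINT = PEAK_FLOPS / PEAK_BW  # FLOPs/byte where the roofline bends (~156)

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """AI = Total FLOPs / Total Bytes for a (m,k) x (k,n) matmul.

    Assumes each operand is read once and the output written once,
    i.e., perfect reuse -- a best-case estimate.
    """
    flops = 2 * m * n * k                          # multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Example: decode-phase projection, batch=1 token in flight -> m=1
ai_decode = gemm_arithmetic_intensity(m=1, n=4096, k=4096)
# Example: prefill-phase projection, 2048 prompt tokens in flight -> m=2048
ai_prefill = gemm_arithmetic_intensity(m=2048, n=4096, k=4096)

for name, ai in [("decode (m=1)", ai_decode), ("prefill (m=2048)", ai_prefill)]:
    bound = "memory-bound" if ai < RIDGE_POINT else "compute-bound"
    print(f"{name}: AI = {ai:.1f} FLOP/byte, ridge = {RIDGE_POINT:.0f} -> {bound}")
```

Is this the right level of granularity, or do I need to account for cache reuse and kernel fusion before the prediction means anything?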
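On the empirical side, this is roughly how I annotate and profile the workload today. `model`, `input_ids`, and `run_inference.py` are placeholders for my own script, and the `ncu` metric names reflect my understanding of the Speed of Light section rather than something I’m certain about:

```python
# Mark regions with NVTX so Nsight Systems / Nsight Compute can attribute
# kernels to model phases. torch.cuda.nvtx is part of PyTorch.
import torch

@torch.no_grad()
def profiled_step(model, input_ids):
    torch.cuda.nvtx.range_push("prefill")
    out = model(input_ids)          # whole-prompt forward pass
    torch.cuda.nvtx.range_pop()
    torch.cuda.synchronize()        # ensure the kernels finish inside the range
    return out

# Then, roughly, from the shell:
#   nsys profile -t cuda,nvtx -o trace python run_inference.py
#   ncu --nvtx --set full -o kernels python run_inference.py
# In Nsight Compute I look at the "Speed of Light" section, mainly:
#   sm__throughput.avg.pct_of_peak_sustained_elapsed    (compute utilization)
#   dram__throughput.avg.pct_of_peak_sustained_elapsed  (DRAM utilization)
# My reading: if DRAM% is high and well above SM%, the kernel is memory-bound;
# if SM% dominates and is high, it is compute-bound. Is that interpretation right?
```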
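And for the prefill/decode question, this is my attempt at a back-of-envelope scaling model for a single decoder layer during the decode phase. The dimensions are made-up 7B-class numbers and everything is assumed to be FP16:

```python
# Back-of-envelope model of ONE decoder layer during decode, to see whether
# compute or memory saturates first as I scale context length or batch size.
D_MODEL = 4096
FFN_MULT = 4                     # assumed FFN expansion factor
BYTES = 2                        # FP16
RIDGE = 312e12 / 2.0e12          # A100 FP16 ridge point from the sketch above

def decode_layer_ai(batch, ctx):
    """Rough FLOPs / bytes for generating one new token per sequence."""
    # Weight matrices: QKV + output projection + two FFN matmuls,
    # read once per step and shared across the whole batch.
    weight_elems = (4 * D_MODEL * D_MODEL) + (2 * FFN_MULT * D_MODEL * D_MODEL)
    gemm_flops = 2 * batch * weight_elems
    # Attention over the KV cache: QK^T and PV, ~4 FLOPs per cached element.
    attn_flops = 4 * batch * ctx * D_MODEL
    # Bytes: weights once, plus the per-sequence KV cache (K and V).
    weight_bytes = BYTES * weight_elems
    kv_bytes = BYTES * 2 * batch * ctx * D_MODEL
    return (gemm_flops + attn_flops) / (weight_bytes + kv_bytes)

for ctx in (512, 4096, 32768):
    for batch in (1, 32):
        ai = decode_layer_ai(batch, ctx)
        print(f"ctx={ctx:6d} batch={batch:3d}  AI={ai:6.1f}  "
              f"({'memory' if ai < RIDGE else 'compute'}-bound)")
```

If I read the output right, decode stays far below the ridge point no matter how long the context gets, because the KV-cache bytes grow in lockstep with the attention FLOPs, and only batch size pushes the GEMM part up the roofline. Is that reasoning sound, or is it the kind of thing only profiling can confirm?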