I’m working on optimizing GPU inference for various deep learning models (e.g., Transformer-based LLMs, diffusion models, and other general networks) using PyTorch on NVIDIA GPUs. I’m trying to identify the primary performance bottleneck: is my workload compute-bound (limited by the GPU’s floating-point throughput) or memory-bound (limited by the memory bandwidth available to fetch data)?
I know this depends heavily on the model architecture, batch size, and data type (e.g., FP16 vs. FP32). While the peak hardware specs are known, they’re often not achieved in practice. I’m looking for a robust methodology that combines both theoretical analysis and empirical profiling.
Specifically, I’d appreciate insights on:
- Theoretical Analysis: What is a practical, step-by-step method to calculate the Arithmetic Intensity (AI = Total FLOPs / Total Bytes Accessed) for a given model layer (e.g., attention or convolution)? How can I use this to apply the Roofline Model to predict the bottleneck before running any code? Doing this derivation by hand for every model takes quite a bit of time and may not be accurate (a rough sketch of my current approach is at the end of this post).
- Empirical Analysis (Practical Tools): What are the key metrics and workflows for diagnosing the bottleneck with NVIDIA profiling tools like Nsight Systems and Nsight Compute? For a deep learning kernel, which specific metrics should I focus on, and what values indicate a compute-bound vs. a memory-bound kernel? (My current workflow is also sketched below.)
- Model-Specific Nuances: How do the bottlenecks differ across model types and inference phases? For instance, why is the prefill phase of LLM inference typically compute-bound while the decode phase is almost always memory-bound? I have only seen this claim in papers and online discussions; it seems intuitively right, but I have no supporting evidence of my own, either from calculation or from profiling, and I don’t know how far my setup is from the crossover point. If I scale a single variable, say context length, will I hit the compute limit or the memory limit first? (I attempted a back-of-envelope version of this at the end of the post.)
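To make the theoretical part concrete, here is the back-of-envelope calculation I currently do for a single GEMM. The A100 peak numbers and the layer shapes are just assumptions I picked for illustration, not measured values:

```python
# Rough roofline check for a single GEMM (e.g., an attention output projection
# or one FFN matmul), done purely on paper -- no GPU needed.
# Assumed hardware peaks (A100 80GB SXM, FP16 Tensor Core); real kernels
# rarely reach these datasheet numbers.
PEAK_FLOPS = 312e12      # FP16 Tensor Core peak, FLOP/s (assumption)
PEAK_BW = 2.0e12         # HBM bandwidth, bytes/s (assumption)
RIDGE_POINT = PEAK_FLOPS / PEAK_BW  # FLOPs/byte where the roofline bends (~156)

def gemm_arithmetic_intensity(m, n, k, bytes_per_elem=2):
    """AI = Total FLOPs / Total Bytes for a (m,k) x (k,n) matmul.

    Assumes each operand is read once and the output written once,
    i.e., perfect reuse -- a best-case estimate.
    """
    flops = 2 * m * n * k                          # multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

# Example: decode-phase projection, batch=1 token in flight -> m=1
ai_decode = gemm_arithmetic_intensity(m=1, n=4096, k=4096)
# Example: prefill-phase projection, 2048 prompt tokens in flight -> m=2048
ai_prefill = gemm_arithmetic_intensity(m=2048, n=4096, k=4096)

for name, ai in [("decode (m=1)", ai_decode), ("prefill (m=2048)", ai_prefill)]:
    bound = "memory-bound" if ai < RIDGE_POINT else "compute-bound"
    print(f"{name}: AI = {ai:.1f} FLOP/byte, ridge = {RIDGE_POINT:.0f} -> {bound}")
```

Is this the right level of granularity, or do I need to account for cache reuse and kernel fusion before the prediction means anything?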
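On the empirical side, this is roughly how I annotate and profile the workload today. `model`, `input_ids`, and `run_inference.py` are placeholders for my own script, and the `ncu` metric names reflect my understanding of the Speed of Light section rather than something I’m certain about:

```python
# Mark regions with NVTX so Nsight Systems / Nsight Compute can attribute
# kernels to model phases. torch.cuda.nvtx is part of PyTorch.
import torch

@torch.no_grad()
def profiled_step(model, input_ids):
    torch.cuda.nvtx.range_push("prefill")
    out = model(input_ids)          # whole-prompt forward pass
    torch.cuda.nvtx.range_pop()
    torch.cuda.synchronize()        # ensure the kernels finish inside the range
    return out

# Then, roughly, from the shell:
#   nsys profile -t cuda,nvtx -o trace python run_inference.py
#   ncu --nvtx --set full -o kernels python run_inference.py
# In Nsight Compute I look at the "Speed of Light" section, mainly:
#   sm__throughput.avg.pct_of_peak_sustained_elapsed    (compute utilization)
#   dram__throughput.avg.pct_of_peak_sustained_elapsed  (DRAM utilization)
# My reading: if DRAM% is high and well above SM%, the kernel is memory-bound;
# if SM% dominates and is high, it is compute-bound. Is that interpretation right?
```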
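And for the prefill/decode question, this is my attempt at a back-of-envelope scaling model for a single decoder layer during the decode phase. The dimensions are made-up 7B-class numbers and everything is assumed to be FP16:

```python
# Back-of-envelope model of ONE decoder layer during decode, to see whether
# compute or memory saturates first as I scale context length or batch size.
D_MODEL = 4096
FFN_MULT = 4                     # assumed FFN expansion factor
BYTES = 2                        # FP16
RIDGE = 312e12 / 2.0e12          # A100 FP16 ridge point from the sketch above

def decode_layer_ai(batch, ctx):
    """Rough FLOPs / bytes for generating one new token per sequence."""
    # Weight matrices: QKV + output projection + two FFN matmuls,
    # read once per step and shared across the whole batch.
    weight_elems = (4 * D_MODEL * D_MODEL) + (2 * FFN_MULT * D_MODEL * D_MODEL)
    gemm_flops = 2 * batch * weight_elems
    # Attention over the KV cache: QK^T and PV, ~4 FLOPs per cached element.
    attn_flops = 4 * batch * ctx * D_MODEL
    # Bytes: weights once, plus the per-sequence KV cache (K and V).
    weight_bytes = BYTES * weight_elems
    kv_bytes = BYTES * 2 * batch * ctx * D_MODEL
    return (gemm_flops + attn_flops) / (weight_bytes + kv_bytes)

for ctx in (512, 4096, 32768):
    for batch in (1, 32):
        ai = decode_layer_ai(batch, ctx)
        print(f"ctx={ctx:6d} batch={batch:3d}  AI={ai:6.1f}  "
              f"({'memory' if ai < RIDGE else 'compute'}-bound)")
```

If I read the output right, decode stays far below the ridge point no matter how long the context gets, because the KV-cache bytes grow in lockstep with the attention FLOPs, and only batch size pushes the GEMM part up the roofline. Is that reasoning sound, or is it the kind of thing only profiling can confirm?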