What is "Interactivity per User Tokens per Second" mentioned in Jensen's GTC 2024 talk

haitsing · April 9, 2024, 7:17am

I am wondering what “Interactivity per User Tokens per Second” means in Jensen’s GTC 2024 talk; it is the x-axis label of the figure from the talk. Why does y (Throughput per GPU, Tokens per Second) decrease as x increases? In my opinion, y should increase as x increases, because a user will receive more tokens as the throughput increases.

Ddogge · April 11, 2024, 4:32am

I don’t know, but I hope that by hopping on this thread somebody at NVDA can answer.

xutingl · April 24, 2024, 8:56pm

IMHO “Interactivity per User” means the tokens/s that a single user experiences (single-request throughput, ignoring queueing etc.). y decreases with x because different parallelism strategies usually help with one while hurting the other (DP lets the system work on multiple requests for higher throughput, while each request is slower). So a curve that sits further to the upper right is strictly better across all possible combinations of parallelism.

e.g. With TP64, all 64 GPUs work on one request at the same time, so that one user enjoys high tokens/s. On the other hand, DP4 can work on 4 requests at the same time (16 GPUs per request). The overall throughput is higher, but each of the 4 users experiences lower tokens/s.
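To make the trade-off concrete, here is a rough toy model in Python. It is only a sketch under loud assumptions: the sublinear scaling exponent `alpha`, the `base_tps` value, and the function name `operating_point` are all made up for illustration and are not taken from the talk or from any NVIDIA measurement. The one idea it encodes is that adding GPUs to a single request (TP) speeds that request up by less than the number of GPUs added, so per-user tokens/s and per-GPU tokens/s pull in opposite directions.

```python
def operating_point(total_gpus, dp, base_tps=10.0, alpha=0.5):
    """Toy model of one point on the interactivity/throughput curve.

    ASSUMPTION (not from the talk): a request served by `tp` GPUs runs at
    base_tps * tp**alpha tokens/s, with alpha < 1 standing in for the
    communication overhead of tensor parallelism.
    """
    if total_gpus % dp != 0:
        raise ValueError("dp must divide total_gpus")
    tp = total_gpus // dp                   # GPUs cooperating on one request
    per_user = base_tps * tp ** alpha       # x-axis: tokens/s seen by one user
    per_gpu = dp * per_user / total_gpus    # y-axis: total tokens/s divided by GPU count
    return tp, per_user, per_gpu


# Sweep from "all 64 GPUs on one request" (TP64) to "one GPU per request" (DP64).
points = [operating_point(64, dp) for dp in (1, 4, 16, 64)]
for dp, (tp, per_user, per_gpu) in zip((1, 4, 16, 64), points):
    print(f"DP{dp} x TP{tp}: per-user {per_user:.2f} tok/s, per-GPU {per_gpu:.2f} tok/s")
```

In this toy model, moving from TP64 toward more DP lowers the per-user number while raising the per-GPU number, which is the downward-sloping shape asked about in the first post. Real systems also have batching, pipeline parallelism and memory limits, which this sketch ignores.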

haitsing · April 26, 2024, 12:56am

Thanks a lot for your clear explanation. Maybe it’s better to mark the number of users in the figure.

