I am wondering what “Interactivity per User Tokens per Second” means in Jensen’s GTC 2024 talk; it is the x-axis of the following figure. Why does y (Throughput per GPU Tokens per Second) decrease as x increases? In my opinion, y should increase as x increases, because a user will receive more tokens as the throughput increases.
I don’t know, but I hope that by hopping on this thread somebody at NVDA can answer.
IMHO “Interactivity per User” means the tokens/s that a single user experiences (single-request throughput, ignoring queueing etc.). y decreases with x because different parallelism strategies usually help with one while hurting the other (DP lets the system work on multiple requests at once for higher throughput, while each request is slower). So a curve that lies further to the upper right is strictly better across all possible combinations of parallelism.
E.g., with TP64, all 64 GPUs work on one request at the same time, so this one user enjoys high tokens/s. On the other hand, DP4 can work on 4 requests at the same time (16 GPUs per request). The overall throughput is higher, but each of the 4 users experiences a lower tokens/s.
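To make the trade-off concrete, here is a toy Python sketch of the TP64 vs. DP4 example. Everything numeric in it is an assumption for illustration, not something taken from the talk: the single-GPU speed `BASE_TOKENS_PER_S`, the communication penalty `COMM_OVERHEAD`, and the sublinear scaling formula in `per_user_tokens_per_s` are all made up. It also ignores batching, which real systems vary as well. The only point is that if adding GPUs to one request gives less-than-linear speedup, then per-user tokens/s goes up while per-GPU tokens/s goes down.

```python
# Toy model only: all constants and the scaling formula are hypothetical.
TOTAL_GPUS = 64
BASE_TOKENS_PER_S = 10.0   # assumed speed of one request on a single GPU
COMM_OVERHEAD = 0.05       # assumed per-extra-GPU communication penalty


def per_user_tokens_per_s(tp: int) -> float:
    """Tokens/s seen by one request spread over `tp` GPUs (sublinear speedup)."""
    return BASE_TOKENS_PER_S * tp / (1.0 + COMM_OVERHEAD * (tp - 1))


def operating_point(tp: int, total_gpus: int = TOTAL_GPUS):
    """Return (dp, per-user tokens/s, per-GPU tokens/s) for a TP degree."""
    dp = total_gpus // tp                # number of requests served in parallel
    per_user = per_user_tokens_per_s(tp)
    per_gpu = dp * per_user / total_gpus  # total throughput divided by GPU count
    return dp, per_user, per_gpu


if __name__ == "__main__":
    for tp in (16, 64):                  # TP16 x DP4 versus TP64 x DP1
        dp, per_user, per_gpu = operating_point(tp)
        print(f"TP{tp} x DP{dp}: per-user {per_user:.1f} tok/s, "
              f"per-GPU {per_gpu:.2f} tok/s")
```

Under these assumed numbers, TP64 lands further right (higher per-user tokens/s) but lower (fewer tokens/s per GPU) than TP16 with DP4, which is the downward slope in the figure.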
Thanks a lot for your clear explanation. Maybe it’s better to mark the number of users in the figure.