Optimizing a CUDA program to get more throughput on T4

147060616 · August 4, 2019, 4:19pm

I am optimizing a CUDA program to get more throughput, and ran the following experiments:

1. A single thread executes the 6 algorithm models serially.
   Result: up to 24 channels on the T4, with memory and GPU utilization still not saturated.
2. Multiple threads execute the 6 algorithms in parallel, each using its own stream (created with cudaStreamCreateWithFlags and the cudaStreamNonBlocking flag).
   Result: up to 26 channels on the T4, with memory and GPU utilization still not saturated.
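
To make the second setup concrete, here is a minimal sketch of the pattern, not the actual application code. Each host thread creates its own stream with cudaStreamCreateWithFlags and cudaStreamNonBlocking and issues all of its work on that stream. The kernel scale_kernel and the helpers worker and run_workers are hypothetical stand-ins for the real algorithm models.

```cuda
#include <cuda_runtime.h>
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for one algorithm model: scales a buffer in place.
__global__ void scale_kernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// One worker thread: owns a non-blocking stream, a device buffer and a
// pinned host buffer (pinned memory is needed for truly asynchronous copies).
static void worker(int n, float factor, std::atomic<int>* failures) {
    cudaStream_t stream = nullptr;
    float* d_buf = nullptr;
    float* h_buf = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    bool ok = cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) == cudaSuccess
           && cudaMalloc((void**)&d_buf, bytes) == cudaSuccess
           && cudaMallocHost((void**)&h_buf, bytes) == cudaSuccess;
    if (ok) {
        for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;
        // Copy in, compute, copy out: all queued on this thread's stream.
        cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
        scale_kernel<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n, factor);
        cudaMemcpyAsync(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost, stream);
        // Wait only for this stream, not for the whole device.
        ok = cudaStreamSynchronize(stream) == cudaSuccess && h_buf[n - 1] == factor;
    }
    if (h_buf) cudaFreeHost(h_buf);
    if (d_buf) cudaFree(d_buf);
    if (stream) cudaStreamDestroy(stream);
    if (!ok) failures->fetch_add(1);
}

// Launches num_workers host threads; returns how many of them failed.
int run_workers(int num_workers, int n) {
    std::atomic<int> failures{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_workers; ++t)
        threads.emplace_back(worker, n, 2.0f + t, &failures);
    for (auto& th : threads) th.join();
    return failures.load();
}
```

Calling run_workers(6, 1 << 16) from main would mirror the 6-algorithm case. Whether the kernels actually overlap on the GPU depends on how many resources each one uses, so a profiler timeline is the way to check.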

I looked at NVIDIA's official website, which suggests using the default stream from multiple threads to get more throughput. Is this still true in the latest CUDA 10? And what else can I do to get more throughput?
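
If that suggestion refers to the per-thread default stream (my assumption), it is normally enabled by compiling with nvcc --default-stream per-thread, or by defining CUDA_API_PER_THREAD_DEFAULT_STREAM before including any CUDA header. The sketch below shows the same idea by passing the cudaStreamPerThread handle explicitly, so it does not depend on the compile flag. The kernel add_one and the helpers default_stream_worker and count_successful_threads are hypothetical names used only for illustration.

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Hypothetical stand-in for real work: adds 1 to every element.
__global__ void add_one(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Each host thread issues its work on its own per-thread default stream,
// so no stream has to be created or destroyed explicitly.
static void default_stream_worker(int n, int* result) {
    *result = -1;
    int* d_data = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    if (cudaMalloc((void**)&d_data, bytes) != cudaSuccess) return;

    int last = -1;
    cudaMemsetAsync(d_data, 0, bytes, cudaStreamPerThread);
    add_one<<<(n + 255) / 256, 256, 0, cudaStreamPerThread>>>(d_data, n);
    cudaMemcpyAsync(&last, d_data + n - 1, sizeof(int),
                    cudaMemcpyDeviceToHost, cudaStreamPerThread);
    // Synchronize only this thread's stream before reading the result.
    if (cudaStreamSynchronize(cudaStreamPerThread) == cudaSuccess) *result = last;
    cudaFree(d_data);
}

// Runs num_threads workers and returns how many produced the expected value 1.
int count_successful_threads(int num_threads, int n) {
    std::vector<int> results(num_threads, -1);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(default_stream_worker, n, &results[t]);
    for (auto& th : threads) th.join();
    int good = 0;
    for (int r : results) good += (r == 1);
    return good;
}
```

Compared with creating streams by hand, this mainly saves the stream bookkeeping. I would not assume it gives more throughput than explicit non-blocking streams without measuring.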

If anyone has any idea, please let me know. Thanks!!!
