I am optimizing a CUDA program to get more throughput, and I ran the following experiments:
- A single thread executes the 6 algorithm models serially.
The result: up to 24 channels on the T4, with GPU memory and GPU utilization still having headroom.
- Multiple threads execute the 6 algorithms in parallel, using multiple streams (each stream is created with cudaStreamCreateWithFlags and the cudaStreamNonBlocking flag).
The result: up to 26 channels on the T4, with GPU memory and GPU utilization still having headroom.
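To make the second experiment concrete, here is a minimal sketch of that pattern: one host thread per model, each with its own non-blocking stream. It is only an illustration under stated assumptions, not the real program. `scaleKernel`, `runModel`, the buffer size, and the use of `std::thread` are hypothetical placeholders for the actual algorithm models.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the GPU work of one algorithm model.
__global__ void scaleKernel(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Each host thread creates its own non-blocking stream and issues all of
// its work into that stream, so it does not synchronize with the legacy
// default stream.
void runModel(int modelId, int n) {
    cudaStream_t stream;
    if (cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking) != cudaSuccess) {
        std::fprintf(stderr, "model %d: stream creation failed\n", modelId);
        return;
    }

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "model %d: cudaMalloc failed\n", modelId);
        cudaStreamDestroy(stream);
        return;
    }

    const int block = 256;
    const int grid = (n + block - 1) / block;
    cudaMemsetAsync(d, 0, n * sizeof(float), stream);
    scaleKernel<<<grid, block, 0, stream>>>(d, n, 2.0f);

    // Wait only for this thread's stream, not for the whole device.
    cudaError_t err = cudaStreamSynchronize(stream);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "model %d: %s\n", modelId, cudaGetErrorString(err));
    }

    cudaFree(d);
    cudaStreamDestroy(stream);
}

int main() {
    const int kNumModels = 6;   // matches the 6 algorithm models above
    const int n = 1 << 20;      // arbitrary buffer size for the sketch

    std::vector<std::thread> workers;
    for (int m = 0; m < kNumModels; ++m) {
        workers.emplace_back(runModel, m, n);
    }
    for (auto& t : workers) {
        t.join();
    }
    return 0;
}
```

In a sketch like this, whether kernels from different streams actually overlap depends on how much of the GPU each kernel occupies, so creating more streams does not by itself guarantee more throughput.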
I looked at NVIDIA’s official website. It suggests using the default stream in a multi-threaded program to get more throughput. Is this still true in the latest CUDA 10? And what else can I do to get more throughput?
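As far as I understand it, the suggestion refers to the per-thread default stream, which can be enabled for a whole translation unit by compiling with `nvcc --default-stream per-thread`, or used explicitly through the `cudaStreamPerThread` handle. The sketch below takes the explicit route. `addOneKernel` and `worker` are hypothetical names, and the sizes are arbitrary.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for one model's work.
__global__ void addOneKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// No stream is created here: cudaStreamPerThread refers to an implicit
// stream that is private to the calling host thread.
void worker(int id, int n) {
    int* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
        std::fprintf(stderr, "worker %d: cudaMalloc failed\n", id);
        return;
    }

    const int block = 256;
    const int grid = (n + block - 1) / block;
    cudaMemsetAsync(d, 0, n * sizeof(int), cudaStreamPerThread);
    addOneKernel<<<grid, block, 0, cudaStreamPerThread>>>(d, n);

    // Synchronize only this thread's default stream.
    cudaError_t err = cudaStreamSynchronize(cudaStreamPerThread);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "worker %d: %s\n", id, cudaGetErrorString(err));
    }
    cudaFree(d);
}

int main() {
    const int kNumThreads = 6;  // one thread per algorithm model
    const int n = 1 << 20;

    std::vector<std::thread> threads;
    for (int t = 0; t < kNumThreads; ++t) {
        threads.emplace_back(worker, t, n);
    }
    for (auto& th : threads) {
        th.join();
    }
    return 0;
}
```

With the compile flag instead of the explicit handle, launches written without a stream argument would go to each thread's own default stream, which is the main convenience compared with creating streams by hand.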
If anyone has any ideas, please let me know. Thanks!