What is the best way to run multiple TensorRT contexts on multiple GPUs, with each context processing the same video frame?

I need to execute 8 TensorRT inference instances on 2 GPUs, that is, 4 execution contexts per GPU.
I think I need to create 2 engines (one per GPU), 8 threads (each starting with a cudaSetDevice() call), 4x2 execution contexts, and 4x2 CUDA streams, as below:

engine1 (on main process) -> ctx1 (runs on thread1 with stream 1)
                          -> ctx2 (runs on thread2 with stream 2)
                          -> ctx3 (runs on thread3 with stream 3)
                          -> ctx4 (runs on thread4 with stream 4)

engine2 (on main process) -> ctx5 (runs on thread5 with stream 5)
                          -> ctx6 (runs on thread6 with stream 6)
                          -> ctx7 (runs on thread7 with stream 7)
                          -> ctx8 (runs on thread8 with stream 8)

But the executions still seem to be interleaved on each GPU, not truly parallel.
So my next idea is to apply MPS: instead of 4 contexts per engine/GPU, use just 1 context per engine in each MPS client process.

[GPU 0]
MPS client 1 : engine 1 -> ctx1 (no explicit thread/stream creation) ; processes area 1 of the same video frame
MPS client 2 : engine 1 -> ctx2 (no explicit thread/stream creation) ; processes area 2 of the same video frame
MPS client 3 : engine 1 -> ctx3 (no explicit thread/stream creation) ; processes area 3 of the same video frame
MPS client 4 : engine 1 -> ctx4 (no explicit thread/stream creation) ; processes area 4 of the same video frame
---------------------------------------------------------
[GPU 1]
MPS client 1 : engine 2 -> ctx5 (no explicit thread/stream creation) ; processes area 5 of the same video frame
MPS client 2 : engine 2 -> ctx6 (no explicit thread/stream creation) ; processes area 6 of the same video frame
MPS client 3 : engine 2 -> ctx7 (no explicit thread/stream creation) ; processes area 7 of the same video frame
MPS client 4 : engine 2 -> ctx8 (no explicit thread/stream creation) ; processes area 8 of the same video frame
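To make "area N" concrete, I am assuming a simple 2-row x 4-column tiling of the frame (the exact split doesn't matter for the question; areaForClient is just my placeholder):

```cpp
struct Rect { int x, y, w, h; };

// Hypothetical tiling: 2 rows x 4 columns, one tile per context/MPS client.
// k in [0, 8): k = 0..3 -> top row, k = 4..7 -> bottom row.
Rect areaForClient(int frameW, int frameH, int k) {
    const int cols = 4, rows = 2;
    const int c = k % cols, r = k / cols;
    const int w = frameW / cols, h = frameH / rows;
    return { c * w, r * h, w, h };   // pixel rectangle this client crops out
}
```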

BUT,
one IMPORTANT point: I need to keep those TRT contexts synchronized, because they must all be processing the same video at the same frame count.
I suspect MPS is only a good choice when each client processes separate input data, since MPS makes each engine/context run independently and asynchronously.

Is there a good way to share a kind of current-frame-number status while the contexts run concurrently?
I think MPS enables the contexts to share memory, but I don't know how to handle that programmatically.
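The best idea I have so far is a plain OS-level shared-memory counter on the host side (nothing MPS-specific): the decoder publishes the current frame number into shared memory, and each worker process waits for it before running inference. A minimal sketch using fork() and an anonymous shared mapping (unrelated processes would use shm_open()/mmap() with a common name instead; FrameState/waitForFrame are made-up names):

```cpp
#include <atomic>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// State shared by all processes on the same host.
struct FrameState {
    std::atomic<std::uint64_t> frame{0};   // latest decoded frame number
};

// Anonymous shared mapping, inherited across fork().
FrameState* createSharedState() {
    void* p = mmap(nullptr, sizeof(FrameState), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return new (p) FrameState{};           // construct the atomic in place
}

// Worker side: wait until frame n has been published.
void waitForFrame(const FrameState* s, std::uint64_t n) {
    while (s->frame.load(std::memory_order_acquire) < n)
        usleep(100);                       // back off instead of hard spinning
}

// Demo: child plays the decoder, parent plays one TRT worker.
std::uint64_t demo() {
    FrameState* s = createSharedState();
    if (fork() == 0) {                     // "decoder": publish frames 1..5
        for (std::uint64_t f = 1; f <= 5; ++f)
            s->frame.store(f, std::memory_order_release);
        _exit(0);
    }
    waitForFrame(s, 5);                    // "worker": block until frame 5
    wait(nullptr);                         // reap the child
    return s->frame.load();
}
```

If true lockstep is required (every worker finishes frame N before any worker starts N+1), the polling loop could be replaced with a process-shared barrier, e.g. a pthread_barrier_t initialized with PTHREAD_PROCESS_SHARED and placed in the same shared mapping.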