I need to run 8 TensorRT inference instances on 2 GPUs, that is, 4 execution contexts per GPU.
I think I need to create 2 engine builders, 8 threads (each starting by calling cudaSetDevice()), 4x2 execution contexts, and 4x2 CUDA streams, as below:
engine1 (running on main process) -> ctx1 (running on thread1 with stream 1),
                                  -> ctx2 (running on thread2 with stream 2),
                                  -> ctx3 (running on thread3 with stream 3),
                                  -> ctx4 (running on thread4 with stream 4)
engine2 (running on main process) -> ctx5 (running on thread5 with stream 5),
                                  -> ctx6 (running on thread6 with stream 6),
                                  -> ctx7 (running on thread7 with stream 7),
                                  -> ctx8 (running on thread8 with stream 8)
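A minimal sketch of that layout, assuming TensorRT's C++ API (engine deserialization, buffer allocation, and error checking are omitted, so this will not run standalone; `bindings` is a placeholder for the I/O buffer pointers):

```cpp
// Sketch only: one worker thread per execution context, each with its own
// CUDA stream, 4 contexts per engine/GPU.
#include <thread>
#include <vector>
#include <cuda_runtime_api.h>
#include <NvInfer.h>

void worker(nvinfer1::ICudaEngine* engine, int device, void** bindings) {
    cudaSetDevice(device);              // bind this thread to the right GPU
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    nvinfer1::IExecutionContext* ctx = engine->createExecutionContext();

    // per-frame loop would go here
    ctx->enqueueV2(bindings, stream, nullptr);  // async inference on this stream
    cudaStreamSynchronize(stream);

    ctx->destroy();   // TRT 7 style; TRT 8+ uses delete / smart pointers
    cudaStreamDestroy(stream);
}

void launch(nvinfer1::ICudaEngine* engine1, nvinfer1::ICudaEngine* engine2,
            void** bindings) {
    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker, engine1, 0, bindings);
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker, engine2, 1, bindings);
    for (auto& t : pool) t.join();
}
```

Note that with this single-process design, enqueues from different threads onto the same GPU land on different streams but still share the GPU's scheduler, which is why the executions interleave.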
But it seems those executions are still interleaved rather than truly parallel.
So then I would apply MPS: instead of 4 contexts per engine and GPU, each client process would use just 1 context on 1 engine.
[GPU 0]
MPS client 1 : engine 1 -> ctx1 (no explicit thread/stream creation) ; processes area 1 of the same video frame
MPS client 2 : engine 1 -> ctx2 (no explicit thread/stream creation) ; processes area 2 of the same video frame
MPS client 3 : engine 1 -> ctx3 (no explicit thread/stream creation) ; processes area 3 of the same video frame
MPS client 4 : engine 1 -> ctx4 (no explicit thread/stream creation) ; processes area 4 of the same video frame
---------------------------------------------------------
[GPU 1]
MPS client 1 : engine 2 -> ctx5 (no explicit thread/stream creation) ; processes area 5 of the same video frame
MPS client 2 : engine 2 -> ctx6 (no explicit thread/stream creation) ; processes area 6 of the same video frame
MPS client 3 : engine 2 -> ctx7 (no explicit thread/stream creation) ; processes area 7 of the same video frame
MPS client 4 : engine 2 -> ctx8 (no explicit thread/stream creation) ; processes area 8 of the same video frame
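For reference, the MPS daemon has to be started before the client processes launch; a typical setup looks roughly like this (directory paths and the `trt_worker` binary name are placeholders):

```shell
# Start the MPS control daemon (it manages all GPUs visible to it).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps_pipe
export CUDA_MPS_LOG_DIRECTORY=/tmp/mps_log
mkdir -p "$CUDA_MPS_PIPE_DIRECTORY" "$CUDA_MPS_LOG_DIRECTORY"
nvidia-cuda-mps-control -d

# Each client process picks its GPU, then creates its engine/context:
CUDA_VISIBLE_DEVICES=0 ./trt_worker --area 1 &   # hypothetical worker binary
CUDA_VISIBLE_DEVICES=1 ./trt_worker --area 5 &

# Shut MPS down when finished.
echo quit | nvidia-cuda-mps-control
```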
BUT,
one IMPORTANT point: I need to keep those TRT contexts synchronized while they run, because they must all be processing the same video at the same frame count.
I guess MPS is a good choice only when each process handles separate input data, because it makes each engine/context run independently and asynchronously.
Is there a good way to share a kind of current-frame-number status while each one runs concurrently?
I think MPS enables the contexts to share memory, but I don't know how to handle that programmatically.
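Since MPS clients are separate OS processes, a plain in-process atomic won't work; one common approach is a small POSIX shared-memory segment holding the current frame counter, which every process maps at startup. A minimal sketch, assuming Linux and a made-up segment name `/trt_frame_sync` (error handling mostly omitted):

```cpp
#include <atomic>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Shared status block: the decoder publishes the frame index here, and every
// TRT worker process reads it.
struct FrameStatus {
    std::atomic<uint64_t> frame{0};
};

// Each process (decoder + 8 workers) maps the same named segment.
// A freshly created segment is zero-filled, which on mainstream platforms is
// a valid representation of atomic<uint64_t>(0).
FrameStatus* map_shared_status() {
    int fd = shm_open("/trt_frame_sync", O_CREAT | O_RDWR, 0666);
    if (fd < 0) return nullptr;
    ftruncate(fd, sizeof(FrameStatus));
    void* p = mmap(nullptr, sizeof(FrameStatus),
                   PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? nullptr : static_cast<FrameStatus*>(p);
}

// Producer (decoder process) publishes the next frame number.
void announce_frame(FrameStatus* st, uint64_t n) {
    st->frame.store(n, std::memory_order_release);
}

// Consumer (a TRT worker process) reads the latest published frame number.
uint64_t current_frame(const FrameStatus* st) {
    return st->frame.load(std::memory_order_acquire);
}

// Consumer blocks (busy-waits here for brevity) until the frame advances
// past the one it last processed.
uint64_t wait_for_next_frame(const FrameStatus* st, uint64_t last_seen) {
    uint64_t f;
    while ((f = current_frame(st)) <= last_seen)
        usleep(100);  // a real implementation might use a semaphore instead
    return f;
}
```

To keep all 8 contexts on the same frame, this could be extended with a per-worker "done" counter in the same struct, so the decoder only advances `frame` after every worker has finished the previous one (a simple barrier across processes).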