What is the best way to run multiple TensorRT contexts on multiple GPUs, with each context processing the same video frame?

I need to execute 8 TensorRT inference instances on 2 GPUs, that is, 4 execution contexts per GPU.
I think I need to create 2 engines (one per GPU), 8 threads (each starting with a cudaSetDevice() call), 4x2 execution contexts, and 4x2 CUDA streams, as below:

engine1 (on main process) -> ctx1 (runs on thread1 with stream 1)
                          -> ctx2 (runs on thread2 with stream 2)
                          -> ctx3 (runs on thread3 with stream 3)
                          -> ctx4 (runs on thread4 with stream 4)

engine2 (on main process) -> ctx5 (runs on thread5 with stream 5)
                          -> ctx6 (runs on thread6 with stream 6)
                          -> ctx7 (runs on thread7 with stream 7)
                          -> ctx8 (runs on thread8 with stream 8)

But the executions still seem to be interleaved on each GPU, not truly parallel.
So my next idea is to apply MPS: instead of 4 contexts per engine/GPU, use just 1 context per engine in each MPS client process.

[GPU 0]
MPS client 1 : engine 1 -> ctx1 (no explicit thread/stream creation) ; processes area 1 of the same video frame
MPS client 2 : engine 1 -> ctx2 (no explicit thread/stream creation) ; processes area 2 of the same video frame
MPS client 3 : engine 1 -> ctx3 (no explicit thread/stream creation) ; processes area 3 of the same video frame
MPS client 4 : engine 1 -> ctx4 (no explicit thread/stream creation) ; processes area 4 of the same video frame
---------------------------------------------------------
[GPU 1]
MPS client 1 : engine 2 -> ctx5 (no explicit thread/stream creation) ; processes area 5 of the same video frame
MPS client 2 : engine 2 -> ctx6 (no explicit thread/stream creation) ; processes area 6 of the same video frame
MPS client 3 : engine 2 -> ctx7 (no explicit thread/stream creation) ; processes area 7 of the same video frame
MPS client 4 : engine 2 -> ctx8 (no explicit thread/stream creation) ; processes area 8 of the same video frame
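To make "area N" concrete, I am assuming a simple 2-row x 4-column tiling of the frame (the exact split doesn't matter for the question; areaForClient is just my placeholder):

```cpp
struct Rect { int x, y, w, h; };

// Hypothetical tiling: 2 rows x 4 columns, one tile per context/MPS client.
// k in [0, 8): k = 0..3 -> top row, k = 4..7 -> bottom row.
Rect areaForClient(int frameW, int frameH, int k) {
    const int cols = 4, rows = 2;
    const int c = k % cols, r = k / cols;
    const int w = frameW / cols, h = frameH / rows;
    return { c * w, r * h, w, h };   // pixel rectangle this client crops out
}
```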

BUT,
one IMPORTANT point: I need to keep those TRT contexts synchronized, because they must all be processing the same video at the same frame count.
I suspect MPS is only a good choice when each client processes separate input data, since MPS makes each engine/context run independently and asynchronously.

Is there a good way to share a kind of current-frame-number status while the contexts run concurrently?
I think MPS enables the contexts to share memory, but I don't know how to handle that programmatically.
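The best idea I have so far is a plain OS-level shared-memory counter on the host side (nothing MPS-specific): the decoder publishes the current frame number into shared memory, and each worker process waits for it before running inference. A minimal sketch using fork() and an anonymous shared mapping (unrelated processes would use shm_open()/mmap() with a common name instead; FrameState/waitForFrame are made-up names):

```cpp
#include <atomic>
#include <cstdint>
#include <new>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// State shared by all processes on the same host.
struct FrameState {
    std::atomic<std::uint64_t> frame{0};   // latest decoded frame number
};

// Anonymous shared mapping, inherited across fork().
FrameState* createSharedState() {
    void* p = mmap(nullptr, sizeof(FrameState), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return new (p) FrameState{};           // construct the atomic in place
}

// Worker side: wait until frame n has been published.
void waitForFrame(const FrameState* s, std::uint64_t n) {
    while (s->frame.load(std::memory_order_acquire) < n)
        usleep(100);                       // back off instead of hard spinning
}

// Demo: child plays the decoder, parent plays one TRT worker.
std::uint64_t demo() {
    FrameState* s = createSharedState();
    if (fork() == 0) {                     // "decoder": publish frames 1..5
        for (std::uint64_t f = 1; f <= 5; ++f)
            s->frame.store(f, std::memory_order_release);
        _exit(0);
    }
    waitForFrame(s, 5);                    // "worker": block until frame 5
    wait(nullptr);                         // reap the child
    return s->frame.load();
}
```

If true lockstep is required (every worker finishes frame N before any worker starts N+1), the polling loop could be replaced with a process-shared barrier, e.g. a pthread_barrier_t initialized with PTHREAD_PROCESS_SHARED and placed in the same shared mapping.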