TensorRT unnecessary synchronization in multi-GPU system

Since this appears to be a CUDA issue rather than a TensorRT one, I created a new thread here: CUDA won't concurrently run kernels on multiple devices from within same process