Is it possible to use a single instance of tensorrt context/CudaEngine with multiple streams concurrently?
In our problem, we don’t know the batchSize a priori; it depends on a value set in a configuration file. As a result, at runtime the batchSize might exceed the maxBatchSize used when serializing the engine.
I was hoping I could split the batch into N chunks of at most maxBatchSize each and process them in parallel, one per stream. However, I get incorrect results back unless I put a cudaDeviceSynchronize between the context->enqueue invocations.
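For what it's worth, my understanding is that a single IExecutionContext holds per-invocation state, so concurrent enqueue() calls on the same context can race; sharing the ICudaEngine across streams is fine, but each stream needs its own context (and its own binding buffers). A sketch of what I mean, assuming the batch-oriented enqueue() API and pre-allocated per-chunk device buffers (runBatches and chunkBindings are hypothetical names):

```cpp
// Sketch only: assumes a pre-built ICudaEngine and the implicit-batch
// enqueue(batchSize, bindings, stream, event) API.
#include <vector>
#include <cuda_runtime.h>
#include <NvInfer.h>

void runBatches(nvinfer1::ICudaEngine* engine,
                int numChunks, int maxBatchSize,
                std::vector<std::vector<void*>>& chunkBindings)
{
    std::vector<nvinfer1::IExecutionContext*> contexts(numChunks);
    std::vector<cudaStream_t> streams(numChunks);

    // One context per stream: contexts hold per-invocation state, so a
    // single shared context cannot service concurrent enqueues safely.
    for (int i = 0; i < numChunks; ++i) {
        contexts[i] = engine->createExecutionContext();
        cudaStreamCreate(&streams[i]);
    }

    // Launch all chunks; each (context, stream) pair is independent,
    // so no cudaDeviceSynchronize between enqueues is needed.
    for (int i = 0; i < numChunks; ++i) {
        contexts[i]->enqueue(maxBatchSize, chunkBindings[i].data(),
                             streams[i], nullptr);
    }

    // Wait per-stream rather than device-wide.
    for (int i = 0; i < numChunks; ++i) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
        contexts[i]->destroy();
    }
}
```

In a real service you would create the contexts and streams once at startup rather than per call, since createExecutionContext() is comparatively expensive.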