TL;DR:
Can we run the same model in the same process simultaneously on both DLAs? The model partially falls back to GPU.
I have a model that partially runs on DLA (and falls back to GPU for the rest). To make it run faster, I try to group two inputs into a pair, enqueue one on each DLA core, then sync and read the results.
I noticed something weird:
Here is the code:
Initialize:
ICudaEngine* engines[2];
for (int i = 0; i < 2; i++) {
    // Read the serialized DLA engine (built with GPU fallback enabled).
    IRuntime* runtime = createInferRuntime(gLogger);
    runtime->setDLACore(i);
    engines[i] = runtime->deserializeCudaEngine(src.get_address(), src.get_size());
}
cudaStream_t stream[2];
IExecutionContext* contexts[2];
for (int i = 0; i < 2; i++) {
    // allocate input_buffer[i]
    ...
    // allocate out_buffer[i]
    ...
    contexts[i] = engines[i]->createExecutionContext();
    cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
}
Run:
for (int i = 0; i < 2; i++) {
    void* buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // cudaStreamSynchronize(stream[i]);
}
for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(stream[i]);
}
// read the results.
...
With the cudaStreamSynchronize inside the loop left commented out, the total processing time is well under 2x a single run, but the results are simply wrong and unstable (the same input produces different outputs).
Only when I put the sync immediately after each enqueueV2 (which serializes the two inference calls, as in the sketch below) are the results correct, but then the processing time equals 2x a single run, which makes sense since the runs are effectively serialized.
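For reference, the serialized variant that gives correct results is just the same Run loop as above with the per-enqueue sync uncommented:

for (int i = 0; i < 2; i++) {
    void* buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // Blocking here forces inference on DLA core 0 to finish before core 1 starts.
    cudaStreamSynchronize(stream[i]);
}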
Am I doing something wrong here? Is it possible that the GPU-fallback parts of the two engines share some context and overwrite each other's results?