TL;DR:
Can we run the same model in the same process simultaneously on both DLAs? The model partially falls back to GPU.
I have a model that partially runs on the DLA (and falls back to the GPU for the rest). To make it run faster, I group two inputs into a pair, enqueue one on each DLA core, then sync and read the results.
I've noticed something weird.
Here is the code:
Initialize:
ICudaEngine* engines[2];
for (int i = 0; i < 2; i++) {
    // read the serialized DLA engine (built with GPU fallback enabled)
    IRuntime* runtime = createInferRuntime(gLogger);
    runtime->setDLACore(i);  // must be set before deserializeCudaEngine()
    engines[i] = runtime->deserializeCudaEngine(src.get_address(), src.get_size());
}
cudaStream_t stream[2];
IExecutionContext* contexts[2];
for (int i = 0; i < 2; i++) {
    // allocate input_buffer[i]
    ...
    // allocate out_buffer[i]
    ...
    contexts[i] = engines[i]->createExecutionContext();
    cudaStreamCreateWithFlags(&stream[i], cudaStreamNonBlocking);
}
Run:
for (int i = 0; i < 2; i++) {
    void* buffers[2];
    buffers[inputIndex] = input_buffer[i]->gpu_buffer;   // inputIndex/outputIndex from getBindingIndex()
    buffers[outputIndex] = out_buffer[i]->gpu_buffer;
    contexts[i]->enqueueV2(buffers, stream[i], nullptr);
    // cudaStreamSynchronize(stream[i]);
}
for (int i = 0; i < 2; i++) {
    cudaStreamSynchronize(stream[i]);
}
// read the result.
...
With the cudaStreamSynchronize inside the loop commented out, the processing time is significantly less than 2x a single run, but the results are simply wrong and unstable (the same input produces different outputs).
Only when I sync immediately after each enqueueV2 (which serializes the two inference calls) are the results correct, but then the processing time equals 2x a single run (which makes sense, since the runs are effectively serialized).
Am I doing something wrong here? Is it possible the GPU side is sharing some context and the two runs are overwriting each other?
Update: I think you’re right; the problem was how I set up the memory. A simple version I prepared as an example actually works well. I changed how memory is managed in my code and it works now: both DLAs run at the same time, and it is both faster and accurate.
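Roughly, the idea was to give each execution context its own device input/output allocations, so the two in-flight inferences never touch the same memory. A minimal sketch of that (not my actual code; the sizes are placeholders):
void* dInput[2];
void* dOutput[2];
size_t inputBytes = ...;   // byte size of your input binding (placeholder)
size_t outputBytes = ...;  // byte size of your output binding (placeholder)
for (int i = 0; i < 2; i++) {
    // one private input/output allocation per context/stream
    cudaMalloc(&dInput[i], inputBytes);
    cudaMalloc(&dOutput[i], outputBytes);
}
// at enqueue time, context i only ever uses dInput[i] and dOutput[i]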
Hi @wsmlby - I’m running into similar problems as well.
Do you know how much of the model runs on the GPU versus on the DLAs? Did you check that when running on the GPU and DLA together you still get better performance?
Which JetPack version are you using?
Hi @wsmlby,
I am still facing some issues here, especially when I try to run DLA + GPU.
Could you please elaborate on what you meant by the change to memory management that solved this issue?
Any chance you can share the code?