I have been building and training a model in Python using TensorFlow. Eventually I want to move this model into an existing C# application to perform inference. My question is: what is the best way to do this? Can I use TensorRT to deploy the model into the C# environment?
TensorRT provides C++ and Python APIs for custom applications, as well as a command-line tool called trtexec; any of these can be used for inference. You'll need to communicate with one of them from your C# application to do inference.
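For example, one option is to drive trtexec (or a small Python worker process) as a subprocess from your C# application via System.Diagnostics.Process. A minimal sketch of the idea, in Python for brevity; the engine file name is a placeholder, and `--loadEngine` is the trtexec flag for running a prebuilt serialized engine:

```python
import subprocess

def build_trtexec_cmd(engine_path, extra_args=None):
    # trtexec can run inference on a prebuilt engine via --loadEngine.
    # engine_path is a placeholder; point it at your serialized engine file.
    cmd = ["trtexec", f"--loadEngine={engine_path}"]
    if extra_args:
        cmd.extend(extra_args)
    return cmd

cmd = build_trtexec_cmd("model.engine")
print(cmd)  # ['trtexec', '--loadEngine=model.engine']
# To actually run it (requires TensorRT installed and trtexec on PATH):
# subprocess.run(cmd, check=True)
```

A subprocess is the simplest integration, but note it won't help with the zero-copy GPU pipeline discussed below; for that you'd call the C++ API (e.g. through a thin native wrapper and P/Invoke) or the Python API in-process.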
Thanks for the reply. So I have a follow-up question.
I am interested in passing data that already resides on the GPU into a DL network, with the output also staying on the GPU to be passed to the next element in a pipeline. If possible, I'd like to do this without performing any CPU/GPU copies. Essentially this would involve passing a CUDA array to a model. More importantly, would this be possible if the application runs within C#? Could C# communicate with either the C++ or Python API (or even the command-line tool) to perform this type of task?
As of now I am working with the GPU using managedCuda: https://kunzmi.github.io/managedCuda/
I don’t think I have any official samples to point you to for that, but in general the DL frameworks usually have some kind of GPU/device memory object that you can likely re-use to avoid copying back and forth.
I think the code snippet from this GitHub issue would be along those lines for doing this with PyTorch: https://github.com/NVIDIA/TensorRT/issues/305#issue-543768773
Specifically these lines, where he creates arrays/tensors on the GPU with PyTorch, does some computation with them in PyTorch, and re-uses them as the bindings for the TensorRT engine:
    # Deserialize the TensorRT engine from file
    with open(trt_file, "rb") as f:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Pre-allocate output tensors directly on the GPU with PyTorch
    outputs = [torch.zeros(size=i, dtype=torch.float32).to("cuda:0") for i in output_shapes]

    # Re-use the raw device pointers of the input/output tensors as TensorRT bindings
    bindings = [i.data_ptr() for i in inputs] + [i.data_ptr() for i in outputs]
    print(bindings)

    # Run inference on PyTorch's CUDA stream, then wait for it to finish
    context.execute_async(1, bindings, torch.cuda.current_stream().cuda_stream)
    torch.cuda.current_stream().synchronize()
Though I’m not sure this code is 100% correct, it might be a good reference. You could probably apply the same idea with CUDA/Thrust device arrays directly if you are using those.
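One practical detail when pre-allocating the output buffers in the snippet above: whatever allocates the device memory (PyTorch, managedCuda, or raw CUDA) needs each binding's byte size. A small helper, assuming shapes are plain tuples of ints and float32 bindings (4 bytes per element):

```python
import math

def binding_nbytes(shape, itemsize=4):
    # Total bytes for one binding: element count times bytes per element.
    # itemsize=4 assumes float32; adjust for other binding dtypes.
    return math.prod(shape) * itemsize

# e.g. a single 3x224x224 float32 image binding
print(binding_nbytes((1, 3, 224, 224)))  # 602112
```

The same arithmetic applies on the C# side: managedCuda allocations take a size in bytes, so you'd compute this from the engine's binding shapes before allocating.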