Model Inference in C#

I have been building and training a model in Python using TensorFlow. Eventually I want to move this model into an existing C# application to perform inference. My question is: what is the best way to do this? Can I use TensorRT to deploy the model into the C# environment?

Hi,

TensorRT provides C++ and Python APIs for custom applications, and a command line tool called trtexec, all of which can be used for inference. You’ll need to communicate with one of these from your C# application to do inference.
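One common pattern for bridging the gap is to wrap the TensorRT Python API in a small script that the C# application launches as a subprocess, exchanging raw tensor bytes over stdin/stdout. Below is a minimal sketch of just the transport layer; `run_engine` is a hypothetical placeholder where the actual TensorRT execution context would be invoked, and the fixed `(1, 3)` shape is an assumption for illustration:

```python
import sys
import numpy as np

def read_tensor(stream, shape, dtype=np.float32):
    """Read a raw tensor of a known shape and dtype from a byte stream."""
    n_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    buf = stream.read(n_bytes)
    return np.frombuffer(buf, dtype=dtype).reshape(shape)

def write_tensor(stream, arr):
    """Write a tensor back as raw contiguous bytes (C# reads the same layout)."""
    stream.write(np.ascontiguousarray(arr).tobytes())
    stream.flush()

def run_engine(x):
    # Placeholder: in a real script this would hand the data to a
    # TensorRT execution context; here we simply echo the input.
    return x

if __name__ == "__main__":
    x = read_tensor(sys.stdin.buffer, shape=(1, 3))
    write_tensor(sys.stdout.buffer, run_engine(x))
```

On the C# side you would write the same raw byte layout to the process's standard input and read the result back, so no serialization format beyond the dtype/shape convention is needed.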

Thanks for the reply. So then I have a followup question.

I am interested in passing data already on the GPU into a DL network, with the output also staying on the GPU to be passed to the next element in a pipeline. If possible, I'd like to do this without performing any CPU/GPU copies. Essentially this would involve passing a CUDA array to a model. More importantly, would this be possible if the application runs within C#? Could C# communicate with either the C++ or Python API (or even the command line tool) to perform this type of task?

As of now I am working with the GPU using managedCuda: https://kunzmi.github.io/managedCuda/

Hi,

I don’t think I have any official samples to point you to for that, but in general the DL frameworks usually have some kind of GPU/Device memory objects that you can likely re-use to avoid copying back and forth.

I think the code snippet from this GitHub issue would be along those lines for doing this with PyTorch: https://github.com/NVIDIA/TensorRT/issues/305#issue-543768773

Specifically these lines, where the author creates arrays/tensors on the GPU with PyTorch, does some computation with them in PyTorch, and re-uses them as the bindings for the TensorRT engine:

import tensorrt as trt
import torch

# Deserialize the serialized engine from disk
with open(trt_file, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Allocate the output tensors directly on the GPU with PyTorch
outputs = [torch.zeros(size=i, dtype=torch.float32).to("cuda:0") for i in output_shapes]

# Bindings are raw device pointers: the input tensors ("its" in the
# original issue) followed by the outputs
bindings = [i.data_ptr() for i in its] + [i.data_ptr() for i in outputs]
print(bindings)

# Run inference on PyTorch's current CUDA stream, then wait for completion
context.execute_async(1, bindings, torch.cuda.current_stream().cuda_stream)
torch.cuda.current_stream().synchronize()

Though I’m not sure this code is 100% correct, it might be a good reference. You could probably use the same idea with CUDA/Thrust device arrays directly, if you're using those.
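The key point in the snippet above is that TensorRT bindings are just a list of raw integer device pointers, so any array library that can hand out such a pointer can participate: torch's `data_ptr()`, PyCUDA's `gpuarray.ptr`, or managedCuda's `CUdeviceptr` on the C# side. As a host-side analogue that needs no GPU, the same pattern can be illustrated with NumPy's `ctypes.data` (the shapes here are arbitrary examples):

```python
import numpy as np

# Any array object that exposes a raw pointer to its buffer can be used
# to build a "bindings"-style list. NumPy's host pointer stands in here
# purely to illustrate the shape of the API; on the GPU the pointers
# would come from data_ptr(), gpuarray.ptr, or a CUdeviceptr instead.
inputs = [np.zeros((1, 3), dtype=np.float32)]
outputs = [np.zeros((1, 10), dtype=np.float32)]

bindings = [a.ctypes.data for a in inputs] + [a.ctypes.data for a in outputs]

# Each entry is a plain Python int, which is exactly what
# execute_async expects in its bindings argument.
assert all(isinstance(p, int) for p in bindings)
```

As long as managedCuda allocations and the TensorRT engine live in the same CUDA context, passing the device pointer values across the C#/Python boundary is what lets you avoid the CPU round trips.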