I’m building an application using Triton Server’s in-process API and running it on a Jetson Orin Nano Dev Kit with the latest Jetson Linux and JetPack. I’m modeling the application on the simple.cc example here: https://github.com/triton-inference-server/server/blob/main/src/simple.cc.
The simple.cc example supports all three TRITONSERVER_MemoryType values for the input tensors, selected via command-line flags. However, neither the example nor the API docs offer much insight into when to choose each type, or what the performance implications of each are.
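For concreteness, here is roughly what I mean by "selecting a memory type" when attaching an input buffer. This is a minimal sketch, not code from simple.cc: the input name, byte size, and allocation flags are placeholders, and error checking is omitted.

```cpp
#include <cstdlib>

#include <cuda_runtime_api.h>

#include "triton/core/tritonserver.h"

// Attach an input buffer to a request under the given memory type.
// "INPUT0" and byte_size are placeholders; error handling is omitted.
void AppendInput(
    TRITONSERVER_InferenceRequest* request,
    TRITONSERVER_MemoryType memory_type, size_t byte_size)
{
  void* base = nullptr;
  switch (memory_type) {
    case TRITONSERVER_MEMORY_CPU:
      base = malloc(byte_size);  // plain system memory
      break;
    case TRITONSERVER_MEMORY_CPU_PINNED:
      cudaHostAlloc(&base, byte_size, cudaHostAllocPortable);  // page-locked host memory
      break;
    case TRITONSERVER_MEMORY_GPU:
      cudaMalloc(&base, byte_size);  // device memory
      break;
  }

  // ... copy tensor data into `base` by whatever means the memory type allows ...

  TRITONSERVER_InferenceRequestAppendInputData(
      request, "INPUT0", base, byte_size, memory_type, 0 /* memory_type_id */);
}
```

All three paths work, which is exactly why I’m unsure which one to prefer.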
Does the right answer vary with the model? The hardware? If I’m building an application where I cannot know ahead of time what model is being run, should I let the operator decide which type of memory to ask for via configuration? If so, at what granularity should I let them do so? Per model? Per tensor per model? What guidance should I give to help the operator make the best choice given their environment?
Or have I overlooked something in the Triton API that tells me which memory type the server would prefer for the inputs of a given inference?
Thanks,
Andrew
P.S. See also my related question about memory types and allocations for output tensors here: "Triton Server In-Process API: Allocator callback always called with MEMORY_TYPE_CPU"