Triton Server In-Process API: Selecting the memory type for input tensors

I’m building an application using Triton Server’s in-process API on a Jetson Orin Nano Developer Kit with the latest Jetson Linux and JetPack. I’m modeling the application on the example here:

The example supports all three TRITONSERVER_MemoryType values for the input tensors, selected by command-line flags. However, neither the example nor the API docs provide much insight into when or how one might select one type over another, or what the performance implications of each are.

Does the right answer vary with the model? The hardware? If I’m building an application where I cannot know ahead of time what model is being run, should I let the operator decide which type of memory to ask for via configuration? If so, at what granularity should I let them do so? Per model? Per tensor per model? What guidance should I give to help the operator make the best choice given their environment?

Or have I overlooked something in the Triton API that informs me of what memory type the server would prefer I use for inputs for a given inference?


P.S. See also my related question about memory types and allocations for output tensors here: Triton Server In-Process API: Allocator callback always called with MEMORY_TYPE_CPU


The preferred memory type depends mostly on the implementation, i.e. on the framework (backend) you choose to use.
In general, though, GPU memory is expected to perform better than CPU memory on Jetson.

