Triton Server In-Process API: Allocator callback always called with MEMORY_TYPE_CPU

I’m building an application using Triton Server’s in-process API and running it on a Jetson Orin Nano Developer Kit with the latest Jetson Linux and JetPack. I’m modeling the application on the simple.cc example here: https://github.com/triton-inference-server/server/blob/main/src/simple.cc.

I’ve implemented an allocation callback that handles all the cases of TRITONSERVER_MemoryType, much like simple.cc does.

I’ve noticed, however, that after this allocation function is registered with TRITONSERVER_ResponseAllocatorNew, it only ever seems to be called with TRITONSERVER_MEMORY_CPU as the value of the memory_type parameter.
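For illustration, here is a condensed sketch of the callback shape and registration, modeled on simple.cc rather than copied from my actual code; the allocation choices and comments are just illustrative:

```cpp
#include <cstdlib>

#include <cuda_runtime_api.h>

#include "triton/core/tritonserver.h"

// Output allocator in the style of simple.cc. In my runs,
// preferred_memory_type is always TRITONSERVER_MEMORY_CPU.
static TRITONSERVER_Error*
ResponseAlloc(
    TRITONSERVER_ResponseAllocator* allocator, const char* tensor_name,
    size_t byte_size, TRITONSERVER_MemoryType preferred_memory_type,
    int64_t preferred_memory_type_id, void* userp, void** buffer,
    void** buffer_userp, TRITONSERVER_MemoryType* actual_memory_type,
    int64_t* actual_memory_type_id)
{
  *actual_memory_type = preferred_memory_type;
  *actual_memory_type_id = preferred_memory_type_id;
  *buffer_userp = nullptr;

  if (byte_size == 0) {
    // simple.cc returns a null buffer for zero-size tensors.
    *buffer = nullptr;
    return nullptr;
  }

  switch (preferred_memory_type) {
    case TRITONSERVER_MEMORY_GPU:
      if (cudaMalloc(buffer, byte_size) != cudaSuccess) {
        return TRITONSERVER_ErrorNew(
            TRITONSERVER_ERROR_INTERNAL, "cudaMalloc failed");
      }
      break;
    case TRITONSERVER_MEMORY_CPU_PINNED:
      if (cudaHostAlloc(buffer, byte_size, cudaHostAllocPortable) !=
          cudaSuccess) {
        return TRITONSERVER_ErrorNew(
            TRITONSERVER_ERROR_INTERNAL, "cudaHostAlloc failed");
      }
      break;
    case TRITONSERVER_MEMORY_CPU:
    default:
      // The only branch I ever see taken.
      *buffer = std::malloc(byte_size);
      break;
  }
  return nullptr;  // success
}

static TRITONSERVER_Error*
ResponseRelease(
    TRITONSERVER_ResponseAllocator* allocator, void* buffer,
    void* buffer_userp, size_t byte_size, TRITONSERVER_MemoryType memory_type,
    int64_t memory_type_id)
{
  // Free according to memory_type (cudaFree / cudaFreeHost / free); elided.
  return nullptr;
}

// Registration; the third callback (start) is optional and passed as nullptr:
// TRITONSERVER_ResponseAllocator* allocator = nullptr;
// TRITONSERVER_ResponseAllocatorNew(
//     &allocator, ResponseAlloc, ResponseRelease, nullptr /* start_fn */);
```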

When the server loads and executes the model I’m using, I do see GPU memory usage and GPU load in jtop, so I think everything is working OK.

But it leaves me curious: is the idea that the Triton Server implementation itself decides on my behalf which memory type/device is best given the model and available resources, and that in my particular setup it just so happens to consider CPU allocations preferable? Or will the callback always be called with TRITONSERVER_MEMORY_CPU as the default, with the callback implementation expected to make the choice? That’s easy enough to do, since I have the actual parameters needed to decide, but I’m just not clear on the expectations. Or is there a setup function I’ve missed where I configure a preference? I saw TRITONSERVER_ResponseAllocatorSetQueryFunction, but after setting up a callback with that, I never saw the query callback actually get called.
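For reference, this is roughly how I registered the query callback (signature per tritonserver.h; the body is just a sketch that reports a GPU preference):

```cpp
#include "triton/core/tritonserver.h"

// Query callback registered via TRITONSERVER_ResponseAllocatorSetQueryFunction.
// In my runs it is never invoked.
static TRITONSERVER_Error*
OutputBufferQuery(
    TRITONSERVER_ResponseAllocator* allocator, void* userp,
    const char* tensor_name, size_t* byte_size,
    TRITONSERVER_MemoryType* memory_type, int64_t* memory_type_id)
{
  // Report the memory type the allocator would actually use, e.g. GPU.
  *memory_type = TRITONSERVER_MEMORY_GPU;
  *memory_type_id = 0;
  return nullptr;  // success
}

// TRITONSERVER_ResponseAllocatorSetQueryFunction(allocator, OutputBufferQuery);
```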

Thanks,
Andrew

Hi,

How did you confirm that GPU memory is occupied?

Jetson is a shared-memory system; the CPU and GPU use the same physical memory.
How do you know from jtop that Triton Server is using GPU memory?

Shouldn’t jtop just be reporting system memory?

Thanks

Well, in the GPU tab of jtop I see my process listed, and it shows the process using 1.0G of system RAM and 1.6G of GPU memory (please see the attached screenshot).

So, I’m fairly sure the model is loaded on the GPU, unless I’m totally misinterpreting what jtop is saying.

But I’m happy to use a tool other than jtop if you can suggest one that makes a clearer determination of whether GPU memory is, in fact, being used.

Hi,

Thanks for the feedback.

So when using TRITONSERVER_MEMORY_CPU, the sample works but GPU memory is used.
When testing with other memory types, the sample does not work. Is that correct?

If yes, could you share the source code with us to reproduce it?
What kind of errors do you encounter when using other memory types? Is it a runtime error or a compilation issue?

Thanks.

> So when using TRITONSERVER_MEMORY_CPU, the sample works but GPU memory is used.
> When testing with other memory types, the sample does not work. Is that correct?

No, not exactly. Please see below.

> If yes, could you share the source code with us to reproduce it?

Code is available here: https://github.com/viamrobotics/viam-mlmodelservice-triton/blob/main/src/viam_mlmodelservice_triton_impl.cpp

> What kind of errors do you encounter when using other memory types? Is it a runtime error or a compilation issue?

We do not encounter errors. Everything compiles and runs.

What I want to know is whether it is expected that, regardless of which memory type I use for inputs, output tensor allocations only ever come back from Triton with type CPU, even when the GPU is being used.

It makes me wonder whether I have misconfigured or mis-implemented something, whether Triton Server is expected to always return output tensors in CPU memory rather than GPU (or CPU_PINNED), or whether the behavior is model-dependent and therefore might be a result of the specific model or backend I’m using.

Thanks,
Andrew

Hi,

Which backend do you use?
For some backends, only CPU tensors are supported on Jetson.

Thanks.

Hi, we are using only the TensorFlow backend right now, and we are passing GPU memory for the model inputs.
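Concretely, on the input side we attach GPU-resident buffers along these lines; `request`, `input_ptr`, `input_byte_size`, and the tensor name "INPUT0" are placeholders for our actual values:

```cpp
#include "triton/core/tritonserver.h"

// Sketch: attach a GPU-resident input buffer to an inference request.
static TRITONSERVER_Error*
AppendGpuInput(
    TRITONSERVER_InferenceRequest* request, const void* input_ptr,
    size_t input_byte_size)
{
  return TRITONSERVER_InferenceRequestAppendInputData(
      request, "INPUT0", input_ptr, input_byte_size,
      TRITONSERVER_MEMORY_GPU, 0 /* memory_type_id */);
}
```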
