I’ve implemented an allocation callback that handles all the cases of TRITONSERVER_MemoryType, much like simple.cc does.
However, I’ve noticed that after registering this allocation function with TRITONSERVER_ResponseAllocatorNew, it only ever seems to be called with TRITONSERVER_MEMORY_CPU as the value of its memory_type parameter.
When the server loads/executes the model I’m using I do see GPU memory being used and GPU load in jtop, so I think everything is working OK.
But it leaves me curious: is the idea that the Triton Server implementation will decide on my behalf which memory type/device is best given the model and available resources, and that in my particular setup it just happens to consider CPU allocations preferable? Or will the callback always be called with TRITONSERVER_MEMORY_CPU as the default, with the callback implementation expected to make the choice itself? That’s easy enough to do, since the callback receives the parameters needed to decide; I’m just not clear on the expectations. Or is there a setup function I’ve missed where I configure a preference? I saw TRITONSERVER_ResponseAllocatorSetQueryFunction, but after registering a callback with it, I never saw the query callback actually get called.
What kind of errors do you encounter when using other memory types? Is it a runtime error or a compile-time issue?
We do not encounter errors. Everything compiles and runs.
I want to know whether it is expected that, regardless of which memory type I use for inputs, output tensor allocations always and only come back from Triton with type CPU, even while the GPU is clearly in use.
It makes me wonder whether I have misconfigured or mis-implemented something; whether Triton Server is simply expected to always return output tensors in CPU memory rather than GPU (or CPU_PINNED); or whether the behavior is model-dependent, in which case it might come down to the specific model or backend I’m using.