Triton server

Is it possible to run custom CUDA kernels via Triton, such that, for example, on Ampere I'd be able to run multiple different CUDA kernels on the different MIG partitions without worrying about threads/streams/etc.?


For those who like me have to consult the internet to disambiguate these terms, I think the following is being referred to in the question:

MIG = Multi-Instance GPU (MIG) :: CUDA Toolkit Documentation
Triton = NVIDIA Deep Learning Triton Inference Server Documentation

I’m not aware that Triton supports this kind of activity “natively”. You can write your own Triton backend, however, so it may be possible to do something like that. You would probably still have to cook up a model definition, and there would be other aspects of adherence to the Triton architecture that would make this look quite a bit different from just launching kernels in CUDA code.
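To make the "cook up a model definition" part concrete, here is a minimal sketch of what a `config.pbtxt` for a model served by a custom backend could look like. The backend name, model name, and tensor names below are purely illustrative assumptions, not anything from this thread:

```
# Hypothetical model definition for a custom C++ backend
# that wraps your own CUDA kernel launches.
name: "custom_kernel_model"
backend: "my_cuda_backend"   # illustrative: name of your custom backend library
max_batch_size: 0
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]
```

The custom backend itself would implement the Triton backend C API (model/instance initialization and an execute function where the kernels are actually launched), which is where the "looks quite different from just launching kernels" part comes in.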

Thanks @Robert_Crovella. The reason I’m asking is that it occurred to me that if this were possible (or implemented by NVIDIA), I would basically get multi-stream/multi-kernel execution on the same GPU, across multiple GPUs, or across multiple machines for my CUDA kernels out of the box, wouldn’t I?

I guess I could also write a dummy DL model with just one dummy plugin layer, where the plugin actually calls my own CUDA code…
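As a hedged sketch of that dummy-model idea, assuming the plugin route is taken via a TensorRT engine containing a single custom plugin layer (the model name, tensor names, and plugin library path are illustrative assumptions):

```
# Hypothetical config.pbtxt for a TensorRT engine whose only
# layer is a custom plugin wrapping your CUDA code.
name: "dummy_plugin_model"
platform: "tensorrt_plan"
max_batch_size: 1
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
```

The custom plugin’s shared library would then be made visible to the server at launch time, e.g. by preloading it (`LD_PRELOAD=/path/to/libmyplugin.so tritonserver --model-repository=...`, with the paths being illustrative), so that TensorRT can deserialize the engine containing the plugin layer.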

just a thought… :)