Customized inference engine using NIM.
Why it is needed: The existing NIM is closed source and rigid with respect to customization.
Inference engine algorithm: NIM uses a general inference engine algorithm (that is, softmax(QK'/sqrt(d))V to get O, then other methods to reduce it to a row vector, and then additional algorithms to obtain one token index for the next output). This need not be the only way to do inference.
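The standard decoding step described above can be sketched as follows. This is a minimal NumPy illustration of the formula in the text, not NIM's actual implementation; greedy argmax selection is shown as one common (but not the only) way to pick the next token.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # O = softmax(QK'/sqrt(d)) V, as in the text.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    return softmax(scores) @ V

def next_token_id(logits_row):
    # Greedy decoding: index of the highest-probability entry.
    return int(np.argmax(softmax(logits_row)))
```

Each of these steps (score computation, softmax, output projection, token selection) is a point where an alternative algorithm could be substituted.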
Customers could prefer alternative algorithms.
But NIM does not provide any mechanism for users to specify custom inference engines.
Advantages of NIM: it has parallelization, reuses precomputed values, and uses optimized cu*-named libraries to exploit GPU features.
Disadvantages of NIM: no flexibility to customize the algorithm.
Possible enhancements:
NIM should give flexibility in the following:
a. which matrices need to be loaded from the GGUF file (not necessarily the standard Mq, Mk, Mv; they could be other custom matrices defined in the GGUF)
b. which matrices need to be precomputed per model/layer/head from the matrices read in step a.
c. the algorithm used to compute the input to softmax (usually Q @ K.T / sqrt(d), but it could be something else, per user expectations)
d. the algorithm used for the softmax computation itself.
e. the algorithm used to consume the output of softmax and produce the output matrix (potentially an embeddings matrix, a probability distribution, or any other custom result)
f. the ability to obtain the next token id from the output of step e.
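One way to picture the requested flexibility is a pipeline where each of steps a-f is an injectable hook. The sketch below is purely illustrative: all class, field, and function names here are hypothetical and are not part of any NIM API.

```python
import numpy as np
from dataclasses import dataclass, field
from typing import Callable, Dict

def default_scores(Q, K):
    # Step c: input to softmax, Q @ K.T / sqrt(d) by default.
    return Q @ K.T / np.sqrt(Q.shape[-1])

def default_softmax(S):
    # Step d: standard numerically stable softmax.
    e = np.exp(S - S.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def default_output(P, V):
    # Step e: consume softmax output to produce the output matrix.
    return P @ V

def default_select(row):
    # Step f: obtain the next token id from the final output row.
    return int(np.argmax(row))

@dataclass
class CustomEngine:
    # Step a: matrices loaded from the model file (names chosen by the user).
    weights: Dict[str, np.ndarray]
    # Steps c-f as replaceable hooks.
    scores_fn: Callable = default_scores
    softmax_fn: Callable = default_softmax
    output_fn: Callable = default_output
    select_fn: Callable = default_select
    # Step b: per-model/layer/head precomputed values.
    cache: Dict[str, np.ndarray] = field(default_factory=dict)

    def step(self, Q, K, V):
        S = self.scores_fn(Q, K)
        P = self.softmax_fn(S)
        O = self.output_fn(P, V)
        return self.select_fn(O[-1])
```

A user wanting a different scoring rule would then pass, for example, `CustomEngine(weights=w, scores_fn=my_scores)` instead of modifying the engine itself.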
The ability to provide customization, either by defining a YAML file or by writing custom C++ classes and methods and using a predefined library to pass these custom method names as input, is the need of the hour.
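As a rough sketch of what such a YAML-based mechanism might look like, here is a hypothetical configuration file. Every key and value here is invented for illustration; NIM defines no such schema today.

```yaml
# Hypothetical customization file; all names are illustrative only.
load_matrices:                       # step a: custom matrices from the GGUF
  - name: M_custom_q
    source: gguf
precompute:                          # step b: per-model/layer/head values
  - name: M_fused
    from: [M_custom_q]
hooks:
  scores: mylib::CustomScores        # step c: custom C++ class
  softmax: builtin::softmax          # step d
  output: mylib::CustomOutput       # step e
  select: builtin::argmax            # step f
```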
I have tried accessing the API, but it seems focused more on deploying the inference engine in the cloud, with no focus on these other aspects.
So I request help: either access to the NIM source code, or an assurance (or a point of contact) from the NIM developer team.