There is an indirect but essential connection between the Fine-Tuning Microservice (FTMS) API and the NVIDIA “gpu-operator” pod in a Kubernetes environment such as Amazon EKS.
The FTMS API provides the interface and orchestration layer to manage fine-tuning jobs, handling API requests and scheduling tasks.
The NVIDIA GPU Operator is a Kubernetes operator (it runs in the cluster as the “gpu-operator” pod, among others) responsible for managing GPU resources on the cluster nodes. It installs and maintains the NVIDIA driver, CUDA support, the device plugin, and the other GPU software needed to run GPU workloads on those nodes.
The FTMS API itself does not run on GPU nodes, nor does it interact directly with the gpu-operator pod. Instead, it submits fine-tuning tasks as Kubernetes pods that request GPU resources.
The gpu-operator ensures those GPU resources are available and properly configured on the nodes, so the GPU-requesting pods scheduled by the FTMS API land on GPU-enabled nodes and run smoothly.
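For illustration, a fine-tuning pod submitted by the FTMS API would carry a GPU request along these lines (the pod name and container image here are placeholders; the nvidia.com/gpu resource is what the GPU Operator’s device plugin advertises on each node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fine-tune-job                         # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3 # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                   # only schedulable on nodes prepared by the GPU Operator
```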
In summary, the FTMS API depends on the presence and functionality of the gpu-operator-managed GPU nodes in the cluster to run GPU-accelerated workloads, but the two operate at different layers:
FTMS API: application-level orchestration of ML fine-tuning tasks.
gpu-operator pod: cluster-level management of GPU resources and drivers.
They work together to enable GPU compute on EKS but have no direct interaction beyond this division of labor: by managing the underlying GPU infrastructure, the gpu-operator lets the pods scheduled by the FTMS API actually use GPU resources.
Use the FTMS API Helm chart instructions to deploy and manage your fine-tuning microservice.
Use the GPU Operator installation instructions to set up the GPU infrastructure so Kubernetes can schedule GPU workloads.
They complement each other in your Kubernetes cluster to enable fine-tuning jobs on GPUs.
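As a rough sketch of that pairing (the GPU Operator commands mirror NVIDIA’s published Helm workflow; the FTMS chart’s repo and name come from its own instructions and are not reproduced here):

```sh
# GPU infrastructure first: install the NVIDIA GPU Operator from its Helm repo.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Then install the FTMS API chart the same way, substituting the repo and
# chart name given in the FTMS Helm chart instructions.
```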
The overrideBootstrapCommand failed because /etc/eks/bootstrap.sh doesn’t exist in this AMI.
I tried several combinations of ami and amiFamily across multiple AMIs, and none of them contained /etc/eks/bootstrap.sh. I don’t know how to identify the correct AMI, since there are no obvious choices in the AMI marketplace. It appears you must either set both of these fields or omit both and let AWS decide based on the instanceType. When you let AWS decide, you can also omit overrideBootstrapCommand, which might simplify the configuration file to simply:
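A minimal sketch of what that simplified config might look like (cluster name, region, and sizes are placeholders; the point is that ami, amiFamily, and overrideBootstrapCommand are all omitted so eksctl selects an EKS-optimized GPU AMI from the instanceType):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: tao-ftms        # placeholder
  region: us-west-2     # placeholder
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 2
    # no ami / amiFamily / overrideBootstrapCommand:
    # eksctl picks an EKS-optimized (GPU) AMI based on instanceType
```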
Here are some relevant GitHub threads I found while sifting through the AWS documentation; your team might find them important:
After this cluster was created, I reviewed the Next Steps at the bottom:
This then points me to installing the GPU Operator:
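(The standard install is the Helm workflow sketched earlier.) For reference, one way to confirm the operator came up cleanly, assuming it was installed into the default gpu-operator namespace:

```sh
# All operator pods should reach Running/Completed.
kubectl get pods -n gpu-operator

# Each GPU node should now advertise the nvidia.com/gpu resource.
kubectl describe nodes | grep -A2 'nvidia.com/gpu'
```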
After this, I attempted to resume the original TAO FTMS instructions:
Then I ran into what looked like networking issues. I’m not sure, but the “tao-client login” command was failing, and the Kubernetes logs did not show a failed HTTP request, so the client was sending a request and failing somewhere, apparently not against the correct URL / endpoint.
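One way to rule the cluster side in or out is to poke the API endpoint directly, bypassing tao-client. The service name tao-api and the /api path below are guesses; substitute whatever the chart actually created:

```sh
# Find the service the chart exposed; name and namespace vary by chart.
kubectl get svc --all-namespaces | grep -i tao

# Resolve its external endpoint and probe it directly.
BASE_URL="http://$(kubectl get svc tao-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
curl -v "$BASE_URL/api"
```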
Also, the TAO API was running on the g4dn.xlarge nodes in the cluster and using most of their RAM (only 3 GB of 16 GB free on one node).
Should I have spun up a generic CPU-only node group (something like a c5.large) alongside the GPU node group? It’s not clear from the NVIDIA documentation how I would indicate where the pods should execute.
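If the answer is yes, one standard Kubernetes way to express the placement is a nodeSelector on the non-GPU pods, keyed on the label EKS applies to each managed node group (the node group name cpu-nodes here is an assumption):

```yaml
# In the pod spec (or in the chart's values, if it exposes a nodeSelector):
nodeSelector:
  eks.amazonaws.com/nodegroup: cpu-nodes   # assumed CPU node group name
```

The complementary mechanism is a taint on the GPU node group plus matching tolerations on the fine-tuning pods, which keeps non-GPU workloads like the API server off the GPU nodes entirely.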