There is an indirect but essential connection between the Fine-Tuning Microservice (FTMS) API and the NVIDIA “gpu-operator” pod in a Kubernetes environment such as Amazon EKS.
The FTMS API provides the interface and orchestration layer to manage fine-tuning jobs, handling API requests and scheduling tasks.
The NVIDIA GPU Operator is a Kubernetes operator (it runs in the cluster as the “gpu-operator” pod, among others) responsible for managing GPU resources on the cluster nodes. It installs and maintains the NVIDIA driver, CUDA support, the device plugin, and the other GPU software needed to run GPU workloads on those nodes.
The FTMS API itself does not run on GPU nodes, nor does it interact directly with the gpu-operator pod. Instead, it submits fine-tuning tasks as Kubernetes pods that request GPU resources.
The gpu-operator ensures those GPU resources are available and properly configured on the nodes, so the GPU-requesting pods scheduled by the FTMS API land on GPU-enabled nodes and run smoothly.
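For illustration, a fine-tuning pod submitted by the FTMS API would carry a GPU request along these lines (the pod name and container image here are placeholders; the nvidia.com/gpu resource is what the GPU Operator’s device plugin advertises on each node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fine-tune-job                         # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:24.01-py3 # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1                   # only schedulable on nodes prepared by the GPU Operator
```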
In summary, the FTMS API depends on the presence and functionality of the gpu-operator-managed GPU nodes in the cluster to run GPU-accelerated workloads, but the two operate at different layers:
FTMS API: application-level orchestration of ML fine-tuning tasks.
gpu-operator pod: cluster-level management of GPU resources and drivers.
They work together to enable GPU compute on EKS but have no direct interaction beyond this division of labor: by managing the underlying GPU infrastructure, the gpu-operator lets the pods scheduled by the FTMS API actually use GPU resources.
Use the FTMS API Helm chart instructions to deploy and manage your fine-tuning microservice.
Use the GPU Operator installation instructions to set up the GPU infrastructure so Kubernetes can schedule GPU workloads.
They complement each other in your Kubernetes cluster to enable fine-tuning jobs on GPUs.
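As a rough sketch of that pairing (the GPU Operator commands mirror NVIDIA’s published Helm workflow; the FTMS chart’s repo and name come from its own instructions and are not reproduced here):

```sh
# GPU infrastructure first: install the NVIDIA GPU Operator from its Helm repo.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace

# Then install the FTMS API chart the same way, substituting the repo and
# chart name given in the FTMS Helm chart instructions.
```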
The overrideBootstrapCommand failed because /etc/eks/bootstrap.sh doesn’t exist in this AMI.
I tried several combinations of ami and amiFamily across multiple AMIs, and none of them contained /etc/eks/bootstrap.sh. I don’t know how to identify the correct AMI, since there are no obvious choices in the AMI marketplace. It appears you must either set both of these fields or omit both and let AWS decide based on the instanceType. When you let AWS decide, you can also omit overrideBootstrapCommand, which might simplify the configuration file to simply:
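A minimal sketch of what that simplified config might look like (cluster name, region, and sizes are placeholders; the point is that ami, amiFamily, and overrideBootstrapCommand are all omitted so eksctl selects an EKS-optimized GPU AMI from the instanceType):

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: tao-ftms        # placeholder
  region: us-west-2     # placeholder
managedNodeGroups:
  - name: gpu-nodes
    instanceType: g4dn.xlarge
    desiredCapacity: 2
    # no ami / amiFamily / overrideBootstrapCommand:
    # eksctl picks an EKS-optimized (GPU) AMI based on instanceType
```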
Here are some relevant GitHub threads I found while sifting through the AWS documentation; your team might find them important:
After this cluster was created, I reviewed the Next Steps at the bottom:
This then points me to installing the GPU Operator:
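(The standard install is the Helm workflow sketched earlier.) For reference, one way to confirm the operator came up cleanly, assuming it was installed into the default gpu-operator namespace:

```sh
# All operator pods should reach Running/Completed.
kubectl get pods -n gpu-operator

# Each GPU node should now advertise the nvidia.com/gpu resource.
kubectl describe nodes | grep -A2 'nvidia.com/gpu'
```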
After this, I attempted to resume the original TAO FTMS instructions:
Then I ran into what looked like networking issues. I’m not sure, but the “tao-client login” command was failing, and the Kubernetes logs did not show a failed HTTP request, so the client was sending a request and failing somewhere, apparently not against the correct URL / endpoint.
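One way to rule the cluster side in or out is to poke the API endpoint directly, bypassing tao-client. The service name tao-api and the /api path below are guesses; substitute whatever the chart actually created:

```sh
# Find the service the chart exposed; name and namespace vary by chart.
kubectl get svc --all-namespaces | grep -i tao

# Resolve its external endpoint and probe it directly.
BASE_URL="http://$(kubectl get svc tao-api \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
curl -v "$BASE_URL/api"
```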
Also, the TAO API was running on the g4dn.xlarge nodes in the cluster and using most of their RAM (only 3 GB of 16 GB free on one node).
Should I have spun up a generic CPU-only node group (something like a c5.large) alongside the GPU node group? It’s not clear from the NVIDIA documentation how I would indicate where the pods should execute.
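If the answer is yes, one standard Kubernetes way to express the placement is a nodeSelector on the non-GPU pods, keyed on the label EKS applies to each managed node group (the node group name cpu-nodes here is an assumption):

```yaml
# In the pod spec (or in the chart's values, if it exposes a nodeSelector):
nodeSelector:
  eks.amazonaws.com/nodegroup: cpu-nodes   # assumed CPU node group name
```

The complementary mechanism is a taint on the GPU node group plus matching tolerations on the fine-tuning pods, which keeps non-GPU workloads like the API server off the GPU nodes entirely.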