Error in "clara platform start"

I’m starting Clara Platform for the first time, but I encountered an error:

E1103 16:46:09.203699 30215 portforward.go:400] an error occurred forwarding 33695 -> 44134: error forwarding port 44134 to pod a6dc9bfaa181aaeca1fd9deb8609e74256aaba24c2a2d979b697e58d89367b0f, uid : exit status 1: 2020/11/03 16:46:09 socat[30348] E getaddrinfo(“localhost”, “NULL”, {1,2,1,6}, {}): Name or service not known

I edited the ‘/etc/resolv.conf’ file, but the command still didn’t work.
before: ‘nameserver 127.0.1.1’
after: added ‘nameserver 8.8.8.8’
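
Note that the failing lookup in the log above is for “localhost”, which is normally answered from /etc/hosts rather than by the nameservers in /etc/resolv.conf. The following is a minimal sketch, not an official Clara diagnostic, for inspecting both files; the function names list_nameservers and has_localhost_entry are hypothetical helpers introduced here for illustration.

```shell
# Print the addresses of all "nameserver ADDRESS" entries in a
# resolv.conf-style file (default: /etc/resolv.conf).
list_nameservers() {
    file="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" && NF >= 2 { print $2 }' "$file"
}

# Succeed (exit 0) if a hosts-style file (default: /etc/hosts) has a
# non-comment line that maps some address to the name "localhost".
has_localhost_entry() {
    file="${1:-/etc/hosts}"
    awk '$1 ~ /^#/ { next }
         { for (i = 2; i <= NF; i++) if ($i == "localhost") found = 1 }
         END { exit !found }' "$file"
}
```

For example, running list_nameservers with no argument prints the configured nameservers, and has_localhost_entry || echo "no localhost entry in /etc/hosts" reports a missing mapping. Running getent hosts localhost is another quick way to see whether the name resolves at all.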

Please tell me if any additional information is needed to resolve the issue.

Thank you for your interest in Clara Deploy. We would appreciate some additional details about your HW (CPU/GPU, system memory, GPU memory, etc.). Also, assuming you’ve gone through the bootstrap.sh script, could you please share the pertinent logs?

Ubuntu 16.04
(I have to use this version for another project)

cpu processor(=32) * cores(=16)
model name : AMD Ryzen 9 3950X 16-Core Processor
stepping : 0
microcode : 0x8701013
cpu MHz : 2195.032
cache size : 512 KB
cpu cores : 16
apicid : 31
initial apicid : 31
fpu : yes
fpu_exception : yes

GPU
0e:00.0 VGA compatible controller: NVIDIA Corporation GV102 (rev a1)
0f:00.0 VGA compatible controller: NVIDIA Corporation GV102 (rev a1)

$ grep MemTotal /proc/meminfo
MemTotal: 65939156 kB

I couldn’t find the previous installation log, so I thought it would be better to reinstall. However, the same error occurred during the uninstall process.

The error also changed after I edited ‘/etc/resolv.conf’:

(skipped #~)
nameserver 127.0.1.1
nameserver 8.8.8.8 # added

Before the edit, the error was “Temporary failure in name resolution”.

sudo helm version
sudo helm list
Both commands also return the same error.

$ sudo kubectl -n kube-system log tiller
[main] 2020/11/03 06:52:14 Starting Tiller v2.15.2 (tls=false)
[main] 2020/11/03 06:52:14 GRPC listening on :44134
[main] 2020/11/03 06:52:14 Probes listening on :44135
[main] 2020/11/03 06:52:14 Storage driver is ConfigMap
[main] 2020/11/03 06:52:14 Max history per release is 0

I solved the problem, although I don’t know exactly which change fixed it. I did the following:

  1. added the CoreDNS IP to ‘/etc/resolv.conf’
  2. also revised ‘/etc/systemd/system/kubelet.service.d/10-kubeadm.conf’
  3. executed ‘bootstrap.sh’ again

But then I encountered another error:

$ sudo clara platform start
Error: could not find a ready tiller pod
Usage:
platform start [flags]

Flags:
-h, --help help for start

Global Flags:
--config string config file (default is $HOME/.clara/config.yaml)
--verbose verbose output

Error: could not find a ready tiller pod

However, the tiller pod is still in the ‘Running’ state.

Hi Alex,

Thanks for this additional info. It shouldn’t be necessary to modify resolv.conf or kubeadm.conf. The bootstrap.sh script handles core DNS and K8s cluster networking and configures the system so you can use kubectl, helm, and the clara CLI as a regular user.

I’m suspicious that running clara with sudo complicates the config (testing with sudo on my own installation results in different errors).

Another potential complication is any preexisting kubernetes network config. Was kubernetes previously installed on this machine? If so, you will probably need to kubeadm reset, drop any existing K8s network config, flush iptables, and re-bootstrap.

Thanks,
Kris
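
The reset sequence described above (kubeadm reset, drop the existing K8s network config, flush iptables, re-bootstrap) could be sketched roughly as follows. This is an illustration under stated assumptions rather than an official procedure: it assumes the CNI configuration lives in the default /etc/cni/net.d directory and that bootstrap.sh is in the current directory. The helper names run and reset_k8s are hypothetical.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

reset_k8s() {
    # Tear down the cluster state created by kubeadm (asks for confirmation).
    run sudo kubeadm reset
    # Assumption: CNI network config is in the default location.
    run sudo rm -rf /etc/cni/net.d
    # Flush the filter, nat, and mangle tables, then delete custom chains.
    run sudo iptables -F
    run sudo iptables -t nat -F
    run sudo iptables -t mangle -F
    run sudo iptables -X
    # Assumption: bootstrap.sh is in the current directory.
    run ./bootstrap.sh
}
```

Calling DRY_RUN=1 reset_k8s prints the commands without executing them, which is worth doing first because the iptables flush removes all existing firewall rules on the machine, not only those created by Kubernetes.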

I removed kubernetes, kubelet, and kubeadm, flushed iptables, and executed bootstrap.sh again.

But I encountered the same errors.
I also restored the config files to their original state.

$ sudo clara platform start

sudo su clara platform start
Usage:
platform start [flags]

Flags:
-h, --help help for start

Global Flags:
--config string config file (default is $HOME/.clara/config.yaml)
--verbose verbose output

Error: forwarding ports: error upgrading connection: pods “tiller-deploy-659c6788f5-n8jlw” is forbidden: User “system:node:nanoelfin” cannot create resource “pods/portforward” in API group “” in the namespace “kube-system”
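
The user name in this error, system:node:nanoelfin, has the form Kubernetes uses for node (kubelet) credentials, which suggests the command picked up a node kubeconfig rather than an admin one. A minimal sketch for checking which identity kubectl is using and whether it is allowed to port-forward in kube-system follows; the helper names run and check_identity are hypothetical, and why the node credentials were selected on this machine is not established in the thread.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

check_identity() {
    # Which context and kubeconfig user are in effect for this shell?
    run kubectl config current-context
    run kubectl config view --minify -o 'jsonpath={.users[0].name}'
    # Is that user allowed to create pods/portforward in kube-system?
    run kubectl auth can-i create pods --subresource=portforward -n kube-system
}
```

Running check_identity once as a regular user and once from a root shell makes it possible to compare the two environments, since the errors in this thread were produced with sudo.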

Should I remove helm and run bootstrap.sh again?
I can’t uninstall it cleanly, because with this error no ‘helm’ command works.

So I ran:
rm -rf /usr/local/bin/helm
./bootstrap.sh

That did not solve it, and eventually there were more errors:

$ clara pull platform
Error: Looks like “https://helm.ngc.nvidia.com/nvidia/clara” is not a valid chart repository or cannot be reached: Failed to fetch https://helm.ngc.nvidia.com/nvidia/clara/index.yaml : 401 Unauthorized

Sorry for the trouble with this. Could you share the state of Kubernetes using the following commands?
kubectl get pods -o wide -A
kubectl get deployments -A
kubectl get svc -A

Thanks,
Kris

Hi kkersten,

I have this problem too. I am pretty sure that my API key and username, both generated by the NGC setup, are correct. When I run “clara pull platform”, it returns: Error: Looks like “https://helm.ngc.nvidia.com/nvidia/clara” is not a valid chart repository or cannot be reached: Failed to fetch https://helm.ngc.nvidia.com/nvidia/clara/index.yaml : 401 Unauthorized
There are some states:

I solved the other problem, but I still get:

$ clara pull platform
Error: Looks like “https://helm.ngc.nvidia.com/nvidia/clara” is not a valid chart repository or cannot be reached: Failed to fetch https://helm.ngc.nvidia.com/nvidia/clara/index.yaml : 401 Unauthorized

I guess I made a mistake in the command below.

$ clara config --key API_KEY --orgteam nvidia/clara [--username USERNAME] [--name SECRET_NAME] -y

What exactly do ‘USERNAME’ and ‘SECRET_NAME’ mean?

I entered them like this:
USERNAME: NGC account name (not email)
SECRET_NAME: an arbitrary string

Hi Alex,

Glad to see you were able to resolve the Kubernetes issue.

For the NGC configuration, you can omit username and secret name. Try using just:
$ clara config --key <your NGC API KEY> --orgteam nvidia/clara

Thanks,
Kris

Hi kkersten,

I tried the command you shared above for the NGC configuration, but it still fails when I try to pull the platform. I just generated a new API key, so the key should not have expired.

Same here.

By the way, are NGC_API_KEY and NGC_CLI_API_KEY separate? While looking for a solution, I found the NGC Registry CLI, but when I used the API key from Setup, the following error occurred.

“NGC_API_KEY is a deprecated environment variable. Please use NGC_CLI_API_KEY instead.”
Error: Invalid apikey

Hi Alex, jingyq1,

There was an issue with permissions in the nvidia/clara repository on NGC. This is now resolved.

If you are still having issues accessing the platform helm chart, you may need to delete the Kubernetes secret and re-run
clara config --key <your NGC API KEY> --orgteam nvidia/clara

To find the existing K8s secret, run kubectl get secrets and look for the secret named ngc-clara, of type kubernetes.io/dockerconfigjson. Delete it using kubectl delete secret ngc-clara. Note that ngc-clara is the default name; this could be different depending on your config.

Thanks,
Kris
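
The secret refresh described in the last reply could be scripted roughly as below. This is a sketch, assuming the secret lives in the current namespace and uses the default name ngc-clara unless another name is passed; kubectl get secrets is the standard command for listing secrets. The helper names run and refresh_ngc_secret are hypothetical.

```shell
# Print each command; execute it only when DRY_RUN is not set to 1.
# Note: this echoes the API key to the terminal.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

refresh_ngc_secret() {
    api_key="$1"
    secret_name="${2:-ngc-clara}"   # default name per the reply above
    if [ -z "$api_key" ]; then
        echo "usage: refresh_ngc_secret NGC_API_KEY [SECRET_NAME]" >&2
        return 2
    fi
    # List existing secrets so the name and type can be confirmed first.
    run kubectl get secrets
    run kubectl delete secret "$secret_name"
    # Recreate the secret with the current key.
    run clara config --key "$api_key" --orgteam nvidia/clara
}
```

For example, DRY_RUN=1 refresh_ngc_secret "<your NGC API KEY>" prints the three commands without running them; dropping DRY_RUN=1 executes them, after which clara pull platform can be retried.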