AWS Instance Overwrites NVIDIA Drivers for Isaac Sim 2020.1

Hi,
I'm having really big problems with the AWS Cloud implementation.

I'm using the AWS omniverse-gpu-headless-ubuntu-2020-02-07 AMI (ami-0892a34ce0ee37099): Ubuntu 18.04 + Docker + NVIDIA driver 440.59 + Amazon ECS, on a g4dn.4xlarge GPU instance.

Setup works fine, the first login works fine, and I can set up NVIDIA Docker and the scripts and run the application.

But I have a problem: the NVIDIA driver gets overwritten after updating the instance, or after rebooting it once the NVIDIA setup is done. So every time I want to use the application I have to set up a new instance and run everything again, hoping it doesn't crash and force a reboot.

  1. When I log in to the new blank instance, I see 118 updates and many security updates/upgrades, so I run

sudo apt-get update -y
sudo apt-get upgrade -y linux-aws

as described in the AWS documentation, then

sudo reboot

Then I get the error: NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
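A quick check that points at what I think is happening (my assumption: the upgraded linux-aws kernel has no NVIDIA module built for it, so the driver cannot load):

```shell
# Check whether an NVIDIA kernel module exists for the *running* kernel.
# If the upgrade installed a new kernel, the module built for the old one
# is not found under the new kernel's module tree.
kernel="$(uname -r)"
echo "running kernel: $kernel"
if ls "/lib/modules/$kernel/kernel/drivers/video/" 2>/dev/null | grep -qi nvidia; then
  echo "nvidia module present for $kernel"
else
  echo "no nvidia module for $kernel - the driver needs to be reinstalled/rebuilt"
fi
```

If the second branch fires, the driver has to be reinstalled or rebuilt against the new kernel before nvidia-smi can work again.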

The same happens after I install the NVIDIA scripts and reboot.

I checked your AWS tutorials and also found this tutorial for updating the drivers:

https://docs.aws.amazon.com/de_de/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver

But this tutorial seems to be outdated: from the step aws s3 cp --recursive s3://ec2-linux-nvidia-drivers/g4/latest/ . onwards it does not work and gives errors.

Does anybody know how to prevent or solve this NVIDIA 440 driver overwriting? It has taken me days so far.
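For reference, one workaround I considered (assuming the new kernel pulled in by the upgrade is what breaks the driver) is pinning the kernel packages so apt cannot replace them. A sketch using an APT preferences file; the package globs are my guess, and the file is written to a temp dir here so the snippet is safe to try:

```shell
# Pin-Priority -1 prevents apt from ever installing matching packages.
# On the real instance the file would go to /etc/apt/preferences.d/ instead;
# `sudo apt-mark hold linux-aws` is a simpler alternative.
pindir="$(mktemp -d)"
cat > "$pindir/hold-kernel.pref" <<'EOF'
Package: linux-aws linux-image-* linux-headers-*
Pin: release *
Pin-Priority: -1
EOF
cat "$pindir/hold-kernel.pref"
```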

Alexander

We have an updated AMI,

Isaacsim-Ubuntu-18.04-GPU-2020-05-14

It’s in the us-west-1 region. Please give this a try.


OK, the problem is solved. AWS sent me a way to reinstall everything.

I am having the exact same problems as AndrewK with the AMI you suggested:
Isaacsim-Ubuntu-18.04-GPU-2020-05-14 - ami-0e1f9ffb088c92836
Ubuntu 18.04 + ECS + Nvidia driver 440.64.00 + Nvidia container runtime

When I first launch the instance, nvidia-smi shows the driver:
NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2

If I update the instance before running Docker, the drivers are gone. I also found that they are gone if I reboot the instance.

The other problem I am having is when I run the Docker command:
sudo docker run --gpus all -e "ACCEPT_EULA=Y" --rm -p 47995-48012:47995-48012/udp -p 47995-48012:47995-48012/tcp -p 49000-49007:49000-49007/tcp -p 49000-49007:49000-49007/udp nvcr.io/nvidia/isaac-sim:2020.1_ea

It runs to a certain point and then the cursor freezes.

2020-06-25 02:34:54 [25,075ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /isaac-sim/_build/target-deps/kit_sdk_release/_build/linux-x86_64/release/resources/icons/details_options.png, Width=63 Height=55
2020-06-25 02:34:54 [25,076ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.PAhjLZ/filter.28x28.png, Width=28 Height=28
2020-06-25 02:34:54 [25,077ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.PAhjLZ/eye_header.28x28.png, Width=28 Height=28
2020-06-25 02:34:54 [25,079ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.PAhjLZ/Plus.20x20.png, Width=20 Height=20
2020-06-25 02:34:54 [25,080ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.PAhjLZ/link.40x40.png, Width=40 Height=40
2020-06-25 02:34:54 [25,081ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.PAhjLZ/Xform.40x40.png, Width=40 Height=38
2020-06-25 02:(CURSOR FROZEN HERE)

Then I have to restart the instance, and when I do that the NVIDIA drivers are missing and I have to start over with a new instance. I have done this 4 times and have yet to get a remote connection. When I run the remote client it says the server URL is invalid, but I think this is because Docker never finishes. Any ideas?

If I don't restart, I reconnect with a new PuTTY terminal and kill the previous container that is locked up:
sudo docker container ls
sudo docker rm -f <container_id>
then launch it again:
sudo docker run --gpus all -e "ACCEPT_EULA=Y" --rm -p 47995-48012:47995-48012/udp -p 47995-48012:47995-48012/tcp -p 49000-49007:49000-49007/tcp -p 49000-49007:49000-49007/udp nvcr.io/nvidia/isaac-sim:2020.1_ea

2020-06-25 03:06:35 [7,685ms] [Warning] [omni.usd] Warning: in _ProcessFil{FROZEN HERE}

Reconnect with a new PuTTY terminal:
sudo docker container ls
sudo docker rm -f <container_id>
sudo docker run --gpus all -e "ACCEPT_EULA=Y" --rm -p 47995-48012:47995-48012/udp -p 47995-48012:47995-48012/tcp -p 49000-49007:49000-49007/tcp -p 49000-49007:49000-49007/udp nvcr.io/nvidia/isaac-sim:2020.1_ea

2020-06-25 03:09:32 [7,556ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload
2020-06-25 03:09:32 [7,557ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload
2020-06-25 03:09:32 [7,558ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload
2020-06-25 03:09:32 [7,559ms] [Warning] [omni.usd] Warning: in _ProcessFile at l(FROZEN HERE]

Reconnect with a new PuTTY terminal:
sudo docker container ls
sudo docker rm -f <container_id>
sudo docker run --gpus all -e "ACCEPT_EULA=Y" --rm -p 47995-48012:47995-48012/udp -p 47995-48012:47995-48012/tcp -p 49000-49007:49000-49007/tcp -p 49000-49007:49000-49007/udp nvcr.io/nvidia/isaac-sim:2020.1_ea

2020-06-25 03:13:05 [7,678ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.woqqgu/Xform.40x40.png, Width=40 Height=38
2020-06-25 03:13:05 [7,678ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.woqqgu/link.40x40.png, Width=40 Height=40
2020-06-25 03:13:05 [7,680ms] [Warning] [omni.usd] Warning: in _ProcessFile at line 240 of /buildAgent/work/da639afa0455b478/USD/pxr/imaging/lib/hio/glslfx.cpp – File doesn’t exist: “”
2020-06-25 03:13:05 [7,680ms] [Warning] [omni.usd] Warning: in _ProcessFil[FROZEN HERE AGAIN)
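Instead of reconnecting and copying the container ID by hand each time, the frozen container can be removed in one line (assumption: the Isaac Sim container is the only one running). A safe demo of the pattern, with echo standing in for docker:

```shell
# On the instance this would be: sudo docker ps -q | xargs -r sudo docker rm -f
# `xargs -r` (GNU) skips the command entirely when no container IDs are listed.
printf '' | xargs -r echo "docker rm -f"    # no IDs -> prints nothing
echo abc123 | xargs -r echo "docker rm -f"  # one ID -> prints: docker rm -f abc123
```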

Hi, does it work before doing the update?

Here's the recommended install order:

  1. Drivers
  2. Docker
  3. NVIDIA-Docker and NVIDIA container runtime
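For step 3, the NVIDIA container runtime is typically registered with Docker in /etc/docker/daemon.json. A sketch of that config, written to a temp dir here so the snippet is safe to run; the runtime path assumes a standard nvidia-container-runtime install:

```shell
# Register "nvidia" as a Docker runtime and make it the default.
# On the instance the file goes to /etc/docker/daemon.json, followed by
# `sudo systemctl restart docker`.
tmp="$(mktemp -d)"
cat > "$tmp/daemon.json" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
cat "$tmp/daemon.json"
```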

You might also try installing those on a base Ubuntu 18.04 AMI instead.

Note that there is a known issue with livestream not working.

No, it has never worked.

I tried several different instances today and could not get it to work.

I went back to the Isaacsim-Ubuntu-18.04-GPU-2020-05-14 AMI, updated it, and ran nvidia-smi:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Then I manually installed the driver:
wget …NVIDIA-Linux-x86_64-440.100.run
sudo sh NVIDIA-Linux-x86_64-440.100.run

sudo docker run … isaac-sim:2020.1_ea
It still locks up in docker here:

2020-06-25 18:23:46 [7,564ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.OI9nxI/Xform.40x40.png, Width=40 Height=38
2020-06-25 18:23:46 [7,565ms] [Info] [rtx.resourcemanager.plugin] ScheduleUpload successfully uploaded /tmp/carb.OI9nxI/link.40x40.png, Width=40 Height=40
2020-06-25 18:23:46 [7,566ms] [Warning] [omni.usd] Warning: in _ProcessFile at line 240 of /buildAgent/work/da639afa0455b478/USD/pxr/imaging/lib/hio/glslfx.cpp – File doesn’t exist: “”
2020-06-25 18:23:46 [7,566ms] [Warning] [omni.usd] Warning: in _ProcessFil{FROZEN)

I previously tried to get it running locally, but I only have a Windows 10 machine with an RTX 2080. I was able to get Docker/GPU working in WSL2, but Isaac Sim would not work. I was hoping AWS would.

Omniverse-Kit-Remote
Thu Jun 25 18:34:06:983 ERROR [BifrostClient: Streamer] {00006700} - updateVideoSettingsForNVbProfile: profile 8 is not handled
Thu Jun 25 18:34:07:089 ERROR [NVST:ClientSession] {00007E10} - Server URL is invalid 52.200.104.216
Thu Jun 25 18:34:07:398 ERROR [NVST:ClientSession] {00007E10} - Number of channels(2) is not valid for surround configuration…

Hi. Sorry for the late reply.

Could you please provide more info like

  • which AMI was used
  • type of instance
  • AWS region
  • a full log at isaac-sim/_build/target-deps/kit_sdk_release/_build/linux-x86_64/release/data/Kit/isaac-sim-headless/<version_number>/omniverse-kit.log

You should see these two lines in the log if Isaac Sim is working in headless mode:

[Info] [carb.livestream.plugin] Stream Server: Net Stream Server Instance Created
[Info] [carb.livestream.plugin] Stream Server: streaming instance started - waiting for a client...

The lock-up and warnings are normal if you see the two lines above in the log.
However, please note that Omniverse Kit Remote and live streaming to an AWS instance may not be working. We are working on a fix.
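A self-contained way to check for those two lines with grep; this demo writes a sample log (contents copied from above) so it runs anywhere, while on the instance you would grep the real omniverse-kit.log path instead:

```shell
# Create a sample log containing the two livestream lines quoted above.
log="$(mktemp)"
cat > "$log" <<'EOF'
[Info] [carb.livestream.plugin] Stream Server: Net Stream Server Instance Created
[Info] [carb.livestream.plugin] Stream Server: streaming instance started - waiting for a client...
EOF
# A count of 2 means both lines are present; prints 2 here.
grep -c 'carb.livestream.plugin' "$log"
```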

Thu Jun 25 18:34:07:089 ERROR [NVST:ClientSession] {00007E10} - Server URL is invalid 52.200.104.216
Thu Jun 25 18:34:07:398 ERROR [NVST:ClientSession] {00007E10} - Number of channels(2) is not valid for surround configuration…

On the current Kit Remote version, those errors are normal too, even when live streaming is working.

For now, if AWS is your only option, I would suggest using a desktop Ubuntu 18.04 AMI and NICE DCV to access the instance remotely, then running the local non-headless Isaac Sim without Docker.

I previously tried to get it running locally, but I only have a Windows 10 machine with an RTX 2080. I was able to get Docker/GPU working in WSL2, but Isaac Sim would not work. I was hoping AWS would.

Isaac Sim is currently not tested/supported in WSL.

Edited: It looks like NVIDIA Docker and GPU support do work in WSL2, but with a Windows 10 version 2004 requirement. We'll take a look at this too.

I'm working through the setup guide. So if I understand correctly, you can't currently use AWS to run Isaac Sim, since you need the live stream to interact with the editor as shown in this guide:

https://docs.omniverse.nvidia.com/robotics/text/Robotics_First_Run.html#

Is that correct? Is there another way to run the sample robotics applications? I'm having trouble finding those guides and ready-made application packages I can just run to test it out.

Hi.

Yes, that is correct. Live streaming with Kit Remote fails only with AWS instances; other setups work.

Here are some of the other methods to set up Isaac Sim:

  1. Local Linux Native
  2. Local Linux in Windowed Container
  3. Remote Linux Container and livestream using Local Kit Remote

Here are guides for the default Robotics Samples:
https://docs.omniverse.nvidia.com/robotics/text/Robotics_Leonardo_Samples.html
https://docs.omniverse.nvidia.com/robotics/text/Robotics_UR10_Samples.html