Installing Kit 106 on an 8-GPU L40S AWS instance succeeds, but it does not launch

Operating System:
Windows
Linux

Kit Version:
107
106
105

Kit Template:
USD Composer
USD Explorer
USD Viewer
Custom

GPU Hardware:
A series (Blackwell)
A series (ADA)
A series
50 series
40 series
30 series

GPU Driver:
Latest
Recommended (573.xx)
Other: 552.74

Installing Kit 106 on an 8-GPU L40S AWS instance works, but it does not launch.

Here are the error messages from the command prompt:

Pekka

I updated to the latest drivers, but then neither Kit 106 nor Composer would launch at all.

Here are the error messages:

So I downgraded the drivers to 552.74.

But that did not help; the same error message appears.

That is strange, since I have another VM with just one GPU, and driver version 552.74 works there.

I used this driver to downgrade:

Please help me get Kit 106 running on this multi-GPU L40S instance at AWS.

Pekka

It’s not a Kit problem; it’s the way you have (or have not) set up the server. The log says you are running all your GPUs in the wrong mode: you have ECC ON, which we recommend turning off.

You are going to have to call whoever you are renting the server from and get it converted over for proper GPU acceleration.

And I would recommend either the correct official driver, 537.70, or the latest. One or the other.

Also you have several other error messages in here. Your material database is maxed out and overflowing. Are you trying to load a really large scene?

Here’s what I would try

  1. talk to whoever owns the server and get all the GPUs set to ECC = OFF
  2. update to the latest driver
  3. make sure all the GPUs are in WDDM mode
  4. verify everything with nvidia-smi
  5. just start Composer and see if it even boots up with that many cards. 8 is a lot. Just so you know, we do not really take advantage of that many cards in parallel for realtime rendering, just for path tracing.
  6. if it starts up, try some basic scenes. Increase the scene size and complexity slowly to see what it can handle in parallel. REMEMBER: more cards do not always equal more performance. Each video card has to load the scene and the materials independently. That is a LOT of memory that has to run through the CPU.
  7. if you want to try a larger scene with “materials loading disabled”, that would help confirm it.
  8. if all else fails, it may be a hardware issue with one of the GPUs. You can try disabling multi-GPU in Kit, or disable each GPU one at a time in Device Manager until the Kit app stops crashing.
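
For steps 1, 3, 4 and 8, a few standard nvidia-smi commands cover the checks; a sketch, to be run from an elevated command prompt (the Kit multi-GPU setting in the last line is a commonly used flag, so verify it against your Kit version):

```shell
# Step 4: verify ECC mode and driver model for every GPU
nvidia-smi --query-gpu=index,name,ecc.mode.current,driver_model.current --format=csv

# Step 1: disable ECC on all GPUs (takes effect after a reboot)
nvidia-smi -e 0

# Step 3: switch GPU 0 to WDDM mode (0 = WDDM, 1 = TCC); repeat per index
nvidia-smi -i 0 -dm 0

# Step 8 fallback: launch Kit with multi-GPU disabled
# (setting name may vary by Kit version)
kit.exe --/renderer/multiGpu/enabled=false
```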

Thanks!

I will take care of steps 1-4 with AWS support. While we wait for that to happen, I will add this information here:

I have Composer 2023.2.5 running and rendering just fine on exactly the same kind of 8 x L40S VM on AWS.

The issue is that after I realized that Kit 106 would not boot up (the first message in this topic), I updated the drivers on this VM.

Then I downgraded them, but even Composer still will not run; I get that error about “Failed to create any GPU devices…”
So it looks like installing / re-installing the NVIDIA drivers reset those values from steps 1-4.
Maybe…

Pekka

Hello Richard!

This has been a very long journey with AWS support, and now I am back here for you.

So I started working on the settings you gave me above, together with AWS support. We did all sorts of tricks to disable ECC, but those tricks had to be done from the command prompt for each GPU (8 in this case), and a reboot was required every time, until it finally looked like this:

Attached GPUs : 8
GPU 00000000:9E:00.0
    ECC Mode
        Current : Disabled
        Pending : Disabled
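
The per-GPU command-prompt work described above can be scripted; a sketch (with echo so it only prints the commands; remove echo to actually run them from an admin prompt, then reboot once):

```shell
# Print the ECC-disable command for each of the 8 GPUs.
# Remove 'echo' to actually run them, then reboot once afterwards.
for i in 0 1 2 3 4 5 6 7; do
  echo nvidia-smi -i "$i" -e 0
done
```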

Then, after realizing how difficult and slow this was for me, AWS support decided to go this way:

" My name is Sumit.
I tried the installation of Omniverse USD composer and was able to install and launch successfully on g6e.8xlarge.
Initially, when the NVIDIA drivers were not installed, I got the same error as you. After installing the GRID drivers available from AWS, the installation went through. Can you install the latest NVIDIA GRID drivers following the article below and try installing Composer again? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html#nvidia-GRID-driver
"

Richard, I tried to follow those docs, but I ran into an issue:
no credentials configured to access the S3 bucket

That was finally solved by another method: assigning an instance profile.

So I did that by creating a role and attaching it to my VM; with that I could use PowerShell to access the buckets.
Then I installed the GRID drivers that Sumit from AWS support asked me to install.

Composer runs fine, but with very poor multi-GPU performance.
Check that out:

Kit 106 is not booting up at all…
That is the only reason I would need these multiple GPUs.

Kit 107 boots, but multi-GPU performance is still bad with it as well.
A new issue arose while testing Kit 107 rendering:
horizontal lines. Usually they are fixed by deactivating the cache in Path Traced rendering, but now that does not work.

See the video:

I also made test renders to see if it really is there, and yes, the lines are there:

So are these GRID drivers OK for me?
If they are, what should I do to improve the performance and solve the line-rendering issue?

Pekka

Pekka, that is not really how multi-GPU works. You cannot give 8 GPUs a test scene that literally takes one GPU 3 seconds to render; there is no point. The other GPUs never get a chance to even start the job. With multi-GPU, you have to make it worthwhile. Just copying the raw data from one GPU to the other 7 takes a few seconds.

Give it a proper scene to test, something that takes at least 8 minutes to render per frame. That should then ideally render at 1-1.5 minutes a frame on 8 GPUs. If you are not running them all for at LEAST 30 seconds, it is not worth it; the cycle-up and cycle-down time alone is 5 seconds.
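
The scaling claim above can be sanity-checked with a rough back-of-the-envelope model, per-frame time = overhead + single-GPU time / GPU count, using the numbers from the post (~5 s cycle-up/down overhead, 8 min single-GPU frame); this is a simplification that ignores scene-copy time:

```shell
# Rough multi-GPU scaling estimate (assumed model, not a measurement):
# time per frame = overhead + single_gpu_time / gpus
awk 'BEGIN {
  single = 480;   # 8 min single-GPU frame, in seconds
  overhead = 5;   # cycle-up/cycle-down cost per frame
  for (n = 1; n <= 8; n *= 2)
    printf "%d GPUs: %.0f s/frame\n", n, overhead + single / n
}'
```

The 8-GPU figure of 65 s (about 1.1 min) lands inside the 1-1.5 min/frame range quoted above.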

People assume more power means more speed. I have said this before: no. If you put 8 engines in your car, it does not mean you can drive 8 times as fast.

Try again with heavy, long scenes. Obviously this is just for path tracing. Realtime will not use all of those; maybe the first 2 GPUs.

You are always better off renting 4 servers with 2 GPUs per server than a single server with 8 GPUs. It is very hard to get all that performance out of one machine unless it is our own NVIDIA OVX cluster. You have no idea how this machine is configured; it could have the cheapest, slowest RAM, CPU and hard drive you can get, and the motherboard is likely maxed out already. Building such a massive machine is a massive task. There are very few true 4-way machines, let alone 8-way machines. It is not as simple as putting 8 powerful GPUs in a machine and expecting perfect 8-way performance. And honestly, if you are not getting that performance, it is likely a hardware issue on AWS’s side.

You can split the work out more efficiently across regular machines with 1 or 2 GPUs. That is why we have render farms. If you have 8000 frames to render, you give each machine 1000 frames to render. You still get the work done 8 times faster, but each card is maxed out at 100% the whole time.
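
The farm split described above is simple arithmetic; a sketch, using the frame count and node count from the post (the node names and echo output are illustrative, not an actual farm-manager command):

```shell
# Split 8000 frames evenly across 8 render nodes, 1000 frames each
TOTAL=8000
NODES=8
PER=$((TOTAL / NODES))
for n in $(seq 0 $((NODES - 1))); do
  start=$((n * PER))
  end=$((start + PER - 1))
  echo "node $n: frames $start-$end"
done
```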

If you are renting this machine from AWS, maybe look at ending that and switching to a cluster of machines, each with 2 GPUs.

In regards to the horizontal lines: yes, that is bad. Really bad. Our apologies. That is a possible bug with multi-GPU if you don’t have the right driver. We need to look into that seriously. You are saying that it does not happen with Kit 106, but it does with Kit 107? What exact version? What exact version of the video drivers?

Can you disable 6 of the GPUs in Device Manager and try that again with just 2 GPUs? Then keep adding GPUs until you get the lines.

Thanks for a long and complete explanation, Richard.

I need this only for path-traced character renders, which I can easily combine with RTX real-time renders. So the issue with Kit 106 not booting up at all is a serious problem, and it forces me to really start using a render farm.

Before I start that journey, I will gather all the information about the issue for you. Do you want me to make another topic about this, or can I continue the discussion here in this topic?

At least the title would still be correct…

But now that you have Kit 107 working on it, what is the issue with Kit 106? I have never heard of a difference between the two; if one works, the other should work. Show me a video of both 107 booting up and 106 booting up.

Yes, I shall provide that video, logs and all the info tomorrow…

My feeling is you do not need an 8-GPU machine. You would be better off renting or building a small render farm, with 1-2 GPUs per machine at most.

Also, I really think you need to move toward Realtime 2.0 for characters and get away from path tracing. I have tested characters here with RT2.0 and they look better than path tracing. To be honest, I have never understood why you need to use path tracing. We have so many demos of characters that run perfectly at 60 frames a second in real time.

Amazon offers an OFFICIAL EC2 installation of NVIDIA Omniverse. It is a specific installation that you request. It comes preconfigured with all the right GRID drivers, all the right software, and all the settings perfect.

As I said before, if this server is not working for you, and Amazon cannot help you, I would simply return this server and stop renting it. Then I would get an official AWS Omniverse EC2 configured setup; everything should then run flawlessly. Stick with a 2-GPU or 4-GPU system that is not “bare metal” set up by you, but officially managed through our NVIDIA-Amazon partnership.

Another issue I have mentioned before: you are trying to make a workflow work across three separate versions of Kit (2023.2.5, Kit 106.5 and Kit 107). It is very hard to diagnose any of this with you changing versions.

If 2023.2.5 works, then stick with that. I also think you need to go Enterprise; you are doing an advanced workflow.

This is what you want. An official AMI install

And this

Super. So now we know what to do with AWS.
I have spent days figuring out all those settings…

I have been aware of these official AMI instances, but when I tried to create them they were not compatible with Inception Program credits. That was 2-3 years ago when we started, though. I will check it out now; I hope they have changed the rules!

Anyway, the credits will expire soon and we will have to use real money anyway 😅

About the use of path tracing on multi-GPU, you are totally right. I believe some customers really need that speed and do not care about the extra cost, but most of them are not happy burning that much money.

Yesterday I also checked my current character work in RTPT, and it looked pretty good in its own lights, since the scene I had open was the character-only light setup. I believe you really can work with extremely good digital human avatars; next-level stuff!!

Do you have any estimate of when RTPT will support movie export?

I will try to get Cineshare onto Enterprise this fall. I understand we really need it.

The reasons I use Kit 106 are:

1 - Blendshapes

Has this been reported as a bug?
Should I report it? You told me you already work with characters in RTPT, and I guess you use Kit 107?

2 - share files between Composer and Isaac Sim

I am not sure what you mean. The link is fine. AWS Marketplace: NVIDIA Omniverse™ Enterprise Workstation (Windows)

You have to use the ENTERPRISE link. The Workstation one is for training only, not for commercial use. You really need to be on OV Enterprise.

We already solved facial blendshapes in Kit 106.5 and Kit 107. I gave you the solution and you said it worked.