Worker out of memory

Hello, I am running the AODT simulation on Ubuntu 22.04.4.
The machine is a Dell Workstation 7920T with Quadro RTX 8000 GPUs, driver version 535.183.

The device list from nvidia-smi is like this one:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:17:00.0  On |                  Off |
| 71%   85C    P2             187W / 260W |   4501MiB / 49152MiB |     54%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                Off | 00000000:73:00.0 Off |                  Off |
| 34%   41C    P8              22W / 260W |   2658MiB / 49152MiB |     16%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000                Off | 00000000:D5:00.0 Off |                  Off |
| 33%   33C    P8              13W / 260W |   2658MiB / 49152MiB |     17%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

I followed the tutorial from the documentation to run the tokyo.usd simulation with 1 RU and 1 UE. However, whenever I start UE mobility, a few seconds later Worker(1) reports an error: “Problem initializing EM solver: out of memory. Recommend checking your scenario settings and GPU memory usage on compute node.”

Also, when I connect to the worker through the gear icon, it shows a warning: “warning: can’t initialize nvml”.

Any idea what the problem could be? I started from a fresh installation of Ubuntu, and the GPUs seem to have plenty of memory (48 GB each).
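As a side note, a quick way to spot latent memory usage across GPUs is to parse the CSV output of nvidia-smi's query mode. This is just an illustrative sketch, not part of AODT: it uses a sample captured from the table above so it runs without a GPU, and the live query command is shown in a comment.

```python
# Sketch: compute free memory per GPU from nvidia-smi CSV output.
# On a live machine, capture the data with:
#   nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv,noheader,nounits
# SAMPLE below mirrors the figures posted above, so the script runs anywhere.
SAMPLE = """\
0, 4501, 49152
1, 2658, 49152
2, 2658, 49152
"""

def parse_gpu_memory(csv_text):
    """Return a list of (index, used_mib, total_mib) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows

def free_mib(rows):
    """Free memory per GPU index, in MiB."""
    return {index: total - used for index, used, total in rows}

if __name__ == "__main__":
    for index, free in free_mib(parse_gpu_memory(SAMPLE)).items():
        print(f"GPU {index}: {free} MiB free")
```

With the sample above, none of the three GPUs has its full 48 GiB free, which is exactly the kind of latent usage an EM solver with a large memory footprint can trip over.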

One minor thing: I’m quite familiar with Isaac Sim, which has a terminal showing the simulation’s log activity, including errors. Where can I see the log activity in AODT?

Thank you.

@deny_na Thank you for the useful details. When you start the worker on your back end with the docker-compose command, it automatically starts with its logs displayed on the console. There are no additional logs written out by the worker. Can you please send those console logs?
Which MIG GPU instance are you using for AODT? Are other GPUs in the MIG configuration being used by someone else?

Thank you for your reply.

The output from the console is:

2024-06-25 14:11:39 [Error] [aerial_sim.configuration.utils] Problem initializing EM solver: out of memory.
2024-06-25 14:11:39 [Error] [aerial_sim.configuration.utils] Recommend checking your scenario settings and GPU memory usage on compute node.
2024-06-25 14:11:39 [Error] [aerial_sim.configuration.worker_manager] [bc14473c-6eb3-1046-c1ed-9f821dee5626] Problem initializing EM solver: out of memory.
2024-06-25 14:11:39 [Error] [aerial_sim.configuration.worker_manager] Recommend checking your scenario settings and GPU memory usage on compute node.

My scenario is set up as follows:

Thanks

@deny_na
It seems that the GPUs are shared: there is latent memory usage on all three devices, so you are probably not meeting the 48 GB memory requirement. Can you run AODT exclusively on any one GPU, so that it has access to the full 48 GB of memory?
Alternatively, you can scale down the size of your deployment: reduce the number of emitted paths to 10 and the number of scene interactions to 3.
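For the exclusive-GPU suggestion, one common approach is to set `CUDA_VISIBLE_DEVICES` in the worker's environment so the process sees only one device. The sketch below illustrates the bare-process case with a placeholder command, not the actual AODT launch line; for a containerized worker you would instead restrict the device in the compose file's GPU reservation.

```python
import os
import subprocess

def gpu_env(gpu_index):
    """Environment that exposes only one GPU to a CUDA process.

    CUDA enumerates only the devices listed in CUDA_VISIBLE_DEVICES,
    so the process gets that GPU's full memory to itself (provided
    nothing else is already running on it).
    """
    return dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))

def launch_worker(command, gpu_index):
    """Launch a worker process pinned to a single GPU."""
    return subprocess.Popen(command, env=gpu_env(gpu_index))

# Example (placeholder command; substitute your actual worker launch line):
# launch_worker(["docker", "compose", "up", "worker"], gpu_index=0)
```

The key point is that the restriction is inherited by the child process, so the worker never even sees the other two GPUs.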

Hello,
Thank you for the answer.

I have since reinstalled Ubuntu: previously I was using the desktop version, and now I am using Ubuntu Server 22.04.3. As a result, the GPUs have no latent memory usage. From nvidia-smi I get:

quadroserver@quadro:~$ nvidia-smi
Thu Jun 27 04:44:53 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:17:00.0 Off |                  Off |
| 33%   49C    P8              30W / 260W |      1MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                Off | 00000000:73:00.0 Off |                  Off |
| 34%   38C    P8              15W / 260W |      1MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000                Off | 00000000:D5:00.0 Off |                  Off |
| 33%   38C    P8              17W / 260W |      1MiB / 49152MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

After opening AODT, the GPU usage is:

quadroserver@quadro:~$ nvidia-smi
Thu Jun 27 08:49:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 8000                Off | 00000000:17:00.0 Off |                  Off |
| 33%   49C    P8              34W / 260W |  43985MiB / 49152MiB |     33%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Quadro RTX 8000                Off | 00000000:73:00.0  On |                  Off |
| 34%   42C    P8              29W / 260W |   2577MiB / 49152MiB |     11%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Quadro RTX 8000                Off | 00000000:D5:00.0 Off |                  Off |
| 33%   37C    P8              19W / 260W |   2577MiB / 49152MiB |     14%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     70969      C   ./aodt_sim                                40790MiB |
|    0   N/A  N/A     73043    C+G   ...cal/share/ov/pkg/asim-1.0.0/kit/kit     3187MiB |
|    1   N/A  N/A     73043    C+G   ...cal/share/ov/pkg/asim-1.0.0/kit/kit     2570MiB |
|    2   N/A  N/A     73043    C+G   ...cal/share/ov/pkg/asim-1.0.0/kit/kit     2570MiB |
+---------------------------------------------------------------------------------------+

So when I set up the simulation accordingly and execute UE mobility, a few seconds later there is again an error message from Worker(1): “Problem initializing EM solver: out of memory. Recommend checking your scenario settings and GPU memory usage on compute node.”

The console outputs the same logs:

2024-06-27 08:48:01  [Error] [aerial_sim.configuration.utils] Problem initializing EM solver: out of memory.
2024-06-27 08:48:01  [Error] [aerial_sim.configuration.utils] Recommend checking your scenario settings and GPU memory usage on compute node.
2024-06-27 08:48:01  [Error] [aerial_sim.configuration.worker_manager] [1bf3adf0-dc51-21d9-2463-718edfc2f038] Problem initializing EM solver: out of memory.
2024-06-27 08:48:01  [Error] [aerial_sim.configuration.worker_manager] Recommend checking your scenario settings and GPU memory usage on compute node.

I wonder how many GB of GPU memory are needed to run these simulations.
And do you think my hardware does not meet the minimum requirements?

Thank you

Hello,
Actually, we are very interested in using AODT, and we may acquire a newer GPU type with larger memory to cope with the problem. However, since we still can’t run the simulation on our own, I have some questions about the simulation.

  1. In the EM simulation, are refraction events also simulated? For example, when the EM wave hits a wall, some energy can penetrate through it, so a UE inside a closed room can still receive the RF signal.
  2. Can AODT run in real time with acceptable speed, generating the EM field at a given UE position?

And regarding my simulation, is there a smaller example that can run on a 48 GB RTX 8000 GPU?

Thank you

Hi @deny_na

  1. Refraction is not supported in the current version of AODT. We may consider introducing it in the future.
  2. AODT is not a real-time tool. However, it is many orders of magnitude faster than classical simulation tools.

48 GB is sufficient; your issue does not appear to be memory. We have not qualified the RTX 8000, so we are taking a deeper look into your specific deployment. In the meantime, as I mentioned previously, you can try the following: reduce the number of emitted paths to 10 and the number of scene interactions to 3.
We will get back to you with a diagnosis of your issue as soon as possible.

@deny_na The RTX 8000 has a compute capability of 7.5, and we don’t support any compute capability below 8.0. We recommend the GPUs listed in our documentation. You can look up compute capabilities at CUDA GPUs - Compute Capability | NVIDIA Developer. Unfortunately, we don’t have a recommendation for a lower-end GPU at this time.
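For anyone who wants to check a machine against this floor before installing: recent drivers let nvidia-smi report the compute capability directly, and comparing version tuples does the rest. This is an illustrative sketch, not an AODT tool; it parses a captured sample (the RTX 8000 reports 7.5) so it runs without a GPU, with the live query shown in a comment.

```python
# Check each GPU against a minimum compute capability.
# Live query (supported by recent drivers):
#   nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
SAMPLE = """\
Quadro RTX 8000, 7.5
Quadro RTX 8000, 7.5
Quadro RTX 8000, 7.5
"""

MIN_COMPUTE_CAP = (8, 0)  # AODT requires compute capability 8.0 or higher

def parse_compute_caps(csv_text):
    """Return a list of (gpu_name, (major, minor)) tuples."""
    caps = []
    for line in csv_text.strip().splitlines():
        name, cap = (field.strip() for field in line.split(","))
        major, minor = (int(part) for part in cap.split("."))
        caps.append((name, (major, minor)))
    return caps

def unsupported_gpus(csv_text, minimum=MIN_COMPUTE_CAP):
    """Names of GPUs below the minimum compute capability.

    Tuples compare element-wise, so (7, 5) < (8, 0) gives the right
    answer for capabilities like 7.5 vs 8.0.
    """
    return [name for name, cap in parse_compute_caps(csv_text) if cap < minimum]

if __name__ == "__main__":
    print(unsupported_gpus(SAMPLE))
```

With the sample above, all three RTX 8000s come back as unsupported, matching the diagnosis in this thread.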
