Worker consistently losing connection

2025-03-27 15:51:02 [201,604ms] [Error] [aodt.common.utils] Worker has lost connection.

I am consistently getting this error and I do not know where to start troubleshooting it. I can see the backend GPU being used for the job, but then the worker drops its connection.

It is also frequently accompanied by this warning:

2025-03-27 15:55:30 [469,572ms] [Warning] [aodt.configuration.worker_manager] [3da5e499-c067-a54b-d348-4dc6fbe920bd] Cannot find A* path from: (-1874.88, 5796.81, 310) to: (2254.52, 671.217, 2110)
2025-03-27 15:55:30 [469,687ms] [Warning] [aodt.configuration.worker_manager] [3da5e499-c067-a54b-d348-4dc6fbe920bd] Cannot find A* path from: (15183.7, 4940.4, 20410) to: (12930.1, 5670.96, 21310)
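
As a possible starting point for triage, here is a minimal sketch that tallies log entries by level and logger, so you can see what was being reported in the run-up to the disconnect. The regular expression only assumes the line layout visible in the excerpts above (timestamp, elapsed milliseconds, level, logger, message); `parse_line` and `tally` are hypothetical helper names, not part of AODT.

```python
import re
from collections import Counter

# Layout assumed from the excerpts above:
# <date> <time> [<elapsed>ms] [<Level>] [<logger>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<elapsed>[\d,]+)ms\] "
    r"\[(?P<level>\w+)\] "
    r"\[(?P<logger>[^\]]+)\] "
    r"(?P<msg>.*)$"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # "201,604" -> 201604
    record["elapsed_ms"] = int(record.pop("elapsed").replace(",", ""))
    return record


def tally(lines):
    """Count log entries per (level, logger) pair, skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None:
            counts[(record["level"], record["logger"])] += 1
    return counts
```

Calling `tally(open(path))` on a saved log (the path is whatever file you exported the console output to) gives a quick view of which loggers are producing the errors and warnings.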

Hi @michael.chiaramonte
What are your backend and frontend GPU configurations?
Are they running on the same server or on separate machines?

My configuration is the Microsoft Azure specification from the installation documentation. I deployed it by following those instructions, so it is two separate VMs. Based on the virtual machine configurations in the Azure installation instructions, my backend has the A100 80GB GPU and the frontend has the A10.

When I reach a certain simulation size, the “backend-connector-1” Docker container crashes and restarts, which is what causes the worker to disconnect. I am not sure whether the simulation is simply too large or whether some parsing error is causing the crash, but it could be a memory issue. I also do not know what the theoretical simulation size limit is for the recommended Azure configuration.
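
One way to narrow down the memory question is to look at what Docker recorded for the container. The sketch below is only an illustration: it shells out to `docker inspect` and reads the `State.OOMKilled`, `State.ExitCode`, and `RestartCount` fields from its JSON output. The helper names `summarize_state` and `inspect_container` are hypothetical, and it assumes the container is literally named `backend-connector-1` on the backend VM.

```python
import json
import subprocess


def summarize_state(inspect_json):
    """Extract restart and OOM fields from `docker inspect <container>` output."""
    info = json.loads(inspect_json)[0]  # docker inspect prints a JSON array
    state = info.get("State", {})
    return {
        "status": state.get("Status"),
        "oom_killed": state.get("OOMKilled"),
        "exit_code": state.get("ExitCode"),
        "restart_count": info.get("RestartCount"),
    }


def inspect_container(name):
    """Run `docker inspect` for one container and summarize its state."""
    result = subprocess.run(
        ["docker", "inspect", name],
        capture_output=True,
        text=True,
        check=True,
    )
    return summarize_state(result.stdout)


if __name__ == "__main__":
    # Container name taken from the post above; adjust if yours differs.
    print(inspect_container("backend-connector-1"))
```

An exit code of 137 means the process was killed with SIGKILL, which is often (though not always) the kernel's out-of-memory killer. The reported state may change once the container has restarted, so it is worth checking soon after a crash and comparing with `docker logs backend-connector-1`.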

Hi @michael.chiaramonte
Please try to reduce the simulation size to 10 RUs with 100 UEs and a 4TRX antenna configuration and see if the connection issue still occurs.

I was able to fix the issue. It was the size of the simulation that was making it crash.