[Nvflare] Jobs not being submitted to the correct server when hosting two servers on the same machine

Description

Hello,

I’m trying to set up two NVIDIA FLARE servers on the same physical machine, but jobs submitted to one server end up on the other, where they fail job signature verification.

I have configured two NVIDIA FLARE servers on the same physical machine. They share the same FQDN but use different ports. Let’s call them Server A and Server B. Server A uses port 8004 as fed_learn_port and port 8005 as admin_port, while Server B uses ports 8020 and 8021, respectively. Server A’s clients run on the same physical machine as Servers A and B, while Server B’s clients run on two other machines.

Every time I try to submit a job to Server B through its admin command-line interface (CLI), the job never reaches Server B (nothing appears in its logs). Instead it reaches Server A, where it fails job signature verification:

DefaultJobScheduler - INFO - [identity=my-project-A, run=?]: Try to schedule job c14323f5-ced0-4ab6-b57c-fad40017ea55, get result: (scheduled).
JobRunner - INFO - [identity=my-project-A, run=?]: Got the job: c14323f5-ced0-4ab6-b57c-fad40017ea55 from the scheduler to run
JobRunner - ERROR - [identity=my-project-A, run=?]: Failed to run the Job (c14323f5-ced0-4ab6-b57c-fad40017ea55): RuntimeError: Failed to verify app ‘my-app’: job signature verification failed

In Server B’s admin JSON, the overseer_agent has the correct sp_end_point set:

"FQDN-Name:8020:8021"
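To rule out a stale or mismatched endpoint in the admin startup kit, the value above can be checked programmatically. The following is only a minimal sketch: it assumes the sp_end_point string has the form host:fed_learn_port:admin_port (as the ports listed above suggest), and it searches the whole JSON tree for the key instead of assuming where it is nested. The names `EndPoint`, `parse_sp_end_point`, `find_sp_end_points`, and `check_admin_config` are hypothetical helpers, not part of NVIDIA FLARE, and the sample JSON is a stand-in for the real file.

```python
import json
from typing import Any, Iterator, List, NamedTuple


class EndPoint(NamedTuple):
    host: str
    fed_learn_port: int
    admin_port: int


def parse_sp_end_point(value: str) -> EndPoint:
    """Split 'host:fed_learn_port:admin_port' into its three parts (assumed format)."""
    host, fl_port, admin_port = value.rsplit(":", 2)
    return EndPoint(host, int(fl_port), int(admin_port))


def find_sp_end_points(node: Any) -> Iterator[str]:
    """Yield every 'sp_end_point' string found anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "sp_end_point" and isinstance(child, str):
                yield child
            else:
                yield from find_sp_end_points(child)
    elif isinstance(node, list):
        for child in node:
            yield from find_sp_end_points(child)


def check_admin_config(config: Any, expected_fl: int, expected_admin: int) -> List[str]:
    """Return one message per endpoint whose ports differ from the expected pair."""
    problems = []
    for raw in find_sp_end_points(config):
        ep = parse_sp_end_point(raw)
        if (ep.fed_learn_port, ep.admin_port) != (expected_fl, expected_admin):
            problems.append(f"{raw}: expected ports {expected_fl}/{expected_admin}")
    return problems


if __name__ == "__main__":
    # Hypothetical stand-in for Server B's admin JSON; the real nesting may differ.
    # For the real file, use json.load() on the admin kit's JSON instead.
    sample = json.loads(
        '{"admin": {"overseer_agent": {"args": {"sp_end_point": "FQDN-Name:8020:8021"}}}}'
    )
    print(check_admin_config(sample, 8020, 8021))
```

An empty list means every sp_end_point found uses Server B’s ports. Running the same check against each startup kit (admin and clients) would show whether any of them still points at Server A’s ports 8004/8005.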

Could you please help me understand why Server B’s jobs are sent to Server A, and how to fix this problem?

Thanks in advance,

DI MARIA, Franco Martin

Environment

Nvflare Version: 2.3.8
GPU Type: Quadro RTX 6000
Nvidia Driver Version: 460.80
CUDA Version: 11.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04.3 LTS
Python Version (if applicable): 3.8.0

Hi @250002547,
This forum covers issues related to TensorRT, so I’m afraid I might not be able to help here.

Thanks