[Nvflare] Jobs not being submitted to the correct server when hosting two servers on the same machine

250002547 · June 19, 2024, 8:24am

Description

Hello,

I’m trying to set up two NVIDIA FLARE servers on the same physical machine, but jobs submitted to one server end up on the other, where they fail job signature verification.

I have configured two NVIDIA FLARE servers on the same physical machine with the same FQDN but different ports. Let’s call them Server A and Server B. Server A uses port 8004 as fed_learn_port and port 8005 as admin_port, while Server B uses ports 8020 and 8021, respectively. Server A’s clients run on the same physical machine as Servers A and B, while Server B’s clients run on two other machines.

Every time I submit a job to Server B through its admin command-line interface (CLI), the job never reaches Server B (nothing appears in its logs). Instead it reaches Server A, where it fails job signature verification:

DefaultJobScheduler - INFO - [identity=my-project-A, run=?]: Try to schedule job c14323f5-ced0-4ab6-b57c-fad40017ea55, get result: (scheduled).
JobRunner - INFO - [identity=my-project-A, run=?]: Got the job: c14323f5-ced0-4ab6-b57c-fad40017ea55 from the scheduler to run
JobRunner - ERROR - [identity=my-project-A, run=?]: Failed to run the Job (c14323f5-ced0-4ab6-b57c-fad40017ea55): RuntimeError: Failed to verify app ‘my-app’: job signature verification failed

In server B admin JSON, the overseer_agent has the correct sp_end_point set as:

"FQDN-Name:8020:8021"
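
To double-check which endpoint an admin startup kit actually points at, a minimal Python sketch follows. It assumes the value has the form host:fl_port:admin_port, as in the string above, and it searches the parsed JSON for any sp_end_point key rather than assuming a fixed layout. The helper names (ServicePoint, parse_sp_end_point, find_sp_end_points) and the sample JSON fragment are hypothetical and are not part of NVIDIA FLARE.

```python
import json
from typing import Any, Iterator, NamedTuple


class ServicePoint(NamedTuple):
    host: str
    fl_port: int      # port the FL clients connect to
    admin_port: int   # port the admin console connects to


def parse_sp_end_point(value: str) -> ServicePoint:
    """Split 'host:fl_port:admin_port' into its three parts."""
    # rsplit from the right so that only the last two colons are treated as separators
    host, fl_port, admin_port = value.strip().rsplit(":", 2)
    return ServicePoint(host, int(fl_port), int(admin_port))


def find_sp_end_points(node: Any) -> Iterator[str]:
    """Yield every 'sp_end_point' string found anywhere in parsed JSON."""
    if isinstance(node, dict):
        for key, child in node.items():
            if key == "sp_end_point" and isinstance(child, str):
                yield child
            else:
                yield from find_sp_end_points(child)
    elif isinstance(node, list):
        for child in node:
            yield from find_sp_end_points(child)


# Hypothetical config fragment; only the endpoint value is taken from this post.
admin_config = json.loads(
    '{"overseer_agent": {"args": {"sp_end_point": "FQDN-Name:8020:8021"}}}'
)
endpoints = [parse_sp_end_point(v) for v in find_sp_end_points(admin_config)]
print(endpoints)
```

Loading each admin kit's JSON file with json.load in place of the inline fragment and comparing the resulting admin_port values shows whether the console being launched is configured for 8021 (Server B) or 8005 (Server A).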

Could you please help me understand why jobs for Server B are sent to Server A, and how to fix this?

Thanks in advance,

DI MARIA, Franco Martin

Environment

Nvflare Version: 2.3.8
GPU Type: Quadro RTX 6000
Nvidia Driver Version: 460.80
CUDA Version: 11.2
CUDNN Version: 7.6.5
Operating System + Version: Ubuntu 18.04.3 LTS
Python Version (if applicable): 3.8.0

AakankshaS · June 29, 2024, 11:32am

Hi @250002547 ,
This forum covers issues related to TensorRT. I am afraid I might not be able to help here.

Thanks
