Fine-tuning PeopleNet ResNet-34 on AWS: "failed to connect to vfs socket"

• Hardware: AWS EC2 g4dn.xlarge
• Network Type: peoplenet_vtrainable_v2.5 resnet34_peoplenet.tlt
• TLT Version: TAO version 5
• Training spec file
peoplenet34_heads.txt (3.1 KB)

• How to reproduce the issue:

Run the following with Python 3.8 in a Jupyter notebook:

tao model detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt \
    -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
    -n resnet18_detector \
    --gpus $NUM_GPUS \
    -k tlt_encode

• Error message:
2023-09-28 15:04:23,438 [TAO Toolkit] [INFO] tensorflow 692: global_step/sec: 1.78093
2023-09-28 15:04:26,926 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 7.085
INFO:tensorflow:epoch = 0.9122137404580152, learning_rate = 0.00049999997, loss = 0.013414521, step = 478 (5.830 sec)
2023-09-28 15:04:29,266 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9122137404580152, learning_rate = 0.00049999997, loss = 0.013414521, step = 478 (5.830 sec)
INFO:tensorflow:epoch = 0.933206106870229, learning_rate = 0.00049999997, loss = 0.016209295, step = 489 (6.025 sec)
2023-09-28 15:04:35,291 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.933206106870229, learning_rate = 0.00049999997, loss = 0.016209295, step = 489 (6.025 sec)
INFO:tensorflow:epoch = 0.9522900763358778, learning_rate = 0.00049999997, loss = 0.014432838, step = 499 (5.672 sec)
2023-09-28 15:04:40,964 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9522900763358778, learning_rate = 0.00049999997, loss = 0.014432838, step = 499 (5.672 sec)
2023-09-28 15:04:40,964 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 7.124
INFO:tensorflow:epoch = 0.9713740458015266, learning_rate = 0.00049999997, loss = 0.014740716, step = 509 (5.692 sec)
2023-09-28 15:04:46,656 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9713740458015266, learning_rate = 0.00049999997, loss = 0.014740716, step = 509 (5.692 sec)
INFO:tensorflow:epoch = 0.9904580152671756, learning_rate = 0.00049999997, loss = 0.016124992, step = 519 (5.687 sec)
2023-09-28 15:04:52,343 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9904580152671756, learning_rate = 0.00049999997, loss = 0.016124992, step = 519 (5.687 sec)
INFO:tensorflow:global_step/sec: 1.76236
2023-09-28 15:04:52,943 [TAO Toolkit] [INFO] tensorflow 692: global_step/sec: 1.76236
[1695913495.912078] [0ac105827284:216  :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '': Invalid argument
2023-09-28 15:04:56,003 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.evaluation.evaluation 130: step 0 / 58, 0.00s/step
Execution status: FAIL
2023-09-28 15:05:08,039 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Thanks to anyone who can help :)

The “failed to connect to vfs socket” warning is not the root cause; the job fails while running evaluation.
To narrow this down, please change

validation_fold: 0

to the following and retry:

validation_data_source: {
    tfrecords_path: "/workspace/tao-experiments/HFDData/tfrecordsOLD/coco_trainval/*"
    image_directory_path: "/workspace/tao-experiments/HFDData/HeadImages"
}
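
For context, a minimal sketch of how the surrounding dataset_config block in a DetectNet_v2 spec looks with this change. Only the validation_data_source paths above are taken from this thread; the training data_sources paths, image_extension, and the "head" class mapping are illustrative assumptions:

dataset_config {
  data_sources {
    tfrecords_path: "..."        # training tfrecords pattern goes here
    image_directory_path: "..."  # training image directory goes here
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "head"
    value: "head"
  }
  validation_data_source {
    tfrecords_path: "/workspace/tao-experiments/HFDData/tfrecordsOLD/coco_trainval/*"
    image_directory_path: "/workspace/tao-experiments/HFDData/HeadImages"
  }
}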

Hi Morganh,

I’ve made that change but I’ve got the same result as before:

INFO:tensorflow:epoch = 0.9725557461406518, learning_rate = 0.00049999997, loss = 0.014038825, step = 567 (5.926 sec)
2023-10-02 09:08:50,604 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9725557461406518, learning_rate = 0.00049999997, loss = 0.014038825, step = 567 (5.926 sec)
2023-10-02 09:08:54,839 [TAO Toolkit] [INFO] nvidia_tao_tf1.core.hooks.sample_counter_hook 76: Train Samples / sec: 6.798
INFO:tensorflow:epoch = 0.9897084048027444, learning_rate = 0.00049999997, loss = 0.015056454, step = 577 (5.914 sec)
2023-10-02 09:08:56,518 [TAO Toolkit] [INFO] tensorflow 260: epoch = 0.9897084048027444, learning_rate = 0.00049999997, loss = 0.015056454, step = 577 (5.914 sec)
INFO:tensorflow:global_step/sec: 1.71379
2023-10-02 09:08:58,275 [TAO Toolkit] [INFO] tensorflow 692: global_step/sec: 1.71379
[1696237740.724623] [5af009ef5b65:217  :f]        vfs_fuse.c:424  UCX  WARN  failed to connect to vfs socket '': Invalid argument
2023-10-02 09:09:00,810 [TAO Toolkit] [INFO] nvidia_tao_tf1.cv.detectnet_v2.evaluation.evaluation 130: step 0 / 582, 0.00s/step
Execution status: FAIL
2023-10-02 09:09:12,877 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 337: Stopping container.

Do you have any other suggestions?

Thanks for your help!

To narrow this down, could you run in a new terminal instead of the notebook?
Steps:

  1. Open a new terminal
  2. $ tao model detectnet_v2 run /bin/bash
  3. Then, inside the docker container, run the command:
    # detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt \
          -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
          -n resnet18_detector \
          --gpus $NUM_GPUS \
          -k tlt_encode
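
Note that $SPECS_DIR, $USER_EXPERIMENT_DIR and $NUM_GPUS from the notebook environment are not automatically defined inside the container shell, so export them (or replace them with real paths) first. A sketch of what that session might look like, with the export values as illustrative placeholders rather than paths taken from your setup:

  # inside the container opened by: tao model detectnet_v2 run /bin/bash
  export SPECS_DIR=/workspace/tao-experiments/specs                    # placeholder
  export USER_EXPERIMENT_DIR=/workspace/tao-experiments/detectnet_v2   # placeholder
  export NUM_GPUS=1
  detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt \
      -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
      -n resnet18_detector \
      --gpus $NUM_GPUS \
      -k tlt_encode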

Thanks @Morganh. Just tried this and got the following:

(base) ubuntu@ip-172-31-15-235:~/liamd_HFD/HFD$ docker run -it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@6eaddc16b73a:/workspace# 
root@6eaddc16b73a:/workspace# detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned-n resnet18_detector –gpus 1
2023-10-05 10:51:21.681482: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-10-05 10:51:22,032 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-10-05 10:51:26,802 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-05 10:51:26,940 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-05 10:51:26,961 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
Traceback (most recent call last):
  File "/usr/local/bin/detectnet_v2", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.detectnet_v2.scripts, "detectnet_v2", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/inference.py", line 19, in <module>
    from nvidia_tao_tf1.cv.detectnet_v2.inferencer.build_inferencer import build_inferencer
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/inferencer/build_inferencer.py", line 24, in <module>
    from nvidia_tao_tf1.cv.detectnet_v2.inferencer.trt_inferencer import DEFAULT_MAX_WORKSPACE_SIZE
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/inferencer/trt_inferencer.py", line 32, in <module>
    import pycuda.autoinit # noqa pylint: disable=unused-import
  File "/usr/local/lib/python3.8/dist-packages/pycuda/autoinit.py", line 1, in <module>
    import pycuda.driver as cuda
  File "/usr/local/lib/python3.8/dist-packages/pycuda/driver.py", line 66, in <module>
    from pycuda._driver import *  # noqa
ImportError: libcuda.so.1: cannot open shared object file: No such file or directory
root@6eaddc16b73a:/workspace# 

Please install the NVIDIA driver:
$ sudo apt install nvidia-driver-525
$ sudo reboot

Then try the previous steps again.

I ran the above successfully outside the container and then ran

tao model detectnet_v2 run /bin/bash

and I see the following error:

Can you run $ nvidia-smi and share the result?

Then, please
$ sudo apt purge nvidia* libnvidia*
$ sudo apt install nvidia-driver-525 nvidia-container-toolkit

Fri Oct  6 08:27:23 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   38C    P0    26W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Please
$ sudo apt purge nvidia* libnvidia*
$ sudo apt install nvidia-driver-525 nvidia-container-toolkit
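
If the container still reports that the NVIDIA driver was not detected after this, the NVIDIA runtime may also need to be registered with Docker. A minimal sketch of that extra step, assuming the nvidia-ctk tool that ships with recent nvidia-container-toolkit packages:

$ sudo nvidia-ctk runtime configure --runtime=docker
$ sudo systemctl restart docker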

OK, I’ve done that and then executed steps 2 and 3 from above again:

(launcher3.8) (base) ubuntu@ip-172-31-15-235:~/liamd_HFD/HFD$ docker run -it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5

=======================
=== TAO Toolkit TF1 ===
=======================

NVIDIA Release 5.0.0-TF1 (build 52693369)
TAO Toolkit Version 5.0.0

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the TAO Toolkit End User License Agreement.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/tao-toolkit-software-license-agreement

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TAO Toolkit.  NVIDIA recommends the use of the following flags:
   docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...

root@bd23a4b37042:/workspace# detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt -r $USER_EXPERIMENT_DIR/experi
ment_dir_unpruned -n resnet18_detector –gpus 1
2023-10-06 15:14:16.946714: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-10-06 15:14:16,998 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-10-06 15:14:18,606 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:14:18,647 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:14:18,651 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
Traceback (most recent call last):
  File "/usr/local/bin/detectnet_v2", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/entrypoint/detectnet_v2.py", line 12, in main
    launch_job(nvidia_tao_tf1.cv.detectnet_v2.scripts, "detectnet_v2", sys.argv[1:])
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 276, in launch_job
    modules = get_modules(package)
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/common/entrypoint/entrypoint.py", line 47, in get_modules
    module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
  File "<frozen importlib._bootstrap>", line 991, in _find_and_load
  File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/scripts/inference.py", line 19, in <module>
    from nvidia_tao_tf1.cv.detectnet_v2.inferencer.build_inferencer import build_inferencer
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/inferencer/build_inferencer.py", line 24, in <module>
    from nvidia_tao_tf1.cv.detectnet_v2.inferencer.trt_inferencer import DEFAULT_MAX_WORKSPACE_SIZE
  File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_tf1/cv/detectnet_v2/inferencer/trt_inferencer.py", line 32, in <module>
    import pycuda.autoinit # noqa pylint: disable=unused-import
  File "/usr/local/lib/python3.8/dist-packages/pycuda/autoinit.py", line 1, in <module>
    import pycuda.driver as cuda
  File "/usr/local/lib/python3.8

Please use the command below:
$ docker run --runtime=nvidia -it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5
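
If that works, the shared-memory flags from the NOTE in the container banner can be added to the same command, along with any -v mounts needed to make the spec file and data visible inside the container (a sketch; the mount paths are illustrative placeholders):

$ docker run --runtime=nvidia --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
      -v /home/ubuntu/tao-experiments:/workspace/tao-experiments \
      -it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5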

OK, I’ve tried again with $ docker run --runtime=nvidia -it nvcr.io/nvidia/tao/tao-toolkit:5.0.0-tf1.15.5:

root@669cf0542c75:/workspace# detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -n resnet18_detector –gpus 1
2023-10-06 15:34:06.341459: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-10-06 15:34:06,393 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-10-06 15:34:07,912 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:34:07,950 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:34:07,953 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
usage: detectnet_v2 train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]]
                          [--use_amp] [--log_file LOG_FILE] [-e EXPERIMENT_SPEC_FILE] [-r RESULTS_DIR] [-n MODEL_NAME]
                          [-v] [-k KEY] [--enable_determinism]
                          {train,prune,inference,export,evaluate,dataset_convert,calibration_tensorfile} ...
detectnet_v2 train: error: argument /tasks: invalid choice: '–gpus' (choose from 'train', 'prune', 'inference', 'export', 'evaluate', 'dataset_convert', 'calibration_tensorfile')
root@669cf0542c75:/workspace# 

Please set it to
--gpus 1
(with two ASCII hyphens), or omit the flag, and retry.

OK, I’ve added the extra dash:

root@669cf0542c75:/workspace# detectnet_v2 train -e $SPECS_DIR/peoplenet34_heads.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -n resnet18_detector -–gpus 1
2023-10-06 15:41:14.161529: I tensorflow/stream_executor/platform/default/dso_loader.cc:50] Successfully opened dynamic library libcudart.so.12
2023-10-06 15:41:14,211 [TAO Toolkit] [WARNING] tensorflow 40: Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
Using TensorFlow backend.
2023-10-06 15:41:15,669 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use sklearn by default. This improves performance in some cases. To enable sklearn export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:41:15,705 [TAO Toolkit] [WARNING] tensorflow 42: TensorFlow will not use Dask by default. This improves performance in some cases. To enable Dask export the environment variable  TF_ALLOW_IOLIBS=1.
2023-10-06 15:41:15,708 [TAO Toolkit] [WARNING] tensorflow 43: TensorFlow will not use Pandas by default. This improves performance in some cases. To enable Pandas export the environment variable  TF_ALLOW_IOLIBS=1.
usage: detectnet_v2 train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS] [--gpu_index GPU_INDEX [GPU_INDEX ...]]
                          [--use_amp] [--log_file LOG_FILE] [-e EXPERIMENT_SPEC_FILE] [-r RESULTS_DIR] [-n MODEL_NAME]
                          [-v] [-k KEY] [--enable_determinism]
                          {train,prune,inference,export,evaluate,dataset_convert,calibration_tensorfile} ...
detectnet_v2 train: error: argument /tasks: invalid choice: '1' (choose from 'train', 'prune', 'inference', 'export', 'evaluate', 'dataset_convert', 'calibration_tensorfile')
root@669cf0542c75:/workspace# 

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Can you set an explicit path and retry?
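
For example, something along these lines, with $SPECS_DIR and $USER_EXPERIMENT_DIR replaced by the real locations inside the container. The paths below are illustrative placeholders, and note that --gpus is spelled with two ASCII hyphens:

# detectnet_v2 train -e /workspace/tao-experiments/specs/peoplenet34_heads.txt \
      -r /workspace/tao-experiments/detectnet_v2/experiment_dir_unpruned \
      -n resnet18_detector \
      --gpus 1 \
      -k tlt_encode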
