Please provide the following information when requesting support.
• Hardware: GeForce RTX 4070
• Network Type: Classification (classification_pyt, pretrained_fan_hybrid_small)
• TAO Version: 5.2
I’m trying to run the train step of the classification_pyt example notebook without any changes. When I run the train command, I get the following error:
OCI runtime exec failed: exec failed: unable to start container process: exec: "classification_pyt": executable file not found in $PATH: unknown
2024-01-02 15:42:42,245 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.
How can I solve this error and train the model using this notebook?
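For reference, this is how I checked which container the launcher maps classification_pyt to, and made sure the launcher itself is current; a sketch of my diagnostics, not a fix (if I'm reading the launcher docs correctly):

# Show the launcher version and the docker image each task group is mapped to
tao info --verbose

# Update the launcher package, in case the image it pulls
# predates the classification_pyt entrypoint
pip install --upgrade nvidia-tao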
Using this new notebook and running classification_pyt train with its spec file, I get the following error:
Error executing job with overrides: []
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 132, in main
raise e
File "/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py", line 118, in main
results_dir = os.path.join(cfg.results_dir, "train")
omegaconf.errors.MissingMandatoryValue: Missing mandatory value: results_dir
full_key: results_dir
object_type=ExperimentConfig
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 373) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/usr/local/lib/python3.8/dist-packages/nvidia_tao_pytorch/cv/classification/scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-01-04_08:20:25
host : edba0f4d209a
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 373)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: FAIL
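From the traceback, Hydra is missing the mandatory top-level results_dir value of the ExperimentConfig. A sketch of the two ways I tried to supply it (the paths are placeholders from my setup, not the notebook's):

# Option 1: add a top-level entry to the experiment spec (YAML):
#   results_dir: /workspace/results/classification_experiment
# Option 2: pass it on the command line with the launcher's -r flag:
tao model classification_pyt train -e <spec_file> -r /workspace/results/classification_experiment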
With the -r argument it runs, but at the end of the epochs it prints the following warning:
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: module 'urllib3.exceptions' has no attribute 'SubjectAltNameWarning'
Execution status: PASS
It saves a .pth checkpoint for each epoch and some JSON files, but no .tlt files or equivalent. Is that expected?
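In case it's relevant, the saved files do look like ordinary PyTorch checkpoints; this is how I inspected one (the filename is just a placeholder for whatever the run writes):

# List the top-level keys of a saved checkpoint
python -c "import torch; ckpt = torch.load('epoch_009.pth', map_location='cpu'); print(list(ckpt.keys()))"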