Why cv task cannot work with NVIDIA TAO Toolkit 3.0

(launcher) xgy@xgy:~$ tao detectnet_v2 train --help
2021-09-08 14:36:21,791 [INFO] root: Registry: ['nvcr.io']
2021-09-08 14:36:21,882 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/xgy/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
2021-09-08 14:36:23,290 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

It does not show the help info.

I cannot reproduce your result.

$ tao detectnet_v2 train --help
~/.tao_mounts.json wasn't found. Falling back to obtain mount points and docker configs from ~/.tlt_mounts.json.
Please note that this will be deprecated going forward.
2021-09-08 14:57:15,369 [INFO] root: Registry: ['nvcr.io']
2021-09-08 14:57:19,576 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/morganh/.tlt_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
Using TensorFlow backend.
usage: detectnet_v2 train [-h] [--num_processes NUM_PROCESSES] [--gpus GPUS]
                          [--gpu_index GPU_INDEX [GPU_INDEX ...]] [--use_amp]
                          [--log_file LOG_FILE] [-e EXPERIMENT_SPEC_FILE]
                          [-r RESULTS_DIR] [-n MODEL_NAME] [-v] -k KEY
                          {calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}

optional arguments:
  -h, --help            show this help message and exit
  --num_processes NUM_PROCESSES, -np NUM_PROCESSES
                        The number of horovod child processes to be spawned.
                        Default is -1 (equal to --gpus).
  --gpus GPUS           The number of GPUs to be used for the job.
  --gpu_index GPU_INDEX [GPU_INDEX ...]
                        The indices of the GPUs to be used.
  --use_amp             Flag to enable Auto Mixed Precision.
  --log_file LOG_FILE   Path to the output log file.
  -e EXPERIMENT_SPEC_FILE, --experiment_spec_file EXPERIMENT_SPEC_FILE
                        Path to spec file. Absolute path or relative to
                        working directory. If not specified, default spec from
                        spec_loader.py is used.
  -r RESULTS_DIR, --results_dir RESULTS_DIR
                        Path to a folder where experiment outputs should be
                        written.
  -n MODEL_NAME, --model_name MODEL_NAME
                        Name of the model file. If not given, then defaults to
                        model.hdf5.
  -v, --verbose         Set verbosity level for the logger.
  -k KEY, --key KEY     The key to load pretrained weights and save
                        intermediate snapshopts and final model.

tasks:
{calibration_tensorfile,dataset_convert,evaluate,export,inference,prune,train}

Please try again. Or log in to the docker container directly to check:
$ tao detectnet_v2 run /bin/bash
# detectnet_v2 train --help

root@f1419be94cb4:/workspace# detectnet_v2 train --help
Illegal instruction (core dumped)
root@f1419be94cb4:/workspace#

It shows something is wrong. How can I fix it?

It seems that your CPU is a bit old. What is the CPU info?
Also, please search for "Illegal instruction" in the TAO forum. Some users have hit the same issue previously. Unfortunately, TAO does not support such CPUs.
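As a quick check (my suggestion, not from the official docs): "Illegal instruction (core dumped)" in the TAO containers is usually caused by the CPU lacking the AVX instruction set, since the bundled TensorFlow build is compiled with AVX enabled. You can check the host CPU flags like this:

```shell
# Check whether the CPU exposes AVX. The TensorFlow build inside the
# TAO containers is compiled with AVX, so a missing "avx" flag here
# typically leads to "Illegal instruction (core dumped)".
if grep -q -m1 '\bavx\b' /proc/cpuinfo 2>/dev/null; then
    echo "AVX supported"
else
    echo "AVX missing"
fi
```

If it prints "AVX missing", the container will crash as you observed.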

But it works well for conversational AI tasks:
(launcher) xgy@xgy:~$ tao text_classification dataset_convert -h
2021-09-08 15:08:41,843 [INFO] root: Registry: ['nvcr.io']
2021-09-08 15:08:41,929 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/xgy/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
[NeMo W 2021-09-08 07:08:46 experimental:27] Module <class 'nemo.collections.nlp.modules.common.megatron.megatron_bert.MegatronBertEncoder'> is experimental, not ready for production and is not fully supported. Use at your own risk.
INFO: Generating new fontManager, this may take some time...
usage: text_classification [-h] -r RESULTS_DIR [-k KEY] [-e EXPERIMENT_SPEC_FILE] [-g GPUS] [-m RESUME_MODEL_WEIGHTS] [-o OUTPUT_SPECS_DIR]
{dataset_convert,evaluate,export,finetune,infer,infer_onnx,train,download_specs}

Train Adapt Optimize Toolkit

positional arguments:
  {dataset_convert,evaluate,export,finetune,infer,infer_onnx,train,download_specs}
                        Subtask for a given task/model.

optional arguments:
  -h, --help            show this help message and exit
  -r RESULTS_DIR, --results_dir RESULTS_DIR
                        Path to a folder where the experiment outputs should be written. (DEFAULT: ./)
  -k KEY, --key KEY     User specific encoding key to save or load a .tlt model.
  -e EXPERIMENT_SPEC_FILE, --experiment_spec_file EXPERIMENT_SPEC_FILE
                        Path to the experiment spec file.
  -g GPUS, --gpus GPUS  Number of GPUs to use. The default value is 1.
  -m RESUME_MODEL_WEIGHTS, --resume_model_weights RESUME_MODEL_WEIGHTS
                        Path to a pre-trained model or model to continue training.
  -o OUTPUT_SPECS_DIR, --output_specs_dir OUTPUT_SPECS_DIR
                        Path to a target folder where experiment spec files will be downloaded.
2021-09-08 15:08:47,234 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

What is the CPU info?

My environment is virtualized with Proxmox. Can it still be used? Here is the CPU info:
xgy@xgy:/data/xgy/worksapce/tao/cv_samples_v1.2.0$ cat /proc/cpuinfo | grep name | cut -f2 -d: | uniq -c
16 Common KVM processor

processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 6
model name : Common KVM processor
stepping : 1
microcode : 0x1
cpu MHz : 2800.000
cache size : 16384 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx lm constant_tsc nopl xtopology cpuid tsc_known_freq pni cx16 x2apic hypervisor lahf_lm cpuid_fault pti
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit
bogomips : 5600.00
clflush size : 64
cache_alignment : 128
address sizes : 40 bits physical, 48 bits virtual
power management:

Please try to find the info of the real CPU instead of the KVM one.

OK, thank you.

I fixed the problem. The virtual CPU did not have the AVX instruction set, while TensorFlow is compiled with AVX. After adding the AVX instruction set to the KVM CPU, it works well.
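For anyone hitting the same issue, a sketch of one way to do this in Proxmox (the VM ID 100 below is just an example; substitute your own): switch the virtual CPU type from the default emulated model to "host", which passes the physical CPU's flags, including AVX, through to the guest.

```shell
# Run on the Proxmox node (not inside the VM).
# Switch the virtual CPU type to "host" so the guest sees the physical
# CPU's instruction sets, including AVX. 100 is an example VM ID.
qm set 100 --cpu host

# Reboot the VM, then verify AVX is visible inside the guest:
grep -m1 -o avx /proc/cpuinfo
```

Note that this requires the physical host CPU to actually support AVX; "host" only passes through what the hardware provides.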