Error when evaluating PointPillar network

Hi, I’m trying to evaluate my own model, which was trained with the TAO Toolkit.
However, a numba error occurs.

Since the TAO Toolkit runs everything inside its own Docker environment, I didn’t touch any other dependencies.

Has anyone seen the same error?

System information
• Hardware: RTX3090
• Network Type: PointPillar
Here are the results of tao info:

Configuration of the TAO Toolkit Instance
dockers: ['nvidia/tao/tao-toolkit']
format_version: 2.0
toolkit_version: 4.0.1
published_date: 03/06/2023

When I evaluate my network with tao pointpillars evaluate -e $SPECS_DIR/pointpillars_cm.yaml -r $USER_EXPERIMENT_DIR -k $KEY, the following error occurs:

(launcher) ailab@3090-4:~/Project/04_HMG_AVC/TAO-PointPillars/pointpillars$ tao pointpillars evaluate -e $SPECS_DIR/pointpillars_cm.yaml -r $USER_EXPERIMENT_DIR -k $KEY
2023-05-31 13:24:01,936 [INFO] root: Registry: ['nvcr.io']
2023-05-31 13:24:01,982 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
2023-05-31 13:24:02,006 [WARNING] tlt.components.docker_handler.docker_handler:
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ailab/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
python /opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py --cfg_file /workspace/tao-experiments/specs/pointpillars_cm.yaml --output_dir /workspace/tao-experiments/pointpillars --key tlt_encode
2023-05-31 04:24:06,159 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: Start logging
2023-05-31 04:24:06,159 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: CUDA_VISIBLE_DEVICES=2
2023-05-31 04:24:08,532 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: Loading point cloud dataset
2023-05-31 04:24:08,576 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: Total samples for point cloud dataset: 2549
2023-05-31 04:24:08,576 [WARNING] root: 'decrypt_stream' is deprecated, to be removed in '0.7'. Please use 'eff.codec.decrypt_stream()' instead.
2023-05-31 04:24:10,273 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: *************** EPOCH 36 EVALUATION *****************
eval: 100%|████████████████████████| 2549/2549 [02:25<00:00, 17.56it/s, recall_0.3=(0, 28814) / 43040]
2023-05-31 04:26:35,438 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: *************** Performance of EPOCH 36 *****************
2023-05-31 04:26:35,438 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: Generate label finished(sec_per_example: 0.0569 second).
2023-05-31 04:26:35,439 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_roi_0.3: 0.000000
2023-05-31 04:26:35,439 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_rcnn_0.3: 0.669470
2023-05-31 04:26:35,439 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_roi_0.5: 0.000000
2023-05-31 04:26:35,440 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_rcnn_0.5: 0.661640
2023-05-31 04:26:35,440 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_roi_0.7: 0.000000
2023-05-31 04:26:35,440 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: recall_rcnn_0.7: 0.614986
2023-05-31 04:26:35,449 [INFO] nvidia_tao_pytorch.pointcloud.pointpillars.pcdet.utils.common_utils: Average predicted number of objects(2549 samples): 13.352
2023-05-31 04:26:37,331 [INFO] numba.cuda.cudadrv.driver: init
2023-05-31 04:26:37,618 [ERROR] numba.cuda.cudadrv.driver: Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 2160, in add_ptx
    driver.cuLinkAddData(self.handle, enums.CU_JIT_INPUT_PTX,
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 300, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 335, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/evaluate.py>", line 3, in <module>
  File "", line 231, in <module>
  File "", line 223, in main
  File "", line 71, in eval_single_ckpt
  File "", line 114, in eval_one_epoch
  File "", line 281, in evaluation
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/pcdet/datasets/kitti/kitti_object_eval_python/eval.py>", line 1, in <module>
  File "", line 7, in <module>
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/pcdet/datasets/kitti/kitti_object_eval_python/rotate_iou.py>", line 1, in <module>
  File "", line 304, in <module>
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/decorators.py", line 95, in kernel_jit
    return Dispatcher(func, [func_or_sig], targetoptions=targetoptions)
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/compiler.py", line 899, in __init__
    self.compile(sigs[0])
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/compiler.py", line 1102, in compile
    kernel.bind()
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/compiler.py", line 590, in bind
    self._func.get()
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/compiler.py", line 441, in get
    linker.add_ptx(ptx)
  File "/opt/conda/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 2163, in add_ptx
    raise LinkerError("%s\n%s" % (e, self.error_log))
numba.cuda.cudadrv.driver.LinkerError: [222] Call to cuLinkAddData results in UNKNOWN_CUDA_ERROR
ptxas application ptx input, line 9; fatal : Unsupported .version 7.8; current version is '7.4'

PyCUDA ERROR: The context stack was not empty upon module cleanup.

A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
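(Side note on the Docker-as-root warning near the top of the log: the UID/GID mapping it mentions goes into the .tao_mounts.json file it names. A rough sketch of what that entry looks like, with placeholder paths and IDs:)

    {
        "Mounts": [
            {
                "source": "/home/ailab/Project",
                "destination": "/workspace/tao-experiments"
            }
        ],
        "DockerOptions": {
            "user": "1000:1000"
        }
    }

(Replace 1000:1000 with the values reported by id -u and id -g.)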

Can you share the result of nvidia-smi?

Here’s the result of nvidia-smi on the host machine:

Thu Jun  1 19:13:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   29C    P8    24W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   38C    P8    26W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:41:00.0  On |                  N/A |
| 30%   40C    P8    34W / 350W |    241MiB / 24265MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:61:00.0 Off |                  N/A |
| 30%   33C    P8    30W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1908      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1908      G   /usr/lib/xorg/Xorg                  4MiB |
|    2   N/A  N/A      1908      G   /usr/lib/xorg/Xorg                131MiB |
|    2   N/A  N/A      2186      G   /usr/bin/gnome-shell               45MiB |
|    2   N/A  N/A     11172      G   ...RendererForSitePerProcess       19MiB |
|    2   N/A  N/A     11583      G   ...RendererForSitePerProcess       40MiB |
|    3   N/A  N/A      1908      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+

And here’s the result of nvidia-smi inside the TAO PointPillars container:

root@e2db6b22efae:/opt/nvidia/tools# nvidia-smi
Thu Jun  1 10:17:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 30%   29C    P8    23W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:2D:00.0 Off |                  N/A |
| 30%   39C    P8    27W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:41:00.0  On |                  N/A |
| 30%   40C    P8    35W / 350W |    239MiB / 24265MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:61:00.0 Off |                  N/A |
| 30%   33C    P8    30W / 350W |      6MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Please update the NVIDIA driver to 525.
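For context, my reading of the ptxas message above: numba in the container emits PTX ISA 7.8, which corresponds to a CUDA 11.8 toolkit, while the 470.182.03 driver can only JIT PTX up to version 7.4 (CUDA 11.4). A 525-series driver supports CUDA 12.x and accepts the newer PTX. A rough way to check the toolkit side from inside the container (this prints the CUDA version PyTorch was built against; numba’s bundled toolkit may differ slightly, so treat it as an approximation):

    # Inside the TAO container: CUDA toolkit version PyTorch was built against.
    # Compare with the "CUDA Version" reported by nvidia-smi on the host (11.4
    # here); the driver must support at least the toolkit's PTX ISA for JIT
    # compilation to succeed.
    import torch
    print(torch.version.cuda)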

Uninstall:
    sudo apt purge nvidia-driver-470
    sudo apt autoremove
    sudo apt autoclean

Install:
    sudo apt install nvidia-driver-525

Thanks.

It works!

Also, there is another issue when trying to train the network with multiple GPUs.

Here’s the output of tao pointpillars train -e $SPECS_DIR/pointpillars_cm.yaml -r $USER_EXPERIMENT_DIR -k $KEY --gpus 4:

2023-06-02 11:52:37,440 [INFO] root: Registry: ['nvcr.io']
2023-06-02 11:52:37,481 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-pyt
2023-06-02 11:52:37,503 [WARNING] tlt.components.docker_handler.docker_handler: 
Docker will run the commands as root. If you would like to retain your
local host permissions, please add the "user":"UID:GID" in the
DockerOptions portion of the "/home/ailab/.tao_mounts.json" file. You can obtain your
users UID and GID by using the "id -u" and "id -g" commands on the
terminal.
python -m torch.distributed.launch --nproc_per_node=4 /opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py --cfg_file /workspace/tao-experiments/specs/pointpillars_cm.yaml --output_dir /workspace/tao-experiments/pointpillars --key tlt_encode --gpus 4 
/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py:180: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use_env is set by default in torchrun.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See 
https://pytorch.org/docs/stable/distributed.html#launch-utility for 
further instructions

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
        store, rank, world_size = next(rendezvous_iterator)store, rank, world_size = next(rendezvous_iterator)

  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
        store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)

  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
        tcp_store = TCPStore(hostname, port, world_size, False, timeout)tcp_store = TCPStore(hostname, port, world_size, False, timeout)

TimeoutErrorTimeoutError: : The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).

[E socket.cpp:860] [c10d] The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
Traceback (most recent call last):
  File "</opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py>", line 3, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 152, in <module>
  File "<frozen pointcloud.pointpillars.scripts.train>", line 58, in main
  File "<frozen pointcloud.pointpillars.pcdet.utils.common_utils>", line 156, in init_dist_pytorch
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 627, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 173, in _create_c10d_store
    tcp_store = TCPStore(hostname, port, world_size, False, timeout)
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 18888).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 484) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/opt/conda/lib/python3.8/site-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-06-02_03:22:42
  host      : 996ce2b3a184
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 485)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-06-02_03:22:42
  host      : 996ce2b3a184
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 486)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2023-06-02_03:22:42
  host      : 996ce2b3a184
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 487)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-06-02_03:22:42
  host      : 996ce2b3a184
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 484)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
2023-06-02 12:22:43,135 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.
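(For clarity: the FutureWarning about torch.distributed.launch at the top of this log is informational and not the cause of the failure; the actual failure is the c10d TCP rendezvous timeout. The warning just describes the torchrun migration pattern, which looks roughly like this — an illustrative sketch, not TAO code:)

    # torchrun-style pattern from the deprecation warning: the local rank is
    # read from the environment instead of a --local_rank argument.
    import os

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    print("local rank:", local_rank)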

And here’s my pointpillars_cm.yaml:

CLASS_NAMES: ['Car']
DATA_CONFIG: 
    DATASET: 'GeneralPCDataset'
    DATA_PATH: '/workspace/tao-experiments/data'
    DATA_SPLIT: {
        'train': train,
        'test': val
    }
    INFO_PATH: {
        'train': [infos_train.pkl],
        'test': [infos_val.pkl],
    }
    BALANCED_RESAMPLING: False
    POINT_FEATURE_ENCODING: {
        encoding_type: absolute_coordinates_encoding,
        used_feature_list: ['x', 'y', 'z', 'intensity'],
        src_feature_list: ['x', 'y', 'z', 'intensity'],
    }
    POINT_CLOUD_RANGE: [-69.12, -39.68, -3, 69.12, 39.68, 1]
    DATA_AUGMENTOR:
        DISABLE_AUG_LIST: ['placeholder']
        AUG_CONFIG_LIST:
            - NAME: gt_sampling
              DB_INFO_PATH:
                  - dbinfos_train.pkl
              PREPARE: {
                 filter_by_min_points: ['Car:5'] #, 'Pedestrian:5', 'Cyclist:5'],
              }
              SAMPLE_GROUPS: ['Car:15'] #,'Pedestrian:15', 'Cyclist:15']
              NUM_POINT_FEATURES: 4
              DATABASE_WITH_FAKELIDAR: False
              REMOVE_EXTRA_WIDTH: [0.0, 0.0, 0.0]
              LIMIT_WHOLE_SCENE: False
            - NAME: random_world_flip
              ALONG_AXIS_LIST: ['x']
            - NAME: random_world_rotation
              WORLD_ROT_ANGLE: [-0.78539816, 0.78539816]
            - NAME: random_world_scaling
              WORLD_SCALE_RANGE: [0.95, 1.05]
    DATA_PROCESSOR:
        - NAME: mask_points_and_boxes_outside_range
          REMOVE_OUTSIDE_BOXES: True
        - NAME: shuffle_points
          SHUFFLE_ENABLED: {
              'train': True,
              'test': False
          }
        - NAME: transform_points_to_voxels
          VOXEL_SIZE: [0.16, 0.16, 4]
          MAX_POINTS_PER_VOXEL: 32
          MAX_NUMBER_OF_VOXELS: {
              'train': 16000,
              'test': 10000
          }
    NUM_WORKERS: 4

MODEL:
    NAME: PointPillar
    VFE:
        NAME: PillarVFE
        WITH_DISTANCE: False
        USE_ABSLOTE_XYZ: True
        USE_NORM: True
        NUM_FILTERS: [64]
    MAP_TO_BEV:
        NAME: PointPillarScatter
        NUM_BEV_FEATURES: 64
    BACKBONE_2D:
        NAME: BaseBEVBackbone
        LAYER_NUMS: [3, 5, 5]
        LAYER_STRIDES: [2, 2, 2]
        NUM_FILTERS: [64, 128, 256]
        UPSAMPLE_STRIDES: [1, 2, 4]
        NUM_UPSAMPLE_FILTERS: [128, 128, 128]
    DENSE_HEAD:
        NAME: AnchorHeadSingle
        CLASS_AGNOSTIC: False
        USE_DIRECTION_CLASSIFIER: True
        DIR_OFFSET: 0.78539
        DIR_LIMIT_OFFSET: 0.0
        NUM_DIR_BINS: 2
        ANCHOR_GENERATOR_CONFIG: [
            {
                'class_name': 'Car',
                'anchor_sizes': [[4.64, 1.90, 1.38]],
                'anchor_rotations': [-1.57, 1.57],
                'anchor_bottom_heights': [-1.78],
                'align_center': False,
                'feature_map_stride': 2,
                'matched_threshold': 0.6,
                'unmatched_threshold': 0.45
            }
            # {
            #     'class_name': 'Pedestrian',
            #     'anchor_sizes': [[0.8, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # },
            # {
            #     'class_name': 'Cyclist',
            #     'anchor_sizes': [[1.76, 0.6, 1.73]],
            #     'anchor_rotations': [0, 1.57],
            #     'anchor_bottom_heights': [-0.6],
            #     'align_center': False,
            #     'feature_map_stride': 2,
            #     'matched_threshold': 0.5,
            #     'unmatched_threshold': 0.35
            # }
        ]
        TARGET_ASSIGNER_CONFIG:
            NAME: AxisAlignedTargetAssigner
            POS_FRACTION: -1.0
            SAMPLE_SIZE: 512
            NORM_BY_NUM_EXAMPLES: False
            MATCH_HEIGHT: False
            BOX_CODER: ResidualCoder
        LOSS_CONFIG:
            LOSS_WEIGHTS: {
                'cls_weight': 1.0,
                'loc_weight': 2.0,
                'dir_weight': 0.2,
                'code_weights': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
            }
    POST_PROCESSING:
        RECALL_THRESH_LIST: [0.3, 0.5, 0.7]
        SCORE_THRESH: 0.1
        OUTPUT_RAW_SCORE: False
        EVAL_METRIC: kitti
        NMS_CONFIG:
            MULTI_CLASSES_NMS: False
            NMS_TYPE: nms_gpu
            NMS_THRESH: 0.01
            NMS_PRE_MAXSIZE: 4096
            NMS_POST_MAXSIZE: 500
    SYNC_BN: False

OPTIMIZATION:
    BATCH_SIZE_PER_GPU: 8
    NUM_EPOCHS: 40
    OPTIMIZER: adam_onecycle
    LR: 0.003
    WEIGHT_DECAY: 0.01
    MOMENTUM: 0.9
    MOMS: [0.95, 0.85]
    PCT_START: 0.4
    DIV_FACTOR: 10
    DECAY_STEP_LIST: [35, 45]
    LR_DECAY: 0.1
    LR_CLIP: 0.0000001
    LR_WARMUP: False
    WARMUP_EPOCH: 1
    GRAD_NORM_CLIP: 10
    RESUME_MODEL_PATH: null 
    PRETRAINED_MODEL_PATH: null
    PRUNED_MODEL_PATH: null
    TCP_PORT: 18888
    RANDOM_SEED: null
    CKPT_INTERVAL: 1
    MAX_CKPT_SAVE_NUM: 30
    MERGE_ALL_ITERS_TO_ONE_EPOCH: False

EVALUATION:
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"

INFERENCE:
    MAX_POINTS_NUM: 25000
    BATCH_SIZE: 1
    CKPT: "/workspace/tao-experiments/pointpillars/ckpt/checkpoint_epoch_40.tlt"
    VIS_CONF_THRESH: 0.1
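
Since the spec pins TCP_PORT: 18888 and every rank times out connecting to 127.0.0.1:18888, one quick sanity check (an assumption on my part, not a confirmed diagnosis) is whether something on the host is already bound to that port — a minimal sketch:

    import socket

    # connect_ex returns 0 if something is already listening on the port;
    # a stale process holding 18888 would block the c10d rendezvous.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        in_use = s.connect_ex(("127.0.0.1", 18888)) == 0

    print("port 18888 already in use:", in_use)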

For the new error, please create a new forum topic. Thanks.
