Training of BEVFusion model failed on nuScenes dataset

Hello, I am trying to use the Jupyter notebook associated with the BEVFusion model on the nuScenes dataset. I have changed the training spec file, which I have attached, but I get the following error when I try to run training.

Error executing job with overrides: ['train.pretrained_checkpoint=/workspace/tao-experiments/bevfusion/bevfusion_v1.0/tao3d_bevfusion_epoch4.pth', 'results_dir=/results']
Traceback (most recent call last):

/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/bevfusion/scripts/train.py FAILED

bev_nu.txt (1.6 KB)

Any tips on what the issue could be?

Please share the full log. Thanks.

Here is the log:
Error executing job with overrides: ['train.pretrained_checkpoint=/workspace/tao-experiments/bevfusion/bevfusion_v1.0/tao3d_bevfusion_epoch4.pth', 'results_dir=/results']
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 69, in _func
raise e
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/decorators/workflow.py", line 48, in _func
runner(cfg, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/bevfusion/scripts/train.py", line 62, in main
run_experiment(experiment_config=cfg)
File "/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/bevfusion/scripts/train.py", line 44, in run_experiment
runner.train()
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1728, in train
self._train_loop = self.build_train_loop(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1527, in build_train_loop
loop = EpochBasedTrainLoop(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/loops.py", line 44, in __init__
super().__init__(runner, dataloader)
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/base_loop.py", line 26, in __init__
self.dataloader = runner.build_dataloader(
File "/usr/local/lib/python3.10/dist-packages/mmengine/runner/runner.py", line 1370, in build_dataloader
dataset = DATASETS.build(dataset_cfg)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmengine/dataset/dataset_wrapper.py", line 223, in __init__
self.dataset = DATASETS.build(dataset)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/registry.py", line 570, in build
return self.build_func(cfg, *args, **kwargs, registry=self)
File "/usr/local/lib/python3.10/dist-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
obj = obj_cls(**args)  # type: ignore
File "/usr/local/lib/python3.10/dist-packages/mmdet3d/datasets/nuscenes_dataset.py", line 102, in __init__
super().__init__(
File "/usr/local/lib/python3.10/dist-packages/mmdet3d/datasets/det3d_dataset.py", line 126, in __init__
super().__init__(
TypeError: BaseDataset.__init__() got an unexpected keyword argument 'origin'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
E1009 14:47:16.990000 129253466547328 torch/distributed/elastic/multiprocessing/api.py:881] failed (exitcode: 1) local_rank: 0 (pid: 420) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/bevfusion/scripts/train.py FAILED

The error is shown above: BaseDataset.__init__() is receiving an unexpected 'origin' keyword argument from the nuScenes dataset config. You can add debug code to check further.
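For example, here is a minimal sketch you could run in the container's Python interpreter to confirm which keyword arguments the dataset classes from the traceback actually accept:

import inspect

from mmengine.dataset import BaseDataset
from mmdet3d.datasets import NuScenesDataset

# Print the accepted parameters of each __init__; 'origin' should not
# appear for BaseDataset, which is what triggers the TypeError above.
print(inspect.signature(BaseDataset.__init__))
print(inspect.signature(NuScenesDataset.__init__))

If 'origin' is not listed, the next step is to find where your spec (or the TAO dataset wrapper) injects that key.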

You can open a terminal and run the command below.
$docker run --runtime=nvidia -it --rm nvcr.io/nvidia/tao/tao-toolkit:5.5.0-pyt /bin/bash

Then, inside the docker container, you can run training.
#bevfusion train xxx
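For reference, a typical TAO entrypoint invocation looks like the following (this assumes the standard TAO -e flag for the experiment spec; the spec path is a placeholder for your own file):
# bevfusion train -e /path/to/your_experiment_spec.yaml results_dir=/results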

All the code can be found under
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/
/usr/local/lib/python3.10/dist-packages/mmdet3d/
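For instance, to locate where the unexpected 'origin' key comes from, you could grep those trees directly (adjust the pattern as needed):
$ grep -rn "origin" /usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/cv/bevfusion/
$ grep -rn "origin" /usr/local/lib/python3.10/dist-packages/mmdet3d/datasets/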

Also, please run the default notebook first to get familiar with the steps.
Then you can leverage the default KITTI dataset training and check whether there is a gap in format/spec/etc. between KITTI and nuScenes, as in the sketch below.
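For example, a minimal sketch for diffing the dataset sections of the two specs (the file names and the 'dataset' section key are assumptions; adjust them to match your actual spec files):

import yaml  # PyYAML

# Hypothetical paths: the default KITTI spec from the notebook and your nuScenes spec.
with open("train_kitti.yaml") as f:
    kitti = yaml.safe_load(f)
with open("train_nuscenes.yaml") as f:
    nusc = yaml.safe_load(f)

# Keys present in one dataset section but not the other (e.g. 'origin')
# are the first candidates for the TypeError above.
kitti_keys = set(kitti.get("dataset", {}))
nusc_keys = set(nusc.get("dataset", {}))
print("only in nuScenes spec:", sorted(nusc_keys - kitti_keys))
print("only in KITTI spec:", sorted(kitti_keys - nusc_keys))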