CUDA out of memory. Tried to allocate 314.00 MiB. GPU Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace

bozdagosman89 · January 8, 2025, 9:41am

Please provide the following information when requesting support.

• Hardware (GTX1650)
• Network Type pointpillars
• TLT Version
Configuration of the TAO Toolkit Instance

task_group:
model:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-pyt:
docker_registry: nvcr.io
tasks:
1. action_recognition
2. centerpose
3. visual_changenet
4. deformable_detr
5. dino
6. grounding_dino
7. mask_grounding_dino
8. mask2former
9. mal
10. ml_recog
11. ocdnet
12. ocrnet
13. optical_inspection
14. pointpillars
15. pose_classification
16. re_identification
17. classification_pyt
18. segformer
19. bevfusion
5.0.0-tf1.15.5:
docker_registry: nvcr.io
tasks:
1. bpnet
2. classification_tf1
3. converter
4. detectnet_v2
5. dssd
6. efficientdet_tf1
7. faster_rcnn
8. fpenet
9. lprnet
10. mask_rcnn
11. multitask_classification
12. retinanet
13. ssd
14. unet
15. yolo_v3
16. yolo_v4
17. yolo_v4_tiny
5.5.0-tf2:
docker_registry: nvcr.io
tasks:
1. classification_tf2
2. efficientdet_tf2
dataset:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-data-services:
docker_registry: nvcr.io
tasks:
1. augmentation
2. auto_label
3. annotations
4. analytics
deploy:
dockers:
nvidia/tao/tao-toolkit:
5.5.0-deploy:
docker_registry: nvcr.io
tasks:
1. visual_changenet
2. centerpose
3. classification_pyt
4. classification_tf1
5. classification_tf2
6. deformable_detr
7. detectnet_v2
8. dino
9. dssd
10. efficientdet_tf1
11. efficientdet_tf2
12. faster_rcnn
13. grounding_dino
14. mask_grounding_dino
15. mask2former
16. lprnet
17. mask_rcnn
18. ml_recog
19. multitask_classification
20. ocdnet
21. ocrnet
22. optical_inspection
23. retinanet
24. segformer
25. ssd
26. trtexec
27. unet
28. yolo_v3
29. yolo_v4
30. yolo_v4_tiny
format_version: 3.0
toolkit_version: 5.5.0
published_date: 08/26/2024

I just run the tutorial for pointpillar and i get this issue how can i fix it

pointpillars.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/hydra/hydra_runner.py:107: UserWarning:
‘pointpillars.yaml’ is validated against ConfigStore schema with the same name.
This behavior is deprecated in Hydra 1.1 and will be removed in Hydra 1.2.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/automatic_schema_matching for migration instructions.
_run_hydra(
/usr/local/lib/python3.10/dist-packages/hydra/_internal/hydra.py:119: UserWarning: Future Hydra versions will no longer change working directory at job runtime by default.
See https://hydra.cc/docs/next/upgrades/1.1_to_1.2/changes_to_job_working_dir/ for more information.
ret = run_job(
/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/core/loggers/api_logging.py:236: UserWarning: Log file already exists at /workspace/tao-experiments/pointpillars/status.json
rank_zero_warn(
Start logging
CUDA_VISIBLE_DEVICES=0
Database filter by min points Car: 14357 => 13461
Database filter by min points Pedestrian: 2207 => 2161
Database filter by min points Cyclist: 734 => 700
Loading point cloud dataset
Total samples for point cloud dataset: 3712
/usr/local/lib/python3.10/dist-packages/torch/functional.py:512: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/native/TensorShape.cpp:3553.)
Start trainingr-defined]

epochs: 0%| | 0/80 [00:00<?, ?it/s]
Traceback (most recent call last):: [‘results_dir=/workspace/tao-experiments/pointpillars’, ‘dataset.data_info_path=/workspace/tao-experiments/pointpillars/data_info’, ‘key=tlt_encode’]
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/pointcloud/pointpillars/scripts/train.py”, line 135, in main
train_model(
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/pointcloud/pointpillars/tools/train_utils/train_utils.py”, line 126, in train_model
accumulated_iter = train_one_epoch(
File “/usr/local/lib/python3.10/dist-packages/nvidia_tao_pytorch/pointcloud/pointpillars/tools/train_utils/train_utils.py”, line 62, in train_one_epoch
loss.backward()
File “/usr/local/lib/python3.10/dist-packages/torch/_tensor.py”, line 525, in backward
torch.autograd.backward(
File “/usr/local/lib/python3.10/dist-packages/torch/autograd/init.py”, line 267, in backward
_engine_run_backward(
File “/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py”, line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 314.00 MiB. GPU

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

                                          [2025-01-08 09:01:24,507 - TAO Toolkit - root - INFO] Sending telemetry data.

[2025-01-08 09:01:24,507 - TAO Toolkit - root - INFO] ================> Start Reporting Telemetry <================
[2025-01-08 09:01:24,507 - TAO Toolkit - root - INFO] Sending {‘version’: ‘5.5.0’, ‘action’: ‘train’, ‘network’: ‘pointpillars’, ‘gpu’: [‘NVIDIA-GeForce-GTX-1650’], ‘success’: True, ‘time_lapsed’: 11} to https://api.tao.ngc.nvidia.com.
[2025-01-08 09:01:25,855 - TAO Toolkit - root - INFO] Telemetry sent successfully.
[2025-01-08 09:01:25,855 - TAO Toolkit - root - INFO] ================> End Reporting Telemetry <================
[2025-01-08 09:01:25,855 - TAO Toolkit - root - INFO] Execution status: PASS
2025-01-08 12:01:26,630 [TAO Toolkit] [INFO] nvidia_tao_cli.components.docker_handler.docker_handler 363: Stopping container.

Morganh · January 9, 2025, 2:48am

It is due to out of GPU memory. Please change GTX1650 to others. Suggest to use 12G memory or above.

bozdagosman89 · January 9, 2025, 6:29am

Yes when i decrease the batch size to 2 it worked thanks

system · January 23, 2025, 6:29am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm) TAO Toolkit	4	21	January 31, 2025
Grounding dino : out of memory TAO Toolkit	6	20	January 22, 2025
OCDNet Tao Model Zoo TAO Toolkit jetson	7	37	October 22, 2024
Error in TAO-Toolkit while training TAO Toolkit	15	1502	July 6, 2022
Error when evaluate PointPillar network TAO Toolkit	6	733	June 4, 2023
Tao toolkit version5 is getting error when comes to training part TAO Toolkit	45	1701	August 22, 2023
Classification_pyt error TAO Toolkit jetson	16	79	September 18, 2024
Train Pointpillar with Multi-GPU TAO Toolkit tao	11	2456	August 29, 2023
TAO not running when using multiple GPUs TAO Toolkit	12	34	August 17, 2024
Optical Inspection Deploy TAO Toolkit	7	658	September 22, 2023

CUDA out of memory. Tried to allocate 314.00 MiB. GPU Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace

Related topics