Stack Error on TAO Toolkit

alper.ekmekci · April 25, 2022, 2:56pm

• WSL 2 Ubuntu (20.04)
• CUDA 11.6
• Hardware (Asus TUF Dash 15/GeForce RTX 3060)
• Network Type (Yolo_v3 from cv_samples)
• TAO Version (3.22.02)
• NCCL Version (2.12.10)

I’m following the “yolo_v3” notebook from the CV samples on the NVIDIA TAO tutorials page. At the end of the first epoch, the following error appears:

self._traceback = tf_stack.extract_stack()

tao_stack_log.txt (62.9 KB)

In the log, the TAO Toolkit, CUDA, and NCCL versions are not detected properly. My setup consists of CUDA 11.6, NCCL 2.12, and TAO Toolkit 3.22.02, but the toolkit logs NCCL version 2.9.9+cuda11.3 and the image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3. Even though I updated NCCL manually, the toolkit still does not see the correct version.
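
To compare the versions reported in the log against the ones installed on the host, the NCCL banner can be pulled out of the log text. This is only a rough sketch: it assumes the banner has the form quoted above ("NCCL version 2.9.9+cuda11.3"), and the function name find_nccl_versions is hypothetical, not part of TAO or NCCL.

```python
import re

# Assumed banner format, taken from the line quoted in the post:
# "NCCL version 2.9.9+cuda11.3" (the "+cuda..." suffix is treated as optional).
NCCL_BANNER = re.compile(r"NCCL version (\d+)\.(\d+)\.(\d+)(?:\+cuda(\d+\.\d+))?")


def find_nccl_versions(log_text):
    """Return the distinct (nccl_version, cuda_version) pairs found in log_text.

    cuda_version is None when the banner carries no "+cuda" suffix.
    """
    found = set()
    for match in NCCL_BANNER.finditer(log_text):
        nccl_version = ".".join(match.group(1, 2, 3))
        found.add((nccl_version, match.group(4)))
    # Sort for a stable result; None is mapped to "" so the key is comparable.
    return sorted(found, key=lambda pair: (pair[0], pair[1] or ""))


if __name__ == "__main__":
    # Hypothetical usage with the banner quoted in the post.
    print(find_nccl_versions("NCCL version 2.9.9+cuda11.3"))
```

If the pair found in the log differs from what is installed on the host, the run is most likely using the NCCL bundled in the container image rather than the host copy.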

Besides that, there are always two root errors, both of which read Unknown: ncclCommInitRank failed: unhandled system error. I’ve uploaded the log file to make the problem easier to understand. Any help would be appreciated. Thanks for your advice.

Best
Alper

Morganh · April 25, 2022, 3:05pm

Refer to WSL2 & TAO issues - #16 by Morganh

alper.ekmekci · April 25, 2022, 4:22pm

I upgraded NCCL inside the Docker container, and the version looks correct when I check it inside that container. However, when I run yolo_v3.ipynb inside VSCode (WSL2), the same error occurs.

UPDATE: Executing the train command inside the yolo_v3 Docker container works. Is there a definitive solution to this issue?

Morganh · April 27, 2022, 6:36am

Yes, the next TAO release will include a newer NCCL version.

system · May 11, 2022, 6:36am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.