Stack Error on TAO Toolkit

• WSL 2 Ubuntu (20.04)
• CUDA 11.6
• Hardware (Asus TUF Dash 15/GeForce RTX 3060)
• Network Type (Yolo_v3 from cv_samples)
• TAO Version (3.22.02)
• NCCL Version (2.12.10)

I’m following the “yolo_v3” notebook from the CV samples on the NVIDIA TAO tutorials page. At the end of the first epoch, the following error appears:

self._traceback = tf_stack.extract_stack()
tao_stack_log.txt (62.9 KB)

In the log, the TAO Toolkit, CUDA, and NCCL versions are not detected correctly. My setup has CUDA 11.6, NCCL 2.12, and TAO Toolkit 3.22.02, but the toolkit logs NCCL version 2.9.9+cuda11.3 and the image nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3. Even after I updated NCCL manually, it still does not pick up the correct version.
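To make the mismatch easier to spot, here is a minimal sketch that pulls the NCCL, CUDA, and TAO image versions out of a log and compares them with the versions listed at the top of this post. The regular expressions assume the log contains strings shaped like the two quoted above; the function names `detect_versions` and `mismatches` are made up for this example and are not part of TAO.

```python
import re

# Versions reported as installed in the post (not read from the system).
EXPECTED = {"nccl": "2.12", "cuda": "11.6", "tao": "3.22.02"}

# Assumed log shapes, based only on the strings quoted in the post.
NCCL_RE = re.compile(r"NCCL version (\d+(?:\.\d+)*)\+cuda(\d+(?:\.\d+)*)")
IMAGE_RE = re.compile(r"nvcr\.io/nvidia/tao/tao-toolkit-tf:v(\d+(?:\.\d+)*)")


def detect_versions(log_text):
    """Return the NCCL, CUDA and TAO image versions mentioned in log_text."""
    found = {}
    match = NCCL_RE.search(log_text)
    if match:
        found["nccl"], found["cuda"] = match.group(1), match.group(2)
    match = IMAGE_RE.search(log_text)
    if match:
        found["tao"] = match.group(1)
    return found


def _same_release(found, wanted):
    """True if the detected version starts with the wanted components."""
    got = [int(part) for part in found.split(".")]
    want = [int(part) for part in wanted.split(".")]
    return got[: len(want)] == want


def mismatches(found, expected=EXPECTED):
    """Return the sorted names of components whose detected version differs."""
    return sorted(
        name
        for name, wanted in expected.items()
        if name in found and not _same_release(found[name], wanted)
    )
```

With the attached log saved locally, `mismatches(detect_versions(open("tao_stack_log.txt").read()))` would list the components the container reports differently from the host setup.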

Besides that, there are always two root errors, both of which are Unknown: ncclCommInitRank failed: unhandled system error. I’ve uploaded the log file to help with diagnosing the problem. Any help would be appreciated. Thanks for your advice.

Best
Alper

Refer to WSL2 & TAO issues - #16 by Morganh

I upgraded NCCL inside the Docker container, and the version looks correct when I check it inside that container. However, when I run yolo_v3.ipynb in VSCode (WSL2), the same error occurs.
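For reference, one way to check which NCCL library a process actually loads is to call `ncclGetVersion` through `ctypes`. This is only a sketch: it assumes the shared library is visible to the loader as `libnccl.so.2` inside the container, and the helper names are invented for this example. NCCL encodes versions from 2.9 onward as major*10000 + minor*100 + patch, and older ones as major*1000 + minor*100 + patch.

```python
import ctypes


def decode_nccl_version(code):
    """Convert NCCL's integer version code into (major, minor, patch)."""
    if code < 10000:
        # Encoding used by releases before 2.9, e.g. 2803 -> 2.8.3.
        return code // 1000, (code % 1000) // 100, code % 100
    # Encoding used from 2.9 onward, e.g. 21210 -> 2.12.10.
    return code // 10000, (code % 10000) // 100, code % 100


def query_nccl_version(lib_name="libnccl.so.2"):
    """Ask the loaded NCCL library for its version.

    lib_name is an assumption; raises OSError if the library is not found.
    """
    lib = ctypes.CDLL(lib_name)
    version = ctypes.c_int(0)
    status = lib.ncclGetVersion(ctypes.byref(version))
    if status != 0:  # 0 is ncclSuccess
        raise RuntimeError(f"ncclGetVersion failed with status {status}")
    return decode_nccl_version(version.value)
```

Calling `query_nccl_version()` both in a shell inside the container and from a notebook cell would show whether the two environments load the same library.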

UPDATE: Running the train command inside the yolo_v3 Docker container works. Is there a definitive fix for this issue?

Yes, the next TAO release will include a newer NCCL.
