Is there an R35 NGC container available with torch distributed?

Do any of the R35 PyTorch images have torch distributed enabled? I am currently using the R34 image that Dusty NV pointed someone to earlier, but I would like to be on R35.

Thanks,

bb

Hi @brbl, I believe that nvcr.io/nvidia/l4t-pytorch:r35.1.0-pth1.11-py3 should have distributed enabled. Newer containers do not have it enabled.
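If it helps, a quick way to check a given container is to ask PyTorch directly from a Python session inside it. The snippet below is only a minimal sketch: the function name `distributed_status` is made up for illustration, and it relies on `torch.distributed.is_available()` and the per-backend `is_*_available()` helpers reporting what the installed wheel was built with.

```python
def distributed_status():
    """Report whether the installed PyTorch build includes torch.distributed.

    Hypothetical helper for illustration; returns a plain dict so it can be
    printed or inspected from a shell inside the container.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return {"torch_installed": False}

    import torch.distributed as dist

    status = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "distributed_available": dist.is_available(),
    }
    # The backend checks are only meaningful when distributed was compiled in.
    if status["distributed_available"]:
        status["mpi"] = dist.is_mpi_available()
        status["nccl"] = dist.is_nccl_available()
        status["gloo"] = dist.is_gloo_available()
    return status


if __name__ == "__main__":
    print(distributed_status())
```

If `distributed_available` comes back `False`, the wheel in that container was built without distributed support.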

Thanks, and right as usual! I thought I checked that version and didn’t see it, but I rechecked, and you were right.

You have been super helpful, so no need to reply, but if you had time, I am curious (and it is curiosity and not a criticism) why distributed isn’t a standard part of the package?

Thanks again,

bb

The earlier PyTorch wheels that had distributed enabled were ones I built personally. The newer official wheels are built by a team at NVIDIA with formal testing/QA release procedures, and MPI isn’t included in those (understandably so, since it isn’t a very common use case on an embedded platform). Overall it’s a net gain in build quality, but sorry for the inconvenience. You can always rebuild the PyTorch wheel(s) with distributed enabled and install them in the container (or rebuild the container) if you really need it.

I hate to be such a needy guy, but the “CUDA error: no kernel image is available for execution on the device” has cropped up again with the latest containers when using torchvision.ops.nms(). Is it possible that your patch didn’t get into the new official wheels?

Sorry.

Ah right, that’s still the same container with that issue - sorry about that (it will be fixed in the JetPack 5.1 version of l4t-pytorch). Try using this one instead: dustynv/l4t-pytorch:r35.1.0-pth1.11-py3 (that one has the torchvision bug patched)
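To confirm whether a particular container is affected, a small smoke test along these lines can be run inside it. This is a sketch under assumptions rather than an official check: the function name `nms_cuda_smoke_test` and the sample boxes are invented for illustration, and it simply calls `torchvision.ops.nms()` on CUDA tensors and reports whether a `RuntimeError` (such as the "no kernel image" one) is raised.

```python
def nms_cuda_smoke_test():
    """Run torchvision.ops.nms() on the GPU and report the outcome as a string.

    Hypothetical helper for illustration. The result starts with "ok",
    "failed", or "skipped" so it is easy to eyeball or grep.
    """
    try:
        import torch
        from torchvision.ops import nms
    except ImportError as exc:
        return f"skipped: {exc}"

    if not torch.cuda.is_available():
        return "skipped: CUDA not available"

    try:
        # Boxes are (x1, y1, x2, y2); the first two overlap heavily.
        boxes = torch.tensor(
            [[0.0, 0.0, 10.0, 10.0],
             [1.0, 1.0, 11.0, 11.0],
             [50.0, 50.0, 60.0, 60.0]],
            device="cuda",
        )
        scores = torch.tensor([0.9, 0.8, 0.7], device="cuda")
        keep = nms(boxes, scores, iou_threshold=0.5)
        # Force any asynchronous CUDA error to surface here.
        torch.cuda.synchronize()
    except RuntimeError as exc:
        return f"failed: {exc}"

    return f"ok: kept indices {keep.tolist()}"


if __name__ == "__main__":
    print(nms_cuda_smoke_test())
```

A result starting with `failed:` and mentioning "no kernel image is available" would suggest the torchvision build in that container does not include kernels for the device.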