Distributing a machine learning model on Jetson TX2 / AGX Xavier

I am trying to distribute deep learning models across two NVIDIA Jetson devices (a TX2 and an AGX Xavier) and two other GPUs (GeForce GTX 1070, Tesla 2075) for experimental purposes.

My experience of distributing a TensorFlow-GPU model across these four heterogeneous GPU devices, connected in a LAN configuration, has been neither easy nor successful so far, possibly due to configuration and communication issues.

Following web tutorials on distributed TensorFlow-GPU with asynchronous gradient descent, a simple MNIST model halts as soon as the third worker tries to synchronize its gradients with the chief worker over gRPC.
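For concreteness, this is essentially the between-graph replication pattern I followed (a condensed sketch of the tutorial code, not my exact script; the IP addresses are placeholders for my LAN nodes):

```python
# Condensed TF 1.x between-graph replication sketch; IPs are placeholders.
import argparse
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

parser = argparse.ArgumentParser()
parser.add_argument("--job_name", choices=["ps", "worker"])
parser.add_argument("--task_index", type=int, default=0)
args = parser.parse_args()

# One parameter server plus three workers spread over the LAN.
cluster = tf.train.ClusterSpec({
    "ps":     ["192.168.1.10:2222"],
    "worker": ["192.168.1.11:2222",    # e.g. GeForce node
               "192.168.1.12:2222",    # e.g. Tesla node
               "192.168.1.13:2222"],   # e.g. Jetson node
})
server = tf.train.Server(cluster, job_name=args.job_name,
                         task_index=args.task_index)

if args.job_name == "ps":
    server.join()  # parameter server just hosts the variables
else:
    # Pin variables to the ps task and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % args.task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 784])
        y = tf.placeholder(tf.float32, [None, 10])
        logits = tf.layers.dense(x, 10)
        loss = tf.reduce_mean(
            tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
        global_step = tf.train.get_or_create_global_step()
        # No sync wrapper, so each worker applies its gradients asynchronously.
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    mnist = input_data.read_data_sets("/tmp/mnist", one_hot=True)
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(args.task_index == 0)) as sess:
        while not sess.should_stop():
            bx, by = mnist.train.next_batch(64)
            sess.run(train_op, feed_dict={x: bx, y: by})
```

Each node runs the same script with its own --job_name/--task_index; in my runs it is the third worker joining that makes everything stall.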

I am looking for either a more suitable communication library to tie the GPUs together, or to shift to PyTorch or some other framework altogether.

So, my question is:
Does the AGX Xavier support any communication mechanism for such a heterogeneous GPU setup, other than TensorFlow's gRPC (e.g. NCCL, OpenMPI, Gloo, …)?

OR
If anybody has experience with this kind of heterogeneous distribution, please suggest what I have to do to get it running, given the following experimental configuration.
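(Regarding the Gloo option mentioned above: if I do end up switching frameworks, the sketch below is roughly what I have in mind. It uses PyTorch's torch.distributed with the Gloo backend, which runs over plain TCP; as far as I know, NCCL is not shipped for Jetson iGPUs. PyTorch >= 1.0 API; the master address, ranks, and data are placeholders.)

```python
# Hedged sketch of synchronous allreduce training with the Gloo backend.
import os
import torch
import torch.distributed as dist
import torch.nn as nn

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "192.168.1.10"   # placeholder: rank-0 node
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    model = nn.Linear(784, 10)                   # toy MNIST-sized model
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 784)                 # dummy batch
        y = torch.randint(0, 10, (32,))
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        # Average gradients across all nodes before stepping.
        for p in model.parameters():
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size
        opt.step()
```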

Configurations

Python: 3.5
OS: Ubuntu 16.04
CUDA: 10.0 (GeForce/TX2), 7.5 (Tesla)
TensorFlow-GPU: 1.9 (GeForce/Tesla), 1.10 (Xavier)
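Given the version skew above (TF 1.9 vs 1.10, CUDA 10.0 vs 7.5), this is the quick sanity check I run on each node before starting the cluster; mixing TF releases across workers is one plausible source of the gRPC hang:

```python
# Per-node sanity check for the version skew listed above.
import tensorflow as tf
print("TF version:     ", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())
print("GPU available:  ", tf.test.is_gpu_available())
```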

Hi,

This issue may be better addressed on the Xavier platform.

Xavier has NVLink to connect to other iGPU platforms (e.g. Jetson).
You can also link dGPUs (e.g. Tesla cards) with NVSwitch.

Check this introduction for more information:
https://www.nvidia.com/en-us/data-center/nvlink/

Thanks.

Thank you, but for my use case I intend to distribute the GPU nodes across a LAN/WAN.
Therefore, NVLink does not seem to be an option for me.
In fact, I am looking for a kind of extended version of this cluster: http://selkie-macalester.org/csinparallel/modules/RosieCluster/build/html/ (the author has done a splendid job, but I have some queries before I can implement it and take it further).

Hi,

Maybe this also can help:
https://github.com/uber/horovod
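For reference, the core Horovod pattern with TF 1.x looks roughly like this (a sketch based on Horovod's documented API, not a configuration tested on this exact hardware):

```python
# Minimal Horovod + TF 1.x ring-allreduce sketch.
import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU, launched via mpirun

# Pin each process to its local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

x = tf.placeholder(tf.float32, [None, 784])
y = tf.placeholder(tf.float32, [None, 10])
logits = tf.layers.dense(x, 10)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))

# Wrap the optimizer so gradients are averaged via allreduce,
# with no parameter server or chief worker involved.
opt = tf.train.GradientDescentOptimizer(0.01 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Make all workers start from rank 0's initial weights.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        bx = np.random.rand(32, 784).astype("float32")   # dummy batch
        by = np.eye(10)[np.random.randint(0, 10, 32)].astype("float32")
        sess.run(train_op, feed_dict={x: bx, y: by})
```

It is launched with one process per node, e.g. mpirun -np 4 -H nodeA:1,nodeB:1,nodeC:1,nodeD:1 python train_hvd.py (hostnames hypothetical); Horovod rides on MPI and uses NCCL where it is available, falling back to plain MPI otherwise.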

Thanks.