Jetson TX2 or TX1 clustering?


I was asked an interesting little question. :)
In general, clustering several boards (e.g., Raspberry Pis) and using tools such as MPI or OpenMP improves performance.
Is there a way to cluster multiple Jetson TX1s or TX2s to achieve the same effect as NVIDIA's SLI?

For example, suppose you have two TX1 boards. Can you make them behave as if a single GPU were running?

In other words, I want to know whether there is a way to combine two TX1s (256 CUDA cores each) into a single 512-CUDA-core GPU with shared memory.

No, you cannot make “one big GPU” in the way SLI works. SLI depends on the GPUs sharing the same host CPU RAM, as well as the special SLI link.
You can network multiple TX2 units and use the work-distribution mechanism of your choice to farm out jobs. However, this is unlikely to give you any power or cost benefit over just running your work on "big" hardware. Nothing beats a Core i9 with a couple of GTX 1080 Ti cards in bang for the buck (unless you need the highly specialized tensor processors in the high-end data center GPUs).
Where the TX2 fits is where you need one of them for a particular, lightweight, embedded application.
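To make the "network them and distribute work" suggestion concrete, here is a minimal per-node sketch. The assumption (mine, not from the thread) is that every Jetson in the cluster runs the same binary with a rank and node count passed on the command line; launching the workers and gathering their results over the network is left to whatever tooling you use (MPI, ssh scripts, etc.). All names and sizes here are illustrative.

```cuda
// Hypothetical per-node worker for a Jetson cluster. Each board
// processes its own slice of a larger problem; there is no shared
// GPU memory across boards, only network coordination.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

int main(int argc, char **argv) {
    // In practice the rank/node count would come from your launcher
    // (e.g. mpirun); here they are plain command-line arguments.
    int rank   = argc > 1 ? atoi(argv[1]) : 0;
    int nNodes = argc > 2 ? atoi(argv[2]) : 1;

    const int total = 1 << 20;          // whole problem size
    const int chunk = total / nNodes;   // this node's share
    const int first = rank * chunk;     // global offset, for bookkeeping

    // Allocate and fill this node's chunk on the device.
    float *d_data;
    cudaMalloc(&d_data, chunk * sizeof(float));
    float *h_data = (float *)malloc(chunk * sizeof(float));
    for (int i = 0; i < chunk; ++i) h_data[i] = 1.0f;
    cudaMemcpy(d_data, h_data, chunk * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(chunk + 255) / 256, 256>>>(d_data, chunk, 2.0f);
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, chunk * sizeof(float), cudaMemcpyDeviceToHost);
    printf("node %d processed elements [%d, %d)\n", rank, first, first + chunk);

    cudaFree(d_data);
    free(h_data);
    return 0;
}
```

Each board only ever sees its own 256 CUDA cores and its own memory; the "cluster" exists purely at the job-distribution level, which is exactly why this is not equivalent to SLI or to one 512-core GPU.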

Thanks snarky, that clearly solves the problem!


You can certainly cluster them in the traditional way, as snarky suggested, and some apps actually do better on the Jetson relative to big iron, not just in net FLOPs or ops but in raw performance as well. This can be due to the replacement of PCIe bus transfers with shared-memory references. Much has been written on the Rodinia benchmark speedups, for instance. I'm not sure how generalizable those results are, but if your CUDA apps can use UMA, they should show some speedups.
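The UMA point above can be sketched in a few lines. On Jetson the CPU and GPU share the same physical DRAM, so `cudaMallocManaged` hands both processors a pointer into the same memory, and the explicit `cudaMemcpy` staging that a discrete GPU needs over PCIe simply disappears. This is a minimal illustration, not a benchmark.

```cuda
// Minimal unified-memory sketch: one allocation visible to both the
// CPU and the GPU. On Jetson this avoids the host<->device copies
// that dominate many small workloads on PCIe-attached GPUs.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1;
}

int main() {
    const int n = 1024;
    int *data;

    // Single managed allocation; no cudaMemcpy anywhere below.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;   // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                   // wait before the CPU reads

    printf("data[0]=%d data[%d]=%d\n", data[0], n - 1, data[n - 1]);

    cudaFree(data);
    return 0;
}
```

The same source compiles and runs on a discrete GPU too, but there the CUDA runtime migrates pages behind the scenes, so the copies still happen; only on a physically shared-memory part like the TX1/TX2 do they truly go away.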