Jetson TK1 MPI Cluster

The original thread is locked, so I am opening a new one. ETA: original thread: https://devtalk.nvidia.com/default/topic/761234/

Does anyone have any success stories using the Jetson TK1 with John the Ripper specifically? I’d like to start building a small cluster of these development boards. I am also interested in MPI-based clusters that are relatively inexpensive but can tackle “real” workloads. I have built model clusters before (e.g., a 66-ARM-CPU cluster running MPI).

That model is great for lab work: simulating traffic loads for a network-based event, network attacks, DNS server attacks, UDP flooding, heavy loads, that sort of thing. But the compute is very anemic, and it is impractical as anything other than a teaching model.

Now that we have boards with GPUs on them that can be used the way we used standalone array processors in the old days (yes, I am older), I was thinking it might finally be affordable to build a “real” supercomputer on a budget, with the ability to keep adding nodes until it becomes a respectable tool.

I did a bit of searching around and there isn’t a lot of material about CUDA and JtR, whether or not OpenCL ever became a reality, etc. There also aren’t many practical guides on putting something like this together, which is what I thought I would try to produce once I had documented my own process of bootstrapping all this.

I was just going to ask Pyrex, who posted the original thread, via PM, but my several attempts only produced “Invalid Access” errors. So Pyrex, if you are still active, please send me a message!

I also have a newly acquired Jetson TX1 I hope to use for some video work and control I/O (personal lab use), for projects a Cubieboard or Raspberry Pi are much too anemic for.

Hi 1ST-Terminus, you could try using the CUDA jumbo branch of JtR: [url]https://github.com/magnumripper/JohnTheRipper/tree/CUDA[/url]

There aren’t plans to support OpenCL on Jetson. As per the previous topic, MPI should work normally.

Hi, thank you. I will use that when I actually start building. I had to reload my image on the TK1 and am rebuilding the whole environment. That factory image provides:

R21 (release), REVISION: 5.0, GCID: 7273100, BOARD: ardbeg, EABI: hard, DATE: Wed Jun 8 04:19:09 UTC 2016

I am installing cuda-repo-ubuntu1404_6.5-14_armhf.deb for that version and will try to start installing all the packages I need to build the code.
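For anyone following along, installing the toolkit from that repo package should be roughly the usual L4T sequence. I am writing this from memory, so verify the meta-package name with apt-cache search cuda first:

sudo dpkg -i cuda-repo-ubuntu1404_6.5-14_armhf.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-6-5
sudo usermod -a -G video $USER    # let the non-root user talk to the GPU
echo 'export PATH=/usr/local/cuda-6.5/bin:$PATH' >> ~/.bashrc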

Yes, thanks! I cloned it tonight but can’t work on it yet; I have to get some new barrel connectors and solder up a wiring harness for the (4) TK1s.

I also ordered a 20A switching supply. I bought a couple of 12V-to-5V DC converters so I can power a couple of USB hubs and a gigabit Ethernet switch, all of which I am combining into one complete unit with the four TK1 boards.

I am building a permanent cluster; my ambitions are just to have an appliance for finding weak passwords.

I figure ~1300 GFLOPS (four boards at roughly 327 GFLOPS FP32 each) should get decent results.

I understand full load is really close to 5 A per GPU at maximum utilization, so hopefully my new power supply will handle them. I work from home, so I would rather not burn a lot of power if I can avoid it.

A continuous load of 20 A DC is probably going to be about 2-4 A AC, which isn’t bad if it takes much less time to do the work.
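(Rough math, assuming a 12 V rail, ~85% supply efficiency, and 120 VAC mains: 12 V × 20 A = 240 W on the DC side, call it ~280 W at the wall, which is about 2.4 A AC.)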

I am already using MPI, though not OpenMPI (I use MPICH2 now), on the 66-ARM-CPU cluster, which is fun but not practical (it was only ever intended as a working model).

Using the (4) TK1 units and OpenMPI will be a more modern approach with real capabilities.

I will most likely continue to add nodes but we’ll see how these first four nodes work.
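Adding nodes later should just be a matter of another line in the OpenMPI machinefile. I expect something along these lines; the hostnames are what I plan to use, and the single slot per board is my choice so only one john process runs per node:

gpu01 slots=1
gpu02 slots=1
gpu03 slots=1
gpu04 slots=1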

Thanks again.

I compiled the code and everything went well. I intend to test some time this week with the 4 TK1s when I have some free time.
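For reference, the build itself was nothing exotic; from memory it was roughly the steps below. The --enable-mpi switch is jumbo's configure option for MPI, and I believe CUDA is picked up automatically once nvcc is on the PATH, so treat this as a sketch rather than gospel:

git clone -b CUDA https://github.com/magnumripper/JohnTheRipper.git JohnTheRipper-CUDA
cd JohnTheRipper-CUDA/src
./configure --enable-mpi
make -s clean && make -sj4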

ubuntu@tegra-ubuntu:~/JohnTheRipper-CUDA/src$ ../run/john --list=cuda-devices
CUDA runtime 6.5, driver 6.5 - 1 CUDA device found:

CUDA Device #0
Name: GK20A
Type: integrated
Compute capability: 3.2 (sm_32)
Number of stream processors: 192 (1 x 192)
Clock rate: 852 Mhz
Memory clock rate (peak) 924 Mhz
Memory bus width 64 bits
Peak memory bandwidth: 14 GB/s
Total global memory: 1.0 GB
Total shared memory per block: 48.0 KB
Total constant memory: 64.0 KB
L2 cache size 128.1 KB
Kernel execution timeout: No
Concurrent copy and execution: One direction
Concurrent kernels support: Yes
Warp size: 32
Max. GPRs/thread block 32768
Max. threads per block 1024
Max. resident threads per MP 2048
PCI device topology: 00:00.0

ubuntu@tegra-ubuntu:~/JohnTheRipper-CUDA/src$ ../run/john --list=formats --format=cuda
md5crypt-cuda, sha256crypt-cuda, sha512crypt-cuda, mscash-cuda, mscash2-cuda,
phpass-cuda, pwsafe-cuda, Raw-SHA512-cuda, wpapsk-cuda, xsha512-cuda,
Raw-SHA224-cuda, Raw-SHA256-cuda

The cluster currently looks like this…

Building the cluster (integrating John with CUDA support and OpenMPI) was straightforward with only one minor glitch.

I am operating with:

R21 (release), REVISION: 5.0, GCID: 7273100, BOARD: ardbeg, EABI: hard, DATE: Wed Jun 8 04:19:09 UTC 2016

After you compile John, in order to run it across the cluster (I am NFS-mounting the john “run” directory so that all nodes in the cluster can read, write, and execute its contents), you must add a new directory to your LD library path.
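For completeness, the NFS part is nothing special. On the head node (gpu01 in my case) I export the shared tree and the workers mount it; the /master path matches what I use for the MPI tests later, and the subnet is just a placeholder for your own LAN:

# /etc/exports on gpu01 (head node)
/master 192.168.1.0/24(rw,sync,no_subtree_check)

# /etc/fstab entry on each worker node
gpu01:/master  /master  nfs  defaults  0  0

After editing, an exportfs -ra on gpu01 and a mount -a on each worker picks it up.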

Adding that LD path is normally all you need, but there is an error in the shared libraries shipped with the R21.5 release.

I compiled and ran john standalone (single node) and everything went well. When I tried to execute the binaries under the OpenMPI framework (via mpirun) it spewed errors about not finding a library. I updated /etc/ld.so.conf.d/nvidia-tegra.conf so that the CUDA library path appears below the existing Tegra path:

/usr/lib/arm-linux-gnueabihf/tegra
/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib

However, when I ran ldconfig, it produced an error!

root@gpu02:/etc/ld.so.conf.d# ldconfig
/sbin/ldconfig.real: /usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib/libcudnn.so.6.5 is not a symbolic link

root@gpu02:/etc/ld.so.conf.d# cd /usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ls -l *cudnn*
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 9308614 Apr 26 21:49 libcudnn_static.a

You have to remove the two incorrectly installed shared libraries and recreate them as proper symlinks, i.e.:

root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# rm libcudnn.so libcudnn.so.6.5
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ln -s libcudnn.so.6.5.48 libcudnn.so.6.5
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ln -s libcudnn.so.6.5.48 libcudnn.so
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ls -l *cudnn*
lrwxrwxrwx 1 root root 18 May 25 01:02 libcudnn.so -> libcudnn.so.6.5.48
lrwxrwxrwx 1 root root 18 May 25 01:02 libcudnn.so.6.5 -> libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 8978224 Apr 26 21:49 libcudnn.so.6.5.48
-rwxr-xr-x 1 root root 9308614 Apr 26 21:49 libcudnn_static.a

root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib# ldconfig
root@gpu02:/usr/local/cuda-6.5/targets/armv7-linux-gnueabihf/lib#
No errors!

Adding the new path to ld.so.conf solved the issue with the shared library not being found, and the other cluster nodes were able to execute john normally.

The status output below is from a run against an md5 hash.

mpirun: Forwarding signal 10 to job
1 0g 0:00:21:49 57.16% 2/3 (ETA: 19:56:44) 0g/s 67059p/s 67059c/s 67059C/s Bakenttnekab2…Haafssfaah2
3 0g 0:00:21:49 74.18% 2/3 (ETA: 15:47:58) 0g/s 65704p/s 65704c/s 65704C/s Novelet?..Outrepasserons?
2 0g 0:00:21:49 37.14% 2/3 (ETA: 20:17:18) 0g/s 54131p/s 54131c/s 54131C/s sudsy?..toatoa?
4 0g 0:00:21:49 47.60% 2/3 (ETA: 20:04:24) 0g/s 59701p/s 59701c/s 59701C/s Dortohg…Nerace
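For reference, the run itself is just john launched under mpirun, along these lines; the hash file name and the /master/john path are placeholders for wherever your NFS-shared run directory and input actually live:

mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/john/run/john hashes.txt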

ubuntu@gpu01:/master/mpi_tests$ ./stress

  • true
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
    gpu02 19:51:01 up 23:31, 1 user, load average: 0.28, 0.26, 0.20
    gpu01 19:51:01 up 6:02, 6 users, load average: 0.85, 0.48, 0.32
    gpu03 15:51:01 up 5:47, 2 users, load average: 0.46, 0.62, 0.46
    gpu04 19:51:00 up 15:02, 1 user, load average: 0.18, 0.24, 0.22
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
    Hello, Cluster! Python process 1 of 4 on gpu01.
    Hello, Cluster! Python process 2 of 4 on gpu02.
    Hello, Cluster! Python process 3 of 4 on gpu03.
    Hello, Cluster! Python process 4 of 4 on gpu04.
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/cpi
    Process 1 of 4 is on gpu01
    Process 2 of 4 is on gpu02
    Process 4 of 4 is on gpu04
    Process 3 of 4 is on gpu03
    pi is approximately 3.1415926544231239, Error is 0.0000000008333307
    wall clock time = 0.002959
  • true
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
    gpu02 19:51:03 up 23:31, 1 user, load average: 0.34, 0.27, 0.21
    gpu03 15:51:03 up 5:47, 2 users, load average: 0.46, 0.62, 0.46
    gpu01 19:51:03 up 6:02, 6 users, load average: 0.94, 0.51, 0.33
    gpu04 19:51:03 up 15:02, 1 user, load average: 0.33, 0.27, 0.22
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
    Hello, Cluster! Python process 2 of 4 on gpu02.
    Hello, Cluster! Python process 1 of 4 on gpu01.
    Hello, Cluster! Python process 3 of 4 on gpu03.
    Hello, Cluster! Python process 4 of 4 on gpu04.
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/cpi
    Process 1 of 4 is on gpu01
    Process 2 of 4 is on gpu02
    Process 3 of 4 is on gpu03
    Process 4 of 4 is on gpu04
    pi is approximately 3.1415926544231239, Error is 0.0000000008333307
    wall clock time = 0.003526
  • true
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/system
    gpu02 19:51:05 up 23:31, 1 user, load average: 0.34, 0.27, 0.21
    gpu01 19:51:05 up 6:02, 6 users, load average: 0.94, 0.51, 0.33
    gpu03 15:51:05 up 5:47, 2 users, load average: 0.82, 0.70, 0.48
    gpu04 19:51:05 up 15:02, 1 user, load average: 0.33, 0.27, 0.22
  • mpirun -n 4 -machinefile /master/mpi_tests/machinefile /master/mpi_tests/helloworld.py
    ^Cmpirun: killing job…
    ubuntu@gpu01:/master/mpi_tests$
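The stress script itself is nothing fancy; the bulleted lines above are just the shell's -x trace of it. Stripped down, the idea is basically a loop that keeps firing the three MPI sanity tests so I can watch load, clock skew, and power draw (the real script has a bit more in it):

#!/bin/sh
# Loop the MPI sanity tests across all four nodes.
set -x
MF=/master/mpi_tests/machinefile
while true
do
    mpirun -n 4 -machinefile $MF /master/mpi_tests/system
    mpirun -n 4 -machinefile $MF /master/mpi_tests/helloworld.py
    mpirun -n 4 -machinefile $MF /master/mpi_tests/cpi
done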

ETA: I see I need to get ntpd running (note the clock skew on gpu03 in the output above)… sigh.
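(On Ubuntu 14.04 that should just be a matter of sudo apt-get install ntp on each node; I assume the default pool servers will be fine for keeping the four boards in sync.)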

I wanted to add that all 4 boards are only drawing a combined 3 A (on the DC side). This was much less than I had expected; I assume performing SHA-512 hashing is not very demanding in reality.