TAO with multiple GPUs

INFO:tensorflow:Running local_init_op.
2022-01-12 12:43:01,237 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-01-12 12:43:01,799 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2022-01-12 12:43:02,029 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2022-01-12 12:43:04,971 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2022-01-12 12:43:05,778 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2022-01-12 12:43:19,444 [INFO] tensorflow: Saving checkpoints for step-0.
3b0a02cd3486:66:135 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
3b0a02cd3486:66:135 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
3b0a02cd3486:66:135 [0] NCCL INFO NET/IB : No device found.
3b0a02cd3486:66:135 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
3b0a02cd3486:66:135 [0] NCCL INFO Using network Socket
NCCL version 2.9.9+cuda11.3
3b0a02cd3486:67:134 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
3b0a02cd3486:67:134 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
3b0a02cd3486:67:134 [1] NCCL INFO NET/IB : No device found.
3b0a02cd3486:67:134 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0> [1]eth0:172.17.0.2<0>
3b0a02cd3486:67:134 [1] NCCL INFO Using network Socket
3b0a02cd3486:66:135 [0] NCCL INFO Channel 00/02 : 0 1
3b0a02cd3486:66:135 [0] NCCL INFO Channel 01/02 : 0 1
3b0a02cd3486:66:135 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
3b0a02cd3486:67:134 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
3b0a02cd3486:67:134 [1] NCCL INFO Channel 00 : 1[7000] -> 0[4000] via P2P/IPC
3b0a02cd3486:67:134 [1] NCCL INFO Channel 01 : 1[7000] -> 0[4000] via P2P/IPC
3b0a02cd3486:66:135 [0] NCCL INFO Channel 00 : 0[4000] -> 1[7000] via P2P/IPC
3b0a02cd3486:66:135 [0] NCCL INFO Channel 01 : 0[4000] -> 1[7000] via P2P/IPC
3b0a02cd3486:67:134 [1] NCCL INFO Connected all rings
3b0a02cd3486:67:134 [1] NCCL INFO Connected all trees
3b0a02cd3486:66:135 [0] NCCL INFO Connected all rings
3b0a02cd3486:66:135 [0] NCCL INFO Connected all trees
3b0a02cd3486:67:134 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
3b0a02cd3486:67:134 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3b0a02cd3486:66:135 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
3b0a02cd3486:66:135 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
3b0a02cd3486:67:134 [1] NCCL INFO comm 0x7f9340335440 rank 1 nranks 2 cudaDev 1 busId 7000 - Init COMPLETE
3b0a02cd3486:66:135 [0] NCCL INFO comm 0x7ff58c3a42c0 rank 0 nranks 2 cudaDev 0 busId 4000 - Init COMPLETE
3b0a02cd3486:66:135 [0] NCCL INFO Launch mode Parallel

Is this an error, or just an informational message? No further logs are printed after this!

There is no error info.

The training doesn’t start: the GPUs are being used, but even after hours of waiting nothing gets saved…

How did you start the docker container, and how did you run the training? Can you share the full commands as well?
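For reference, the container is usually started with the GPUs exposed to Docker before the training command is run inside it. A rough sketch only; the image tag and mount paths below are placeholders, not the command used in this thread:

docker run --runtime=nvidia -it --rm \
    -v /path/to/workspace:/workspace \
    nvcr.io/nvidia/tao/tao-toolkit-tf:<tag>

On newer Docker versions, --gpus all can be used instead of --runtime=nvidia to make all GPUs visible inside the container.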

Hi,

The training started after disabling one parameter and using all of the GPUs… thank you!
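For anyone searching later: multi-GPU training in TAO is typically requested through the --gpus option of the train command. A rough sketch, assuming a detectnet_v2 network; the spec path, results directory, and key are placeholders, and the spec parameter that had to be disabled is not shown here:

detectnet_v2 train --gpus 2 \
    -e /workspace/specs/train_spec.txt \
    -r /workspace/results \
    -k $KEY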

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one.
Thanks

Can you share the command and log? Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.