So, doing that with 4 GPUs, I get the following error:
2023-03-22 14:21:11,533 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:21:11,533 [INFO] root: Rasterizing tensors.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:11,658 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:11,762 [INFO] root: Tensors rasterized.
2023-03-22 14:21:12,110 [INFO] root: Validation graph built.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:12,944 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:13,403 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,444 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:21:13,455 [INFO] root: Running training loop.
2023-03-22 14:21:13,456 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:21:13,456 [INFO] __main__: Scalars logged at every 10 steps
2023-03-22 14:21:13,456 [INFO] __main__: Images logged at every 2690 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:21:13,461 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:13,919 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:21:16,203 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:21:18,931 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:21:19,679 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:21:30,357 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:6360:6921 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:6360:6921 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:6360:6921 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:6360:6921 [0] NCCL INFO P2P plugin IBext
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:6360:6921 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:6360:6921 [0] NCCL INFO Using network Socket
5d5684b73250:6360:6921 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00/04 : 0 1 2 3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 01/04 : 0 3 2 1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02/04 : 0 1 2 3
5d5684b73250:6360:6921 [0] NCCL INFO Channel 03/04 : 0 3 2 1
5d5684b73250:6360:6921 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 3/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 3/-1/-1->0->-1
5d5684b73250:6360:6921 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:6360:6921 [0] NCCL INFO Channel 02 : 0[3b000] -> 1[5e000] via SHM/direct/direct
[5d5684b73250:6360 :0:6921] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:6363 :0:6931] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 6921) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000007b0ef ncclGroupEnd() ???:0
4 0x0000000000059e97 ncclGetUniqueId() ???:0
5 0x00000000000489b1 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
6 0x000000000004a655 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
7 0x00000000000652a6 ncclRedOpDestroy() ???:0
8 0x000000000004ae3b ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
9 0x000000000004b098 ncclCommInitRank() ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm() /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute() /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce() /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
23 0x0000000000008609 start_thread() ???:0
24 0x000000000011f133 clone() ???:0
=================================
[5d5684b73250:06360] *** Process received signal ***
[5d5684b73250:06360] Signal: Bus error (7)
[5d5684b73250:06360] Signal code: (-6)
[5d5684b73250:06360] Failing at address: 0x18d8
[5d5684b73250:06360] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7f2d243b0090]
[5d5684b73250:06360] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7f2d244f8b41]
[5d5684b73250:06360] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7f2c15ce287d]
[5d5684b73250:06360] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7f2c15ce80ef]
[5d5684b73250:06360] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7f2c15cc6e97]
[5d5684b73250:06360] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7f2c15cb59b1]
[5d5684b73250:06360] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7f2c15cb7655]
[5d5684b73250:06360] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7f2c15cd22a6]
[5d5684b73250:06360] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7f2c15cb7e3b]
[5d5684b73250:06360] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7f2c15cb8098]
[5d5684b73250:06360] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7f2b6d7c1354]
[5d5684b73250:06360] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7f2b6d7c1581]
[5d5684b73250:06360] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7f2b6d7833cd]
[5d5684b73250:06360] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7f2b6d7837fc]
[5d5684b73250:06360] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7f2b6d75202d]
[5d5684b73250:06360] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7f2d23718de4]
[5d5684b73250:06360] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7f2d24352609]
[5d5684b73250:06360] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7f2d2448c133]
[5d5684b73250:06360] *** End of error message ***
==== backtrace (tid: 6931) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000007b0ef ncclGroupEnd() ???:0
4 0x0000000000059e97 ncclGetUniqueId() ???:0
5 0x00000000000489b1 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
6 0x000000000004a655 ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
7 0x00000000000652a6 ncclRedOpDestroy() ???:0
8 0x000000000004ae3b ???() /usr/lib/x86_64-linux-gnu/libnccl.so.2:0
9 0x000000000004b098 ncclCommInitRank() ???:0
10 0x0000000000118354 horovod::common::NCCLOpContext::InitNCCLComm() /opt/horovod/horovod/common/ops/nccl_operations.cc:113
11 0x0000000000118581 horovod::common::NCCLAllreduce::Execute() /opt/horovod/horovod/common/ops/nccl_operations.cc:180
12 0x00000000000da3cd horovod::common::OperationManager::ExecuteAllreduce() /opt/horovod/horovod/common/ops/operation_manager.cc:46
13 0x00000000000da7fc horovod::common::OperationManager::ExecuteOperation() /opt/horovod/horovod/common/ops/operation_manager.cc:112
14 0x00000000000a902d horovod::common::(anonymous namespace)::BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:297
15 0x00000000000a902d std::__shared_ptr<CUevent_st*, (__gnu_cxx::_Lock_policy)2>::operator=() /usr/include/c++/9/bits/shared_ptr_base.h:1265
16 0x00000000000a902d std::shared_ptr<CUevent_st*>::operator=() /usr/include/c++/9/bits/shared_ptr.h:335
17 0x00000000000a902d horovod::common::Event::operator=() /opt/horovod/horovod/common/common.h:185
18 0x00000000000a902d horovod::common::Status::operator=() /opt/horovod/horovod/common/common.h:197
19 0x00000000000a902d PerformOperation() /opt/horovod/horovod/common/operations.cc:297
20 0x00000000000a902d RunLoopOnce() /opt/horovod/horovod/common/operations.cc:787
21 0x00000000000a902d BackgroundThreadLoop() /opt/horovod/horovod/common/operations.cc:651
22 0x00000000000d6de4 std::error_code::default_error_condition() ???:0
23 0x0000000000008609 start_thread() ???:0
24 0x000000000011f133 clone() ???:0
=================================
[5d5684b73250:06363] *** Process received signal ***
[5d5684b73250:06363] Signal: Bus error (7)
[5d5684b73250:06363] Signal code: (-6)
[5d5684b73250:06363] Failing at address: 0x18db
[5d5684b73250:06363] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff118fff090]
[5d5684b73250:06363] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff119147b41]
[5d5684b73250:06363] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff00a93187d]
[5d5684b73250:06363] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7b0ef)[0x7ff00a9370ef]
[5d5684b73250:06363] [ 4] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x59e97)[0x7ff00a915e97]
[5d5684b73250:06363] [ 5] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x489b1)[0x7ff00a9049b1]
[5d5684b73250:06363] [ 6] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4a655)[0x7ff00a906655]
[5d5684b73250:06363] [ 7] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x652a6)[0x7ff00a9212a6]
[5d5684b73250:06363] [ 8] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x4ae3b)[0x7ff00a906e3b]
[5d5684b73250:06363] [ 9] /usr/lib/x86_64-linux-gnu/libnccl.so.2(ncclCommInitRank+0xd8)[0x7ff00a907098]
[5d5684b73250:06363] [10] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLOpContext12InitNCCLCommERKSt6vectorINS0_16TensorTableEntryESaIS3_EERKS2_IiSaIiEE+0x284)[0x7ff005d87354]
[5d5684b73250:06363] [11] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZN7horovod6common13NCCLAllreduce7ExecuteERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x61)[0x7ff005d87581]
[5d5684b73250:06363] [12] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteAllreduceERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseE+0x7d)[0x7ff005d493cd]
[5d5684b73250:06363] [13] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(_ZNK7horovod6common16OperationManager16ExecuteOperationERSt6vectorINS0_16TensorTableEntryESaIS3_EERKNS0_8ResponseERNS0_10ProcessSetE+0x4c)[0x7ff005d497fc]
[5d5684b73250:06363] [14] /usr/local/lib/python3.6/dist-packages/horovod/tensorflow/mpi_lib.cpython-36m-x86_64-linux-gnu.so(+0xa902d)[0x7ff005d1802d]
[5d5684b73250:06363] [15] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x7ff118367de4]
[5d5684b73250:06363] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff118fa1609]
[5d5684b73250:06363] [17] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff1190db133]
[5d5684b73250:06363] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).
With 3 GPUs, I get this error:
2023-03-22 14:24:52,808 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,000 [INFO] __main__: Found 8600 samples in training set
2023-03-22 14:24:53,006 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:53,224 [INFO] root: Tensors rasterized.
2023-03-22 14:24:53,493 [INFO] root: Training graph built.
2023-03-22 14:24:53,493 [INFO] root: Running training loop.
2023-03-22 14:24:53,493 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:53,493 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:53,493 [INFO] __main__: Images logged at every 0 steps
2023-03-22 14:24:54,763 [INFO] root: Training graph built.
2023-03-22 14:24:54,763 [INFO] root: Running training loop.
2023-03-22 14:24:54,763 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:54,763 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:54,763 [INFO] __main__: Images logged at every 0 steps
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:54,791 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,127 [INFO] root: Training graph built.
2023-03-22 14:24:56,127 [INFO] root: Building validation graph.
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 80, io threads: 160, compute threads: 80, buffered batches: 4
2023-03-22 14:24:56,128 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 1400, number of sources: 1, batch size per gpu: 4, steps: 350
WARNING:tensorflow:Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,140 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7f0317ebf198>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
INFO:tensorflow:Graph was finalized.
2023-03-22 14:24:56,145 [INFO] tensorflow: Graph was finalized.
2023-03-22 14:24:56,159 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2023-03-22 14:24:56,393 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2023-03-22 14:24:56,397 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,412 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7f01cc648438>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-03-22 14:24:56,640 [INFO] __main__: Found 1400 samples in validation set
2023-03-22 14:24:56,640 [INFO] root: Rasterizing tensors.
2023-03-22 14:24:56,857 [INFO] root: Tensors rasterized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:57,104 [INFO] tensorflow: Running local_init_op.
2023-03-22 14:24:57,184 [INFO] root: Validation graph built.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:57,570 [INFO] tensorflow: Done running local_init_op.
2023-03-22 14:24:58,493 [INFO] root: Running training loop.
2023-03-22 14:24:58,494 [INFO] __main__: Checkpoint interval: 10
2023-03-22 14:24:58,494 [INFO] __main__: Scalars logged at every 14 steps
2023-03-22 14:24:58,494 [INFO] __main__: Images logged at every 3585 steps
INFO:tensorflow:Create CheckpointSaverHook.
2023-03-22 14:24:58,497 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:24:58,607 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:24:59,102 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Graph was finalized.
2023-03-22 14:25:01,129 [INFO] tensorflow: Graph was finalized.
INFO:tensorflow:Running local_init_op.
2023-03-22 14:25:03,774 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2023-03-22 14:25:04,510 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2023-03-22 14:25:14,773 [INFO] tensorflow: Saving checkpoints for step-0.
5d5684b73250:8162:8585 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v5)
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v5)
5d5684b73250:8162:8585 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.15.5+cuda11.8
5d5684b73250:8162:8585 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
5d5684b73250:8162:8585 [0] NCCL INFO P2P plugin IBext
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/IB : No device found.
5d5684b73250:8162:8585 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
5d5684b73250:8162:8585 [0] NCCL INFO Using network Socket
5d5684b73250:8162:8585 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00/02 : 0 1 2
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01/02 : 0 1 2
5d5684b73250:8162:8585 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
5d5684b73250:8162:8585 [0] NCCL INFO Channel 00 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Channel 01 : 0[3b000] -> 1[5e000] via SHM/direct/direct
5d5684b73250:8162:8585 [0] NCCL INFO Connected all rings
[5d5684b73250:8165 :0:9402] Caught signal 7 (Bus error: nonexistent physical address)
[5d5684b73250:8162 :0:8585] Caught signal 7 (Bus error: nonexistent physical address)
==== backtrace (tid: 9402) ====
0 0x0000000000043090 killpg() ???:0
1 0x000000000018bb41 __nss_database_lookup() ???:0
2 0x000000000007587d ncclGroupEnd() ???:0
3 0x000000000006b246 ncclGroupEnd() ???:0
4 0x0000000000008609 start_thread() ???:0
5 0x000000000011f133 clone() ???:0
=================================
[5d5684b73250:08165] *** Process received signal ***
[5d5684b73250:08165] Signal: Bus error (7)
[5d5684b73250:08165] Signal code: (-6)
[5d5684b73250:08165] Failing at address: 0x1fe5
[5d5684b73250:08165] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7ff4efdb0090]
[5d5684b73250:08165] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x18bb41)[0x7ff4efef8b41]
[5d5684b73250:08165] [ 2] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x7587d)[0x7ff3e16e287d]
[5d5684b73250:08165] [ 3] /usr/lib/x86_64-linux-gnu/libnccl.so.2(+0x6b246)[0x7ff3e16d8246]
[5d5684b73250:08165] [ 4] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x8609)[0x7ff4efd52609]
[5d5684b73250:08165] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x7ff4efe8c133]
[5d5684b73250:08165] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node 5d5684b73250 exited on signal 7 (Bus error).