Hi,
I’m trying to run the detectnet_v2 example and I hit an "Illegal instruction" error when I execute the tlt-train command. I’m running it on an RTX 2080 GPU in a Dell R720 server. Here’s the output log:
root@2a9a93f3988b:/workspace# tlt-train detectnet_v2 -e $SPECS_DIR/detectnet_v2_train_resnet18_kitti.txt -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned -k $KEY -n resnet18_detector
Using TensorFlow backend.
[[2548,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: 2a9a93f3988b
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
2019-10-28 22:38:12.469830: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x65adf50 executing computations on platform CUDA. Devices:
2019-10-28 22:38:12.469899: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce RTX 2080, Compute Capability 7.5
2019-10-28 22:38:12.473510: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3299965000 Hz
2019-10-28 22:38:12.475618: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x6618c90 executing computations on platform Host. Devices:
2019-10-28 22:38:12.475665: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2019-10-28 22:38:12.475894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 major: 7 minor: 5 memoryClockRate(GHz): 1.71
pciBusID: 0000:42:00.0
totalMemory: 7.79GiB freeMemory: 7.68GiB
2019-10-28 22:38:12.475937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-10-28 22:38:12.476720: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-28 22:38:12.476743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-10-28 22:38:12.476757: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-10-28 22:38:12.476895: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7469 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:42:00.0, compute capability: 7.5)
2019-10-28 22:38:12,478 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/examples/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt.
2019-10-28 22:38:12,479 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/examples/detectnet_v2/specs/detectnet_v2_train_resnet18_kitti.txt
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
tf.data.TFRecordDataset(path)
2019-10-28 22:38:12,493 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
tf.data.TFRecordDataset(path)
2019-10-28 22:38:12,608 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 6434 samples with a batch size of 4; each epoch will therefore take one extra step.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-10-28 22:38:12,615 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2019-10-28 22:38:12,629 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 3, 384, 1248) 0
conv1 (Conv2D) (None, 64, 192, 624) 9472 input_1[0][0]
bn_conv1 (BatchNormalization) (None, 64, 192, 624) 256 conv1[0][0]
activation_1 (Activation) (None, 64, 192, 624) 0 bn_conv1[0][0]
block_1a_conv_1 (Conv2D) (None, 64, 96, 312) 36928 activation_1[0][0]
block_1a_bn_1 (BatchNormalizati (None, 64, 96, 312) 256 block_1a_conv_1[0][0]
activation_2 (Activation) (None, 64, 96, 312) 0 block_1a_bn_1[0][0]
block_1a_conv_2 (Conv2D) (None, 64, 96, 312) 36928 activation_2[0][0]
block_1a_conv_shortcut (Conv2D) (None, 64, 96, 312) 4160 activation_1[0][0]
block_1a_bn_2 (BatchNormalizati (None, 64, 96, 312) 256 block_1a_conv_2[0][0]
block_1a_bn_shortcut (BatchNorm (None, 64, 96, 312) 256 block_1a_conv_shortcut[0][0]
add_1 (Add) (None, 64, 96, 312) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]
activation_3 (Activation) (None, 64, 96, 312) 0 add_1[0][0]
block_1b_conv_1 (Conv2D) (None, 64, 96, 312) 36928 activation_3[0][0]
block_1b_bn_1 (BatchNormalizati (None, 64, 96, 312) 256 block_1b_conv_1[0][0]
activation_4 (Activation) (None, 64, 96, 312) 0 block_1b_bn_1[0][0]
block_1b_conv_2 (Conv2D) (None, 64, 96, 312) 36928 activation_4[0][0]
block_1b_bn_2 (BatchNormalizati (None, 64, 96, 312) 256 block_1b_conv_2[0][0]
add_2 (Add) (None, 64, 96, 312) 0 block_1b_bn_2[0][0]
activation_3[0][0]
activation_5 (Activation) (None, 64, 96, 312) 0 add_2[0][0]
block_2a_conv_1 (Conv2D) (None, 128, 48, 156) 73856 activation_5[0][0]
block_2a_bn_1 (BatchNormalizati (None, 128, 48, 156) 512 block_2a_conv_1[0][0]
activation_6 (Activation) (None, 128, 48, 156) 0 block_2a_bn_1[0][0]
block_2a_conv_2 (Conv2D) (None, 128, 48, 156) 147584 activation_6[0][0]
block_2a_conv_shortcut (Conv2D) (None, 128, 48, 156) 8320 activation_5[0][0]
block_2a_bn_2 (BatchNormalizati (None, 128, 48, 156) 512 block_2a_conv_2[0][0]
block_2a_bn_shortcut (BatchNorm (None, 128, 48, 156) 512 block_2a_conv_shortcut[0][0]
add_3 (Add) (None, 128, 48, 156) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]
activation_7 (Activation) (None, 128, 48, 156) 0 add_3[0][0]
block_2b_conv_1 (Conv2D) (None, 128, 48, 156) 147584 activation_7[0][0]
block_2b_bn_1 (BatchNormalizati (None, 128, 48, 156) 512 block_2b_conv_1[0][0]
activation_8 (Activation) (None, 128, 48, 156) 0 block_2b_bn_1[0][0]
block_2b_conv_2 (Conv2D) (None, 128, 48, 156) 147584 activation_8[0][0]
block_2b_bn_2 (BatchNormalizati (None, 128, 48, 156) 512 block_2b_conv_2[0][0]
add_4 (Add) (None, 128, 48, 156) 0 block_2b_bn_2[0][0]
activation_7[0][0]
activation_9 (Activation) (None, 128, 48, 156) 0 add_4[0][0]
block_3a_conv_1 (Conv2D) (None, 256, 24, 78) 295168 activation_9[0][0]
block_3a_bn_1 (BatchNormalizati (None, 256, 24, 78) 1024 block_3a_conv_1[0][0]
activation_10 (Activation) (None, 256, 24, 78) 0 block_3a_bn_1[0][0]
block_3a_conv_2 (Conv2D) (None, 256, 24, 78) 590080 activation_10[0][0]
block_3a_conv_shortcut (Conv2D) (None, 256, 24, 78) 33024 activation_9[0][0]
block_3a_bn_2 (BatchNormalizati (None, 256, 24, 78) 1024 block_3a_conv_2[0][0]
block_3a_bn_shortcut (BatchNorm (None, 256, 24, 78) 1024 block_3a_conv_shortcut[0][0]
add_5 (Add) (None, 256, 24, 78) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]
activation_11 (Activation) (None, 256, 24, 78) 0 add_5[0][0]
block_3b_conv_1 (Conv2D) (None, 256, 24, 78) 590080 activation_11[0][0]
block_3b_bn_1 (BatchNormalizati (None, 256, 24, 78) 1024 block_3b_conv_1[0][0]
activation_12 (Activation) (None, 256, 24, 78) 0 block_3b_bn_1[0][0]
block_3b_conv_2 (Conv2D) (None, 256, 24, 78) 590080 activation_12[0][0]
block_3b_bn_2 (BatchNormalizati (None, 256, 24, 78) 1024 block_3b_conv_2[0][0]
add_6 (Add) (None, 256, 24, 78) 0 block_3b_bn_2[0][0]
activation_11[0][0]
activation_13 (Activation) (None, 256, 24, 78) 0 add_6[0][0]
block_4a_conv_1 (Conv2D) (None, 512, 24, 78) 1180160 activation_13[0][0]
block_4a_bn_1 (BatchNormalizati (None, 512, 24, 78) 2048 block_4a_conv_1[0][0]
activation_14 (Activation) (None, 512, 24, 78) 0 block_4a_bn_1[0][0]
block_4a_conv_2 (Conv2D) (None, 512, 24, 78) 2359808 activation_14[0][0]
block_4a_conv_shortcut (Conv2D) (None, 512, 24, 78) 131584 activation_13[0][0]
block_4a_bn_2 (BatchNormalizati (None, 512, 24, 78) 2048 block_4a_conv_2[0][0]
block_4a_bn_shortcut (BatchNorm (None, 512, 24, 78) 2048 block_4a_conv_shortcut[0][0]
add_7 (Add) (None, 512, 24, 78) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]
activation_15 (Activation) (None, 512, 24, 78) 0 add_7[0][0]
block_4b_conv_1 (Conv2D) (None, 512, 24, 78) 2359808 activation_15[0][0]
block_4b_bn_1 (BatchNormalizati (None, 512, 24, 78) 2048 block_4b_conv_1[0][0]
activation_16 (Activation) (None, 512, 24, 78) 0 block_4b_bn_1[0][0]
block_4b_conv_2 (Conv2D) (None, 512, 24, 78) 2359808 activation_16[0][0]
block_4b_bn_2 (BatchNormalizati (None, 512, 24, 78) 2048 block_4b_conv_2[0][0]
add_8 (Add) (None, 512, 24, 78) 0 block_4b_bn_2[0][0]
activation_15[0][0]
activation_17 (Activation) (None, 512, 24, 78) 0 add_8[0][0]
output_bbox (Conv2D) (None, 12, 24, 78) 6156 activation_17[0][0]
output_cov (Conv2D) (None, 3, 24, 78) 1539 activation_17[0][0]
Total params: 11,203,023
Trainable params: 11,193,295
Non-trainable params: 9,728
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2019-10-28 22:38:28,811 [INFO] iva.detectnet_v2.scripts.train: Found 6434 samples in training set
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2019-10-28 22:38:36,257 [INFO] iva.detectnet_v2.scripts.train: Found 1047 samples in validation set
INFO:tensorflow:Create CheckpointSaverHook.
2019-10-28 22:38:42,325 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2019-10-28 22:38:43,744 [INFO] tensorflow: Graph was finalized.
2019-10-28 22:38:43.745938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-10-28 22:38:43.746047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-10-28 22:38:43.746066: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2019-10-28 22:38:43.746081: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2019-10-28 22:38:43.746247: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 7469 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080, pci bus id: 0000:42:00.0, compute capability: 7.5)
INFO:tensorflow:Running local_init_op.
2019-10-28 22:38:46,530 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2019-10-28 22:38:46,878 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2019-10-28 22:39:01,161 [INFO] tensorflow: Saving checkpoints for step-0.
2019-10-28 22:39:29.080295: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2019-10-28 22:39:29.386243: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6674b30
/usr/local/bin/tlt-train: line 32: 49 Illegal instruction (core dumped) tlt-train-g1 ${PYTHON_ARGS[*]}
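In case it helps with diagnosis: as far as I know, "Illegal instruction" generally means the process executed a CPU instruction that the host processor does not support, so one thing I can check is which SIMD extensions this server's CPU reports. Below is a minimal sketch, assuming an x86 Linux host where /proc/cpuinfo contains a "flags" line. The helper names and the list of flags are my own picks for illustration, not anything taken from TLT, and I have not confirmed that a missing CPU instruction is the actual cause here.

```python
import os

# Instruction-set extensions that prebuilt numeric libraries commonly rely on.
# This list is an assumption for illustration, not taken from TLT documentation.
SIMD_FLAGS = ("sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags on the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()  # no 'flags' line found (e.g. non-x86 host or empty input)


def simd_support(flags, wanted=SIMD_FLAGS):
    """Map each wanted flag to True/False depending on whether the CPU reports it."""
    return {name: name in flags for name in wanted}


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, or return an empty string if it does not exist."""
    if not os.path.exists(path):
        return ""
    with open(path) as handle:
        return handle.read()


if __name__ == "__main__":
    report = simd_support(parse_cpu_flags(read_cpuinfo()))
    for name in SIMD_FLAGS:
        print("%-8s %s" % (name, "yes" if report[name] else "MISSING"))
```

Running this both on the host and inside the container should print the same flags, since the container shares the host CPU. Any extension shown as MISSING would be a candidate to compare against whatever the container's binaries were built for.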
Any ideas what could be causing this?