Issue in experimenting with transfer learning

Hi,
I am trying to train the network with additional custom training data, but I am getting "Segmentation fault (core dumped)".
Any idea what could be the reason?

I tried to follow steps from https://www.youtube.com/watch?v=pzJFzwHKDRo&t=624s

crevavi@crevavi-desktop:~/jetson-inference/python/training/classification$ python train.py --model-dir=utensils ~/datasets/utensils/ --epochs=1 --batch-size=4
Use GPU: 0 for training
=> dataset classes: 3 ['background', 'fork', 'spoon']
=> using pre-trained model 'resnet18'
=> reshaped ResNet fully-connected layer with: Linear(in_features=512, out_features=3, bias=True)
Epoch: [0][ 0/23] Time 60.684 (60.684) Data 1.600 ( 1.600) Loss 1.6361e+00 (1.6361e+00) Acc@1 25.00 ( 25.00) Acc@5 100.00 (100.00)
Epoch: [0][10/23] Time 0.751 ( 6.571) Data 0.000 ( 0.157) Loss 1.4579e+00 (1.4134e+01) Acc@1 50.00 ( 36.36) Acc@5 100.00 (100.00)
Epoch: [0][20/23] Time 0.752 ( 3.801) Data 0.000 ( 0.102) Loss 4.4567e+00 (1.6097e+01) Acc@1 50.00 ( 35.71) Acc@5 100.00 (100.00)
Epoch: [0] completed, elapsed time 85.990 seconds
Test: [ 0/23] Time 2.100 ( 2.100) Loss 2.8461e+05 (2.8461e+05) Acc@1 0.00 ( 0.00) Acc@5 100.00 (100.00)
Test: [10/23] Time 0.273 ( 0.438) Loss 0.0000e+00 (2.4456e+05) Acc@1 100.00 ( 29.55) Acc@5 100.00 (100.00)
Test: [20/23] Time 0.268 ( 0.359) Loss 1.3643e+05 (1.6559e+05) Acc@1 0.00 ( 35.71) Acc@5 100.00 (100.00)

 * Acc@1 32.967 Acc@5 100.000
saved best model to: utensils/model_best.pth.tar
Segmentation fault (core dumped)
crevavi@crevavi-desktop:~/jetson-inference/python/training/classification$

Hi @sunil76joshi, the error only occurs after training has completed, so you can ignore it for now.

You may want to train it for more than one epoch to see if the accuracy improves.
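To compare accuracy across epochs, the per-epoch summary lines can be pulled out of a saved training log. This is only a minimal sketch: it assumes the summary line looks like the one in the log above (`* Acc@1 ... Acc@5 ...`), and the function name `epoch_accuracies` is made up for illustration.

```python
import re

# Matches the per-epoch summary line, e.g. " * Acc@1 32.967 Acc@5 100.000".
# A "•" bullet is accepted too, because pasted logs sometimes show it instead of "*".
SUMMARY_RE = re.compile(r"^\s*[*•]\s*Acc@1\s+([\d.]+)\s+Acc@5\s+([\d.]+)")


def epoch_accuracies(log_text):
    """Return a list of (top1, top5) tuples, one per summary line found."""
    results = []
    for line in log_text.splitlines():
        match = SUMMARY_RE.match(line)
        if match:
            results.append((float(match.group(1)), float(match.group(2))))
    return results


# Two lines copied from the log in this thread; only the second is a summary line.
sample_log = """\
Test: [20/23] Time 0.268 ( 0.359) Loss 1.3643e+05 (1.6559e+05) Acc@1 0.00 ( 35.71) Acc@5 100.00 (100.00)
 * Acc@1 32.967 Acc@5 100.000
"""

print(epoch_accuracies(sample_log))
```

With only three classes, Acc@5 is always 100, so Acc@1 is the number to watch.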

Hi @dusty_nv,
Thanks for the quick response. I tried for default 34 epochs as well. Facing the same issue.

After running
python train.py --model-dir=utensils ~/datasets/utensils/

I went ahead with the next command:
python onnx_export.py --model-dir=utensils
and then
imagenet-camera --model=utensils/resnet18.onnx --input_blob=input_0 --output_blob=output0 --lables=/home/crevavi/datasets/utensils/labels.txt --camera=/dev/video0 --width=640 --height=480

I keep getting a segmentation fault at every step.

%output_0 : Float(1, 3) = onnx::Softmaxaxis=1 # /home/crevavi/.local/lib/python3.6/site-packages/torch/nn/functional.py:1231:0
return (%output_0)

model exported to: utensils/resnet18.onnx
Segmentation fault (core dumped)
crevavi@crevavi-desktop:~/jetson-inference/python/training/classification$ imagenet-camera --model=utensils/resnet18.onnx --input_blob=input_0 --output_blob=output0 --lables=/home/crevavi/datasets/utensils/labels.txt --camera=/dev/video0 --width=640 --height=480
[gstreamer] initialized gstreamer, version 1.14.5.0
[gstreamer] gstCamera attempting to initialize with GST_SOURCE_NVARGUS, camera /dev/video0
[gstreamer] gstCamera pipeline string:
v4l2src device=/dev/video0 ! video/x-raw, width=(int)640, height=(int)480, format=YUY2 ! videoconvert ! video/x-raw, format=RGB ! videoconvert !appsink name=mysink
[gstreamer] gstCamera successfully initialized with GST_SOURCE_V4L2, camera /dev/video0

imagenet-camera: successfully initialized camera device
width: 640
height: 480
depth: 24 (bpp)

[TRT] imageNet -- failed to initialize.
imagenet-console: failed to initialize imageNet
crevavi@crevavi-desktop:~/jetson-inference/python/training/classification$

There is a typo in your command line - --lables should be --labels.
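Since the log above gives no hint about the misspelled flag, a small pre-check of the command line can help catch this kind of typo. The sketch below is only an illustration: the flag list is taken from the commands in this thread (the real tool may accept more), and `suspicious_flags` is a made-up helper, not part of jetson-inference.

```python
import difflib

# Flags seen in the commands in this thread; an assumption, not a complete list.
KNOWN_FLAGS = ["--model", "--input_blob", "--output_blob", "--labels",
               "--camera", "--width", "--height"]


def suspicious_flags(argv):
    """Return {flag: closest known flag or None} for flags not in KNOWN_FLAGS."""
    suggestions = {}
    for arg in argv:
        if not arg.startswith("--"):
            continue  # positional argument, e.g. an image path
        name = arg.split("=", 1)[0]
        if name not in KNOWN_FLAGS:
            close = difflib.get_close_matches(name, KNOWN_FLAGS, n=1)
            suggestions[name] = close[0] if close else None
    return suggestions


args = ["--model=utensils/resnet18.onnx", "--lables=labels.txt", "--width=640"]
print(suspicious_flags(args))
```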

Also, I recommend trying imagenet-console on a test image first, before moving on to the camera.

oops… my bad…
I tried

imagenet-console --model=utensils/resnet18.onnx --input_blob=input_0 --output_blob=output0 --labels=/home/crevavi/datasets/utensils/labels.txt 16062020-153825.jpg

I see
[TRT] binding -- index 1
-- name 'output_0'
-- type FP32
-- in/out OUTPUT
-- # dims 2
-- dim #0 1 (SPATIAL)
-- dim #1 3 (SPATIAL)
[TRT] binding to input 0 input_0 binding index: 0
[TRT] binding to input 0 input_0 dims (b=1 c=3 h=224 w=224) size=602112
[TRT] INVALID_ARGUMENT: Cannot find binding of given name: output0
[TRT] binding to output 0 output0 binding index: -1
[TRT] Parameter check failed at: engine.cpp::getBindingDimensions::1977, condition: bindIndex >= 0 && bindIndex < getNbBindings()
[TRT] binding to output 0 output0 dims (b=1 c=1 h=1 w=1) size=4
device GPU, utensils/resnet18.onnx initialized.
[TRT] utensils/resnet18.onnx loaded
imageNet -- loaded 3 class info entries
imageNet -- didn't load expected number of class descriptions (3 of 1)
imageNet -- failed to load synset class descriptions (3 / 3 of 1)
[TRT] imageNet -- failed to initialize.
imagenet-console: failed to initialize imageNet

I wonder if there is a memory allocation issue…

I think this is missing an underscore; it should be --output_blob=output_0 instead.
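A blob name that TensorRT cannot find shows up in the log as a binding index of -1, so a saved log can be scanned for that pattern. A minimal sketch, assuming the log lines have the same shape as the ones posted above; `unresolved_bindings` is a made-up helper name.

```python
import re

# Matches lines such as:
#   [TRT] binding to output 0 output0 binding index: -1
BINDING_RE = re.compile(
    r"binding to (input|output) \d+ (\S+)\s+binding index:\s*(-?\d+)")


def unresolved_bindings(log_text):
    """Return (kind, name) pairs whose reported binding index is negative."""
    missing = []
    for match in BINDING_RE.finditer(log_text):
        kind, name, index = match.group(1), match.group(2), int(match.group(3))
        if index < 0:
            missing.append((kind, name))
    return missing


# Lines copied from the log in this thread.
sample_log = """\
[TRT] binding to input 0 input_0 binding index: 0
[TRT] binding to input 0 input_0 dims (b=1 c=3 h=224 w=224) size=602112
[TRT] binding to output 0 output0 binding index: -1
"""

print(unresolved_bindings(sample_log))
```

Any name reported here can then be compared against the binding names the engine prints when it loads (here 'output_0').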

Oh yes, I made many typos… sorry for bothering you with that.
It looks like it is working now.
I will do more training and try for better accuracy.

Thanks a lot for super fast support !! I really appreciate it!

No problem, glad you got it working!

You can ignore the PyTorch crash for now; it only happens when PyTorch exits. I am trying to figure out why it happens.
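If train.py is launched from another script, the crash on exit can be told apart from a real failure by checking whether the model file was written before the process died. This is a rough sketch under two assumptions: the process is started with Python's `subprocess` module (which reports death by a signal as a negative return code on POSIX), and the best model is saved as `model_best.pth.tar`, as in the log above. `training_outcome` is a made-up helper name.

```python
import os
import signal
import tempfile


def training_outcome(returncode, model_dir):
    """Classify a finished train.py run from its exit status and output files."""
    best_model = os.path.join(model_dir, "model_best.pth.tar")
    saved = os.path.isfile(best_model)
    # subprocess reports "killed by signal N" as returncode == -N on POSIX.
    segfaulted = returncode == -signal.SIGSEGV
    if returncode == 0 and saved:
        return "ok"
    if segfaulted and saved:
        return "segfault after model was saved"
    return "failed"


# Demonstration with a temporary directory standing in for the model dir.
with tempfile.TemporaryDirectory() as model_dir:
    without_model = training_outcome(-signal.SIGSEGV, model_dir)
    open(os.path.join(model_dir, "model_best.pth.tar"), "wb").close()
    with_model = training_outcome(-signal.SIGSEGV, model_dir)

print(without_model)
print(with_model)
```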

I am also getting the segmentation fault error when running the sample classification training for Plant, Cat and Dog from NVIDIA's Hello AI World.

Epoch: [34][0/8] Time 0.470 ( 0.470) Data 0.356 ( 0.356) Loss 0.0000e+00 (0.0000e+00) Acc@1 100.00 (100.00) Acc@5 100.00 (100.00)
Epoch: [34] completed, elapsed time 1.620 seconds
Test: [0/2] Time 0.423 ( 0.423) Loss 0.0000e+00 (0.0000e+00) Acc@1 100.00 (100.00) Acc@5 100.00 (100.00)

 * Acc@1 100.000 Acc@5 100.000
saved checkpoint to: perrier/checkpoint.pth.tar
Segmentation fault (core dumped)

Hi kwok.paul,

Please open a new topic for your issue. Thanks