Dusty-nv jetson training custom data sets generating labels

Hey,

I am trying to train on my own dataset of thousands of images that I curated on my PC, using labelimg to label them.

I notice that when using the provided train_ssd.py it looks for specific label csv files. I am guessing they are generated when using the camera capture app or something but there does not seem to be a tool for generating these labels from a custom labelled data set! Am I missing something?

What label format does it use, and how would I convert the standard labelimg CSVs into the correct format for train_ssd.py?

No such file or directory: 'data\\training/sub-train-annotations-bbox.csv'

so the first line of my train_labels.csv looks like this:

filename,width,height,class,xmin,ymin,xmax,ymax

I downloaded the example fruit dataset from the tutorial and its label CSV files are odd: there are two sets for each dataset. This is the first line of sub-train-annotations-bbox.csv:

ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside,id,ClassName

and this is the first line of train-annotations-bbox.csv:

ImageID,Source,LabelName,Confidence,XMin,XMax,YMin,YMax,IsOccluded,IsTruncated,IsGroupOf,IsDepiction,IsInside

I have managed to train a custom model using the object detection API, but that is not the same as using dusty-nv’s tools, which are much more useful for running with TRT on the Jetson. I would like to train it using the correct tools so everything runs smoothly on deployment.

Thanks!

Hi @tom.parkerete4t, those CSV files are for the Google Open Images dataset, not the Pascal VOC format. Pascal VOC is the format used by custom detection datasets (which is why train_ssd.py --dataset-type=voc is used when training on custom datasets).

Although labelimg exports the annotation in XML format that is compatible with Pascal VOC, I believe labelimg doesn’t create the expected directory structure of VOC (which is one reason that I now recommend using CVAT instead of labelimg). However it should be fairly straightforward to use your annotations by organizing them in the correct directory structure. You can download the actual Pascal VOC dataset and see how it is organized:

/your_dataset
    /Annotations    (your XML files go under here)
    /ImageSets
       /Main
            test.txt
            train.txt
            trainval.txt
            val.txt
    /JPEGImages     (your images go under here)
    labels.txt
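The layout above could be created from a flat folder of labelimg XMLs and images with something like the sketch below. The function name build_voc_layout and the list of image extensions are made up for illustration and are not part of jetson-inference. It copies rather than moves, and it assumes the source folders are different from the destination folders.

```python
import shutil
from pathlib import Path

# Assumption: the images use one of these extensions. Whether the training
# loader accepts anything other than .jpg is not confirmed in this thread.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def build_voc_layout(xml_dir, image_dir, dataset_root):
    """Copy XMLs and images into a Pascal VOC style directory tree.

    Returns the sorted image IDs (file names without extension) that have
    both an annotation XML and an image.
    """
    root = Path(dataset_root)
    annotations = root / "Annotations"
    images = root / "JPEGImages"
    image_sets = root / "ImageSets" / "Main"
    for directory in (annotations, images, image_sets):
        directory.mkdir(parents=True, exist_ok=True)

    # Index the available images by their ID so each XML can be matched up.
    image_by_id = {
        path.stem: path
        for path in Path(image_dir).iterdir()
        if path.suffix.lower() in IMAGE_EXTENSIONS
    }

    image_ids = []
    for xml_path in sorted(Path(xml_dir).glob("*.xml")):
        image_path = image_by_id.get(xml_path.stem)
        if image_path is None:
            continue  # annotation without a matching image: skip it
        shutil.copy2(xml_path, annotations / xml_path.name)
        shutil.copy2(image_path, images / image_path.name)
        image_ids.append(xml_path.stem)
    return image_ids
```

Calling build_voc_layout("my_xmls", "my_images", "your_dataset") would leave the ImageSets/Main text files and labels.txt still to be written.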

Thank you so much!! Awesome! I totally missed the dataset type voc argument… :D

Sorry, but is there any way I can invoke the detectnet code to generate these annotations from my current XML files? It’s just that I manually labelled 25k images, which took me nearly a whole month, and I have spent two days searching for how to translate this over to the right setup to retrain on your detectnet. I have done over 30 hours of training using the detection API, only to find I am unable to convert that frozen model to UFF on the Jetson (which I am also trying to solve as well as this, we’ll see which I solve first! nightmare…), so this is why I now need to retrain using your system.

Just trying to find a way around it. I did come across this and was hoping I could open my labelimg XMLs in the app and export the right structure, but I cannot get it to run at all, so I am stuck. It is in Java and I’m a junior Python dev hehe: https://github.com/szaza/dataset-generator. It’s the only one I’ve found that looks like I could open the labelimg XMLs and just export anew… The search continues but I am running out of time.

Any advice appreciated.

The most annoying thing is I have trained a really good detection system that works amazingly on my PC, so I know it’s good and the work paid off. Getting it to run on the Jetson is crushing me lol

I believe the labelimg XML files are in the same format as Pascal VOC, you just need to organize them into the expected directory structure outlined above. You can download the Pascal VOC and check it out to make sure they match.

Yeah, the XML files are usable, however test.txt, train.txt, trainval.txt, val.txt and labels.txt are not generated and I don’t know what info is used for them.

inside the train.txt files:

2008_000008
2008_000015

inside trainval.txt (what are the numbers by its side for, 1 or -1??)

2008_000002 -1
2008_000003  1
2008_000007 -1

There is no labels.txt in the VOC dataset I downloaded from the Pascal website, nor in the dusty-nv GitHub, so I don’t know what the contents should look like.

I can write a quick script to generate the txt files with the filenames in the directory, but what are the 1 or -1 next to the filenames for?

And why are there both trainval.txt and val.txt when there is only one validation image set?

nearly there :) thanks for the help

Hi @tom.parkerete4t, the labels.txt file should contain one class name per line, like so:

car
bike
person

These class names should match the object classes referenced in your annotation XML files.
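If the class names are already in the annotation XMLs, labels.txt could be generated from them rather than typed by hand. The sketch below assumes labelimg-style Pascal VOC XML, where each object element has a name child. The helper names are made up for illustration. Classes are written in alphabetical order, which is an arbitrary choice.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def collect_class_names(annotations_dir):
    """Return the sorted set of class names used in the annotation XMLs."""
    names = set()
    for xml_path in Path(annotations_dir).glob("*.xml"):
        root = ET.parse(xml_path).getroot()
        for obj in root.iter("object"):
            name = obj.findtext("name")
            if name and name.strip():
                names.add(name.strip())
    return sorted(names)


def write_labels_file(annotations_dir, labels_path):
    """Write one class name per line to labels_path and return the names."""
    names = collect_class_names(annotations_dir)
    Path(labels_path).write_text("\n".join(names) + "\n", encoding="utf-8")
    return names
```

For example, write_labels_file("your_dataset/Annotations", "your_dataset/labels.txt"). Misspelled class names in the XMLs would show up as extra lines, which makes this a cheap sanity check too.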

For the ImageSet text files, it is one image ID per line (not including the image extension). You don’t need the 1 or -1 at the end of the lines.
Ideally it would be a random split with 80% train, 10% test, 10% val (or 70/15/15 split). However for a quick test you can just duplicate them and have all the images in train/test/val.

trainval is simply the combination of train and val. However if you are doing it the quick way and duplicating the ImageSet files, don’t have two copies of all the image IDs in trainval.txt - just have it be the same as train.txt, etc.
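The random 80/10/10 split described above could be sketched like this. The function names, the fixed seed and the default fractions are illustrative choices, not something train_ssd.py requires. The image IDs are file names without the extension, and trainval is written as train plus val.

```python
import random
from pathlib import Path


def split_image_ids(image_ids, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the IDs reproducibly and split them into train/val/test."""
    ids = sorted(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # whatever is left over
    return {"train": train, "val": val, "test": test, "trainval": train + val}


def write_image_sets(dataset_root, splits):
    """Write one image ID per line to ImageSets/Main/<split>.txt."""
    main_dir = Path(dataset_root) / "ImageSets" / "Main"
    main_dir.mkdir(parents=True, exist_ok=True)
    for name, ids in splits.items():
        text = "".join(f"{image_id}\n" for image_id in ids)
        (main_dir / f"{name}.txt").write_text(text, encoding="utf-8")
```

The IDs can be taken from the XML file names, for example [p.stem for p in Path("your_dataset/Annotations").glob("*.xml")].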

BTW if your dataset is big you can run the pytorch-ssd code on a PC/server with a discrete GPU, you just need to have PyTorch and the dependencies installed there. You can export the model to ONNX on the PC - ONNX isn’t device specific. What needs to be done on-device is building the TensorRT engine from the ONNX model.

Thank you for clearing this up! I spent all morning getting tf2onnx working on the Jetson only to find that didn’t work with my API-trained model either lol!

I will sort out these labels and just re-train on your detectnet hopefully I will be able to get this deployed!!

Thanks again I have been pulling my hair out trying to solve this.

Oh, one question: when you say train on a PC with a discrete GPU, what exactly do you mean by discrete? Is that an argument? I don’t see it as an argument in train_ssd.py.

I already have your repo running set up on my pc in pycharm with same environment libraries as my jetson.

OK great, so I got it running, however I think it does not like the .pth file I feed it. I am using mobilenet v2 ssd lite from https://storage.googleapis.com/models-hao/mb2-ssd-lite-mp-0_686.pth

and this is the output:

2021-04-16 15:45:12 - Init from pretrained ssd models/mb2-ssd-lite-mp-0_686.pth
2021-04-16 15:45:12 - Took 0.04 seconds to load the model.
2021-04-16 15:45:12 - Learning rate: 0.01, Base net learning rate: 0.001, Extra Layers learning rate: 0.01.
2021-04-16 15:45:12 - Uses CosineAnnealingLR scheduler.
2021-04-16 15:45:12 - Start training from epoch 0.
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
Traceback (most recent call last):
  File "train_ssd.py", line 344, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 114, in train
    for i, data in enumerate(loader):
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
    return self._get_iterator()
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
    w.start()
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
    reduction.dump(process_obj, to_child)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'TrainAugmentation.__init__.<locals>.<lambda>'

(jetson_dev) F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "c:\users\muayt\appdata\local\programs\python\python37\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

So close! hehe

I just mean that you have an NVIDIA GPU in your PC. A discrete GPU is a PCIe card that you have plugged into the PC (whereas Jetson has an integrated GPU). In theory you could train on your PC with just CPU, but it may take longer.

oh ok yes I have a quadro P220 in the pc

Use the mobilenet-v1-ssd-mp-0_675.pth from the docs. The last I checked, the ONNX export was not working with mb2-ssd-lite either. MB1 and MB2 get similar performance on Jetson. The main thing with MB2 is separable convolutions for resource-limited mobile platforms (think phones), and the Jetson’s GPU has no problem with MB1.

I will do that as I don’t want to waste any more time than I have to thanks!

For the first time at least, I would follow how it is done in the tutorial (with ssd-mobilenet-v1), as I haven’t verified the other models with mb2 and VGG backbones are working later on in the pipeline (during the ONNX export and TensorRT import). That pytorch-ssd code was forked and I use it mainly for ssd-mobilenet-v1 because that seems to work well.

After you train your model, you can use the eval_ssd.py script from pytorch-ssd repo to test your output .pth checkpoint on a test image. This will confirm that the PyTorch model itself is good. Then you can convert it to ONNX.

Thanks a lot Dusty. I have tried with the model pth from the docs and I am still getting this output:

2021-04-16 16:03:22 - Start training from epoch 0.
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\optim\lr_scheduler.py:136: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\nn\_reduction.py:44: UserWarning: size_average and reduce args will be deprecated, please use reduction='sum' instead.
  warnings.warn(warning.format(ret))
2021-04-16 16:03:40 - Epoch: 0, Step: 10/147, Avg Loss: 8.6335, Avg Regression Loss 3.5030, Avg Classification Loss: 5.1305
Traceback (most recent call last):
  File "train_ssd.py", line 344, in <module>
    device=DEVICE, debug_steps=args.debug_steps, epoch=epoch)
  File "train_ssd.py", line 114, in train
    for i, data in enumerate(loader):
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 435, in __next__
    data = self._next_data()
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\_utils\fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\torch\utils\data\dataset.py", line 218, in __getitem__
    return self.datasets[dataset_idx][sample_idx]
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\datasets\voc_dataset.py", line 69, in __getitem__
    image, boxes, labels = self.transform(image, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\ssd\data_preprocessing.py", line 34, in __call__
    return self.augment(img, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\transforms\transforms.py", line 55, in __call__
    img, boxes, labels = t(img, boxes, labels)
  File "F:\Work\Highlight\TF\jetson_dev\jetson-inference-master\python\training\detection\vision\transforms\transforms.py", line 280, in __call__
    if overlap.min() < min_iou and max_iou < overlap.max():
  File "C:\Users\muayt\.virtualenvs\jetson_dev-V1CBiXeI\lib\site-packages\numpy\core\_methods.py", line 34, in _amin
    return umr_minimum(a, axis, None, out, keepdims, initial, where)
ValueError: zero-size array to reduction operation minimum which has no identity

At least it’s working to an extent :p

It might be that there is an annotation XML in your dataset that is malformed/corrupted, has no bounding boxes, or names a class that is not in labels.txt.

What I recommend is uncommenting this line of code (this will print out image IDs and bounding box info as each image is loaded):

https://github.com/dusty-nv/pytorch-ssd/blob/e7b5af50a157c50d3bab8f55089ce57c2c812f37/vision/datasets/voc_dataset.py#L76

And then run train_ssd.py with --batch-size=1 --workers=1 --debug-steps=1

Then when the exception happens, look above to see the most recent image ID that was printed out.
Then inspect that image’s XML file to see what is different about it (or just remove it from your ImageSet lists)
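As a complement to the debug-print approach, the annotations could also be scanned offline for the problems listed above. This is only a sketch: the function name is hypothetical and it assumes labelimg-style Pascal VOC XML. It also flags files where every object is marked difficult, on the assumption that a loader which skips difficult objects would see such a file as having no boxes, which matches what is reported later in this thread.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_problem_annotations(annotations_dir, class_names):
    """Map image ID -> list of issues found in its annotation XML."""
    known = set(class_names)
    problems = {}
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError as err:
            problems[xml_path.stem] = [f"malformed XML: {err}"]
            continue

        issues = []
        objects = list(root.iter("object"))
        if not objects:
            issues.append("no bounding boxes")
        for obj in objects:
            name = (obj.findtext("name") or "").strip()
            if name not in known:
                issues.append(f"unknown class: {name!r}")
            if obj.find("bndbox") is None:
                issues.append(f"object {name!r} has no bndbox")
        # Assumption: difficult objects may be dropped by the loader.
        if objects and all(
            (obj.findtext("difficult") or "0").strip() == "1" for obj in objects
        ):
            issues.append("every object is marked difficult")
        if issues:
            problems[xml_path.stem] = issues
    return problems
```

The class_names argument would be the lines of labels.txt. Any image ID that comes back could be inspected by hand or left out of the ImageSet lists.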

I had a look and the image exists, as does its corresponding XML file. I searched for the error on Google and found a GitHub conversation that I now cannot find again!! Typical. The person mentioned that if they have one label and it is set to difficult, they can recreate the error every time. They said you need to set keep_difficult to true. I don’t know where this needs to be set, and I cannot find the convo again to ask!

I checked the XML file and lo and behold, it had difficult set in the XML. Do you know where I need to turn this keep_difficult setting on in the code?

Thanks :)

EDIT: I set difficult to 0 in the XML and it processed that file fine until the next one with difficult. Does the difficult setting really have that big an impact on the training, do you think? If not, I can write a quick script to simply go through all the XMLs and change it to 0. But I don’t want to do that if it will really impact the training of hard-to-discern labels.
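For reference, the quick script mentioned here might look roughly like the sketch below. The function name is made up, and it rewrites the XML files in place, so it would be wise to run it on a copy of the Annotations folder. Note that the thread ends up keeping the flags and changing the loader setting instead, so this is only one of the two options.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def clear_difficult_flags(annotations_dir):
    """Set every <difficult> element to 0; return how many files changed."""
    changed_files = 0
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        tree = ET.parse(xml_path)
        changed = False
        for flag in tree.getroot().iter("difficult"):
            if (flag.text or "").strip() != "0":
                flag.text = "0"
                changed = True
        if changed:
            # Overwrites the original file: keep a backup of the folder.
            tree.write(xml_path, encoding="utf-8")
            changed_files += 1
    return changed_files
```

The return value gives a rough idea of how many images carried the flag, which helps judge how much the change affects the dataset.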

EDIT: found your answer to this here: https://forums.developer.nvidia.com/t/successful-training-with-train-ssd-py-using-small-custom-data-set-but-error-on-full-data-set/156921/6

It now trains, thank you so much for your help! Is there any info generated by this trainer that can be loaded into tensorboard, or would I have to code that into it myself?

Once again, thank you!!

Edit: FYI for future reference, I had to set num workers to 0 for it to run. Otherwise the multiprocessing error would happen.
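Some background on why this helps: on Windows, DataLoader worker processes are started with the spawn method, which pickles the dataset and its transforms to send them to each worker. A lambda defined inside a function cannot be pickled, which is what the "Can't pickle local object" message in the first traceback says. With zero workers, loading happens in the main process and nothing is pickled. The snippet below reproduces the pickling failure on its own with made-up function names. It is not the actual TrainAugmentation code.

```python
import pickle


def scale_to_unit(value):
    """Module-level function: pickle can refer to it by name."""
    return value / 255.0


def make_local_transform():
    """Return a lambda defined inside a function, like the failing case."""
    return lambda value: value / 255.0


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


if __name__ == "__main__":
    print("module-level function:", can_pickle(scale_to_unit))
    print("local lambda:", can_pickle(make_local_transform()))
```

Replacing such lambdas with module-level functions is the usual way to make multiple workers usable on Windows, but setting the worker count to 0 is the smaller change.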

OK gotcha, glad you were able to get it working by setting keep_difficult=True. Regarding tensorboard, you would need to install/integrate tensorboard yourself (presumably using torch.utils.tensorboard). The training metrics that you would send to tensorboard would probably be the same ones that get printed out to the console during training (the average loss, regression loss and classification loss).
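One low-effort way to try this without touching train_ssd.py would be to save the console output to a file and replay it into tensorboard afterwards. The sketch below assumes log lines shaped like the "Epoch: 0, Step: 10/147, Avg Loss: ..." line shown earlier in this thread. The function names, the tag names and the runs/train_ssd directory are made up for illustration. SummaryWriter and add_scalar come from torch.utils.tensorboard, which needs the tensorboard package installed.

```python
import re

# Assumption: matches the console format seen in this thread's output.
LOG_PATTERN = re.compile(
    r"Epoch: (?P<epoch>\d+), Step: (?P<step>\d+)/(?P<steps_per_epoch>\d+), "
    r"Avg Loss: (?P<loss>[\d.]+), "
    r"Avg Regression Loss:? (?P<regression_loss>[\d.]+), "
    r"Avg Classification Loss: (?P<classification_loss>[\d.]+)"
)


def parse_training_line(line):
    """Extract the metrics from one console line, or None if no match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    epoch = int(match["epoch"])
    step = int(match["step"])
    steps_per_epoch = int(match["steps_per_epoch"])
    return {
        "global_step": epoch * steps_per_epoch + step,
        "loss": float(match["loss"]),
        "regression_loss": float(match["regression_loss"]),
        "classification_loss": float(match["classification_loss"]),
    }


def write_log_to_tensorboard(log_path, run_dir="runs/train_ssd"):
    """Replay a saved console log into tensorboard scalars."""
    # Imported here so the parser above works without torch installed.
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter(log_dir=run_dir)
    with open(log_path, encoding="utf-8") as log_file:
        for line in log_file:
            metrics = parse_training_line(line)
            if metrics is None:
                continue
            step = metrics.pop("global_step")
            for name, value in metrics.items():
                writer.add_scalar(f"train/{name}", value, step)
    writer.close()
```

Calling writer.add_scalar directly inside the training loop, at the point where those values are printed, would give live curves instead. Either way the results can be viewed with tensorboard --logdir runs.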

top notch! thank you!