Deep Learning for Object Detection with DIGITS

Originally published at:

Figure 1: A screenshot of DIGITS 4 showing the input image (top) and the final result with bounding boxes around detected vehicles (bottom). Today we’re excited to announce the availability of NVIDIA DIGITS 4. DIGITS 4 introduces a new object detection workflow and DetectNet, a new deep neural network for object detection that enables data scientists…

How can we create label descriptions for object detection?

Is it possible to train Faster R-CNN or single shot detector models now or in the near future?

Good post!! The goal of an object detection system is to detect all instances of objects of a known category in an image. Figure 1 shows the final results of an object detection system trained with DIGITS which can detect vehicles on a construction site. Starting with a successful vehicle detection system like this, you can solve a number of other problems such as recognizing the makes and models of the vehicles, counting and tracking vehicle locations over time, generating natural language descriptions of the images and so on.


Indeed an interesting tool with a simple interface. I wonder how easy it is to change the network architecture?

Changing the GoogLeNet FCN portion of DetectNet is relatively straightforward. You just have to make sure that the input and output data shapes match those of whatever FCN you replace it with.

In addition to object detection networks, DIGITS 4 also supports image segmentation networks and image classification networks. Network architectures can be modified using the Caffe prototxt format.

It is not currently possible to train Faster R-CNN through DIGITS. DetectNet is a single shot detector. It should also be possible to implement other single shot detectors such as YOLO, but there is currently no example of this.

There are a variety of tools available for creating bounding box annotations on images that can be converted into the KITTI format that DetectNet requires. One example is the open-source tool Sloth:
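Whatever annotation tool you use, the conversion step ends with one text line per bounding box. Here is a minimal sketch of that step, assuming the standard 15-field KITTI object label layout and assuming that the fields not needed for 2D detection can be left at zero. The function name `kitti_label_line` is hypothetical, not part of DIGITS or Sloth.

```python
def kitti_label_line(object_class, left, top, right, bottom):
    """Format one 2D bounding box as a 15-field KITTI label line.

    Assumption: fields unused for 2D detection are written as zeros.
    """
    if len(object_class.split()) != 1:
        raise ValueError("class name must be a single token without spaces")
    if right <= left or bottom <= top:
        raise ValueError("box must have positive width and height")
    fields = [
        object_class,                     # type
        "0.0",                            # truncated (float)
        "0",                              # occluded (integer)
        "0.0",                            # alpha
        f"{left:.2f}", f"{top:.2f}",      # bbox left, top
        f"{right:.2f}", f"{bottom:.2f}",  # bbox right, bottom
        "0.0", "0.0", "0.0",              # 3D dimensions (height, width, length)
        "0.0", "0.0", "0.0",              # 3D location (x, y, z)
        "0.0",                            # rotation_y
    ]
    return " ".join(fields)


if __name__ == "__main__":
    print(kitti_label_line("Car", 10, 20, 110, 130))
```

Note that the third field is written as an integer; the bounding box coordinates only start at the fifth field.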

Do we have support for detecting multiple object classes (e.g. pedestrians and cars in the same image) in DIGITS 4?
If not, is it planned for the future?

You can, and here is a PR with some information about it -

"Unfortunately, this dataset is not shareable" - maybe the net weights are shareable? The data look the same as in my problem (cars viewed from above), and I would love to try it out on my dataset :)

Try "ALT" from but no guarantee it's the easiest way. Better to consider it "easier".

I'm pretty sure that recall is TP / (TP + FN), not TP / (TP + TN).

Good catch. Fixed it, thanks.
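For anyone checking their own numbers against the corrected definition, a small sketch of both metrics from raw counts. The function name and the choice to return 0.0 when a denominator is zero are assumptions for illustration, not something DIGITS defines.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from detection counts.

    tp: true positives, fp: false positives, fn: false negatives.
    True negatives do not appear in either formula.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # 8 correct detections, 2 spurious detections, 4 missed objects
    p, r = precision_recall(tp=8, fp=2, fn=4)
    print(f"precision={p:.3f} recall={r:.3f}")
```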

Can we use a Torch model instead of Caffe in the custom network option? Can we convert a Caffe prototxt to Torch Lua?

I have very large images, about 5000x4000, with training bounding boxes that are mostly about 110x110. There are more than 500 of these images. Is there any advice on dealing with this much data, or an estimate of how long it could take to train? I am using a Tesla K40. Any idea what batch size I will likely have to use, or even a reference on how to determine the batch size?

Hi Leonard, the input image size is limited by GPU memory. I am also training on large images, and for DetectNet with 12 GB of GPU RAM, 4-4.5 megapixel RGB images are the maximum. That means something like 2000x2000 or 4000x1000.
Have you checked the tools at ? You may find some useful stuff there.
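One way to get a 5000x4000 image under a limit like that is to cut it into overlapping tiles and carry the bounding boxes along. This is only a sketch of that idea, not something DIGITS does for you; the function names, the 2000-pixel tile size and the 200-pixel overlap are assumptions. The overlap should be at least as large as the biggest object (about 110 pixels here) so that every box lies fully inside at least one tile.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles of size `tile` covering [0, length)."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than the tile size")
    if tile >= length:
        return [0]
    step = tile - overlap
    origins = list(range(0, length - tile, step))
    origins.append(length - tile)  # last tile sits flush with the edge
    return origins


def tile_boxes(boxes, x0, y0, tile_w, tile_h):
    """Keep boxes fully inside the tile, shifted to tile coordinates.

    Each box is (left, top, right, bottom) in full-image pixels.
    """
    kept = []
    for left, top, right, bottom in boxes:
        inside = (left >= x0 and top >= y0 and
                  right <= x0 + tile_w and bottom <= y0 + tile_h)
        if inside:
            kept.append((left - x0, top - y0, right - x0, bottom - y0))
    return kept


if __name__ == "__main__":
    xs = tile_origins(5000, 2000, 200)
    ys = tile_origins(4000, 2000, 200)
    print(len(xs) * len(ys), "tiles of", 2000 * 2000 / 1e6, "megapixels each")
```

Each tile is then saved as its own image with its own label file. Boxes that straddle a tile border are dropped from that tile and picked up by the neighbouring tile that contains them completely.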

Thank you, Baker and Prasanna, for the step-by-step guidance.
I am a newbie here, and I followed all your steps to create a DB using KITTI vision with 56 sets for training and 13 for validation.

I got an error when creating the DB, and I really don't know what the error message means:

2017-05-11 09:39:43 [ERROR] ValueError: invalid literal for int() with base 10: '116.41870117188'
Traceback (most recent call last):
File "/home/yasirac/digits/digits/tools/", line 478, in <module>
File "/home/yasirac/digits/digits/tools/", line 443, in create_generic_db
File "/home/yasirac/digits/digits/tools/", line 296, in create_db
entry_ids = extension.itemize_entries(stage)
File "/home/yasirac/digits/digits/extensions/data/objectDetection/", line 183, in itemize_entries
File "/home/yasirac/digits/digits/extensions/data/objectDetection/", line 208, in load_ground_truth
File "/home/yasirac/digits/digits/extensions/data/objectDetection/", line 193, in load_gt_obj
gt.occlusion = int(row[2])
ValueError: invalid literal for int() with base 10: '116.41870117188'
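The last frame of the traceback shows the parser calling `int(row[2])` on the value '116.41870117188', so the third field of a label line holds a float where an integer occlusion value is expected. A likely cause (an assumption, since the label file is not shown) is that the columns are shifted, for example because the truncated and occluded fields were left out. A minimal sketch for checking label lines before building the DB, assuming the standard 15-field KITTI layout; `check_kitti_line` is a hypothetical helper, not part of DIGITS.

```python
def check_kitti_line(line):
    """Return a list of problems found in one KITTI label line."""
    row = line.split()
    problems = []
    if len(row) != 15:
        problems.append(f"expected 15 fields, found {len(row)}")
    # Field 3 (index 2) is the occlusion state and must parse as an integer.
    if len(row) > 2:
        try:
            int(row[2])
        except ValueError:
            problems.append(f"field 3 (occluded) must be an integer, found {row[2]!r}")
    # All other fields after the class name must parse as floats.
    for index, value in enumerate(row):
        if index in (0, 2):
            continue
        try:
            float(value)
        except ValueError:
            problems.append(f"field {index + 1} must be a number, found {value!r}")
    return problems


if __name__ == "__main__":
    for problem in check_kitti_line("Car 10.5 116.41870117188 200.0 300.0"):
        print(problem)
```

Running this over every line of every label file should point at the files that need their columns fixed.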

I want to know the performance impact in object detection if the image resolution is high.

Classification + localization = object detection. Am I right?