Using Stanford Drone Dataset for TLT Training

Hi,
In this webinar slide I found that, for TLT transfer learning, each frame of the Stanford Drone Dataset should be resized to 768x768, whereas the KITTI dataset contains images at a resolution of 1392x512. Do I need to explicitly resize the frames to 768x768 before training, or does TLT convert them automatically?

What's more, I have found a GitHub repo that converts the Stanford Drone Dataset videos to frames and also converts the annotations to KITTI format. However, there are some frame/annotation count anomalies. The first anomaly is an off-by-one annotation count:

images => bookstore/video0 => 13334 frames --- annotations => 13335 annotations
images => bookstore/video2 => 14557 frames --- annotations => 14558 annotations
images => bookstore/video3 => 14557 frames --- annotations => 14558 annotations

The second anomaly is a much larger mismatch (the count comparison I used is sketched below):

images => nexus/video5 => 1061 frames --- annotations => 562 annotations
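
For reference, here is a minimal sketch of how such counts can be compared; the paths are hypothetical, and it assumes the usual SDD annotation layout where column 5 of each line is the frame index:

```python
import os

def compare_counts(frames_dir, annotation_file):
    """Compare the extracted frame count with the number of distinct annotated frames."""
    n_frames = len([f for f in os.listdir(frames_dir) if f.endswith(".jpg")])
    annotated_frames = set()
    with open(annotation_file) as f:
        for line in f:
            # SDD annotation layout (assumed): column 5 is the frame index
            annotated_frames.add(int(line.split()[5]))
    print(f"{frames_dir}: {n_frames} frames, {len(annotated_frames)} annotated frames")

# Hypothetical paths:
# compare_counts("bookstore/video0", "annotations/bookstore/video0/annotations.txt")
```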

Is there an official dataset-format converter script for this specific case?
Thanks in advance :)

  1. See https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#requirements
    It is not mandatory to resize to 768x768, but the images must meet the following requirements (a quick validation sketch is shown after this list).

    If training with detectnet_v2:

W >= 480, H >= 272, and W, H are multiples of 16

If training with FasterRCNN:

W >= 160, H >= 160

For any other detection network:

W >= 128, H >= 128, and W, H are multiples of 32
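
As an illustration, here is a minimal, unofficial Python sketch (not a TLT tool; the network keys and image path are assumptions) that checks whether an image satisfies the constraint for a given network:

```python
from PIL import Image

# Minimum (W, H) and required multiple per network type,
# mirroring the requirements listed above.
REQUIREMENTS = {
    "detectnet_v2": {"min_w": 480, "min_h": 272, "multiple": 16},
    "faster_rcnn":  {"min_w": 160, "min_h": 160, "multiple": 1},
    "other":        {"min_w": 128, "min_h": 128, "multiple": 32},
}

def check_resolution(image_path, network="detectnet_v2"):
    """Return (ok, (w, h)): whether the image meets the input-size requirement."""
    req = REQUIREMENTS[network]
    with Image.open(image_path) as img:
        w, h = img.size
    ok = (w >= req["min_w"] and h >= req["min_h"]
          and w % req["multiple"] == 0 and h % req["multiple"] == 0)
    return ok, (w, h)

# Hypothetical usage:
# ok, size = check_resolution("bookstore/video0/000000.jpg", "detectnet_v2")
# print(ok, size)
```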

  2. You need to resize your images and labels offline; see the resizing sketch after the note below.

Note: The tlt-train tool does not support training on images of multiple resolutions, or resizing images during training. All of the images must be resized offline to the final training size and the corresponding bounding boxes must be scaled accordingly.
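
As an illustration, here is a minimal sketch that resizes one image to a target size and scales its bounding boxes accordingly. It assumes one KITTI-format .txt label file per image (bbox coordinates in columns 4-7), and the 768x768 target is just the example size from the webinar slide:

```python
from PIL import Image

TARGET_W, TARGET_H = 768, 768  # example target size from the webinar slide

def resize_image_and_labels(img_path, label_path, out_img_path, out_label_path):
    """Resize one image and scale the KITTI bbox fields (columns 4-7) to match."""
    with Image.open(img_path) as img:
        orig_w, orig_h = img.size
        sx, sy = TARGET_W / orig_w, TARGET_H / orig_h
        img.resize((TARGET_W, TARGET_H), Image.BILINEAR).save(out_img_path)

    out_lines = []
    with open(label_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue  # skip malformed lines
            # KITTI bbox: xmin, ymin, xmax, ymax at indices 4..7
            fields[4] = f"{float(fields[4]) * sx:.2f}"
            fields[5] = f"{float(fields[5]) * sy:.2f}"
            fields[6] = f"{float(fields[6]) * sx:.2f}"
            fields[7] = f"{float(fields[7]) * sy:.2f}"
            out_lines.append(" ".join(fields))
    with open(out_label_path, "w") as f:
        f.write("\n".join(out_lines) + "\n")
```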

  3. There is no official script to generate KITTI-format labels for this dataset. You can analyze the output of the GitHub repo you mentioned and write your own converter; a rough sketch follows.
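
For example, here is a rough, unofficial sketch of such a conversion. It assumes the standard Stanford Drone Dataset annotation columns (track_id, xmin, ymin, xmax, ymax, frame, lost, occluded, generated, "label") and writes one KITTI label file per frame; verify the column order against your copy of the dataset before relying on it:

```python
import os
from collections import defaultdict

def sdd_to_kitti(sdd_annotation_file, out_dir):
    """Convert one SDD annotations.txt into per-frame KITTI label files.

    Only the class name and 2D bbox are meaningful for detection; the
    remaining KITTI fields (alpha, 3D dims/location, rotation) are zeroed.
    """
    per_frame = defaultdict(list)
    with open(sdd_annotation_file) as f:
        for line in f:
            parts = line.split()
            xmin, ymin, xmax, ymax = parts[1:5]
            frame = int(parts[5])
            if parts[6] == "1":
                continue  # "lost" flag: object is outside the view in this frame
            label = parts[9].strip('"').lower()
            per_frame[frame].append(
                f"{label} 0.0 0 0.0 {xmin} {ymin} {xmax} {ymax} "
                f"0.0 0.0 0.0 0.0 0.0 0.0 0.0"
            )
    os.makedirs(out_dir, exist_ok=True)
    for frame, lines in per_frame.items():
        with open(os.path.join(out_dir, f"{frame:06d}.txt"), "w") as f:
            f.write("\n".join(lines) + "\n")
```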