Any plans to add a Siamese network to TLT?

I am looking for a Siamese-network-based embedding model in TLT. Are there any plans to provide such a model in upcoming versions? Or is there a model already available in TLT or in a similar tool?

Thanks for the idea. I will sync with the internal team about your request.

@Morganh you are awesome.

May I know more about your use case? More details would be appreciated.

Sure @Morganh. First of all, thank you very much for your response.

My use case is fairly general. I am working on a vision solution for an industrial environment where I need to differentiate customized packs from one another. The number of packages that I need to recognise or differentiate will be 1500+, so I can’t go with a regular object detection model.

I strongly believe Siamese models are well suited to this problem. I feel that adding these models to TLT would give developers a good head start on such problems.

Thanks for the info.

Did you refer to any paper on Siamese networks that discusses a use case similar to yours? If yes, please share the paper link.

I wasn’t able to find many papers that suit my requirement. Here are a few resources that I have referred to in this work.

From my experience, the end-to-end pipeline for such a project involves multiple models. For your reference, I am sharing the pipeline here as well. Addressing this pipeline end to end would help many developers implement such products quickly.

Object Detector → Feature Points for Object Alignment → Embeddings from Siamese Network → Similarity Matching.

Thanks for your info.


I see your requirement and use case. Could you please help to check whether our understanding is correct?

  1. Run a detection model that detects only 1 class (assume you have 1000 customized packs to
    classify, so treat all 1000 objects as 1 class) →
  2. Run a Siamese network, where:
    • you have a support set of images covering the 1000 classes;
    • for every object crop inside a bounding box from the detection model, you run the Siamese network 1000 times to classify it;
    • so if one image has 5 objects, the Siamese network runs 5*1000 inferences to classify the 5 objects;
    • in this case there is a heavy inference burden and a lot of time needed. Is your application not real time?
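
To make the cost in this reading concrete, here is a small sketch of the inference count. The function names are hypothetical and only illustrate the arithmetic; the second function shows, for comparison, the count if the support-set embeddings were computed once ahead of time rather than per image.

```python
def pairwise_inferences(num_crops: int, num_support_classes: int) -> int:
    """Forward passes per image when every crop is compared with every support class."""
    return num_crops * num_support_classes


def cached_embedding_inferences(num_crops: int) -> int:
    """Forward passes per image when support embeddings are computed once offline.

    Assumption: matching against the stored embeddings is a cheap vector
    comparison and is not counted as a network inference.
    """
    return num_crops


crops, classes = 5, 1000
pairwise = pairwise_inferences(crops, classes)
cached = cached_embedding_inferences(crops)
print(f"pairwise comparison: {pairwise} inferences per image")
print(f"cached support embeddings: {cached} inferences per image")
```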


Hi @Morganh

Thanks for the response.

  1. Your understanding is correct. As you mentioned, we are planning to use a single-class detection model to detect the objects in the image and then use a Siamese network to get embeddings for classification.

  2. With 1000 different support classes, I run the Siamese model separately on the support class images and save their embeddings as a dictionary. Then:

    • When I get an image with 5 objects, the detection model will ideally give 5 detections. I then feed those 5 crops to the Siamese model to get 5 embeddings. At this stage we need to run inference 5 times for that image.

    • I then take each embedding and classify it against the precalculated embeddings.

    • My application needs to run in real-time.
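
The matching step described above could look roughly like the sketch below. It is not TLT code: the embeddings are toy 3-dimensional vectors, and the names `build_gallery`, `classify`, and the `min_similarity` threshold are assumptions made for illustration. Cosine similarity is also an assumed choice of metric; the thread does not say which one is used.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def build_gallery(support_embeddings):
    """Normalize the precalculated support embeddings once (dict: class name -> vector)."""
    return {name: l2_normalize(vec) for name, vec in support_embeddings.items()}


def classify(query_embedding, gallery, min_similarity=0.5):
    """Return (best class, score); the class is None if nothing is similar enough."""
    query = l2_normalize(query_embedding)
    best_name, best_score = None, -1.0
    for name, ref in gallery.items():
        score = sum(q * r for q, r in zip(query, ref))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_similarity:
        return None, best_score
    return best_name, best_score


# Toy data standing in for embeddings produced by the Siamese model.
gallery = build_gallery({"pack_a": [1.0, 0.0, 0.0], "pack_b": [0.0, 1.0, 0.0]})
crop_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1]]
labels = [classify(emb, gallery)[0] for emb in crop_embeddings]
print(labels)
```

With 1000+ classes, the per-crop loop is only a vector comparison against stored embeddings, so the network itself still runs once per crop.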

If you have any other questions, please let me know. I would be glad to work with you on this.


Thanks for the info. This feature request will be considered further but is not expected to happen soon.

Thanks for the consideration, Morgan. If it happens, please let me know. Looking forward to hearing from you.