DeepStream + VLM

Has anyone tried feeding multiple camera inputs from DeepStream to a VLM? Is this possible somehow? I mean, if I have an image batch from different camera angles, can I analyze the whole batch for more context? I tried this with ChatGPT and it works. Any ideas are welcome; I'm curious about this.

@raresracoar I haven't tried this particular scenario, but VILA-1.5 handles multiple images, so you could try feeding it the different image inputs and see the response. My testing so far has been on video sequences from the same camera stream, but this sounds interesting. Note these aren't "batched" in the traditional sense of independent queries: the images would all be added to the same chat query if you wanted the model to analyze them together.
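To make the "one chat query, many images" idea concrete, here is a minimal sketch of assembling frames from several DeepStream sources into a single multi-image prompt. The `Frame` type, the `<image>` placeholder convention, and the prompt layout are illustrative assumptions, not the actual VILA-1.5 or DeepStream API:

```python
# Sketch: merge frames from several cameras into ONE multi-image chat
# query, rather than sending each frame as an independent query.
# (Hypothetical structure, not the real VILA-1.5 interface.)

from dataclasses import dataclass

@dataclass
class Frame:
    camera_id: int   # DeepStream source id
    image_path: str  # frame extracted from the batched stream

def build_multi_image_query(frames, question):
    """Interleave one <image> placeholder per frame with camera labels,
    so the model can reference 'camera 0', 'camera 1', ... in its answer."""
    parts = [f"Camera {f.camera_id}: <image>" for f in frames]
    parts.append(question)
    prompt = "\n".join(parts)
    images = [f.image_path for f in frames]
    return prompt, images

prompt, images = build_multi_image_query(
    [Frame(0, "cam0.jpg"), Frame(1, "cam1.jpg")],
    "Describe what is happening across both views.",
)
```

The key design point is that all images travel inside one query, so the model can reason across views instead of answering per-camera.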


@dusty_nv I think this can be differentiated in three ways to give context to the VLM:

  1. Temporal context, when analyzing the same camera stream over time
  2. Spatial context, when analyzing multiple angles at the same time (assuming they overlap, or with some configuration)
  3. A combination of the two above

Can you provide some details about VILA-1.5?

@raresracoar do you plan to feed VILA-1.5 directly with images, or with some metadata from inference?
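The three context modes above can be sketched as prompt-assembly helpers. The frame/camera labeling scheme here is an illustrative assumption, not a VILA API:

```python
# Sketch of the temporal vs. spatial context modes as prompt builders.
# Labels and layout are assumptions for illustration only.

def temporal_prompt(n_frames, question):
    # Same camera, consecutive frames: label each image by time order.
    lines = [f"Frame t-{n_frames - 1 - i}: <image>" for i in range(n_frames)]
    return "\n".join(lines + [question])

def spatial_prompt(camera_ids, question):
    # Same instant, multiple overlapping angles: label each image by camera.
    lines = [f"Camera {c} (same instant): <image>" for c in camera_ids]
    return "\n".join(lines + [question])

# The combined mode (3) would simply nest both labels, e.g.
# "Camera 0, frame t-1: <image>".
```

Making the labeling explicit in the prompt is what lets the model distinguish "earlier vs. later" from "angle A vs. angle B".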

@dusty_nv could there be a way to fine-tune the VLM with some "map configuration" of the environment, and then analyze the streams as a whole? This could be very interesting.

Hi @alexaaniel, I haven't tried it onboard Jetson, but the upstream LLaVA and VILA repos include the fine-tuning scripts (in addition to projects like LLaMA-Factory).
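As a rough illustration of the "map configuration" idea, here is a sketch of a single fine-tuning record in a LLaVA-style conversations format, with a textual camera-layout description prepended to the human turn. The map text, file names, and the multi-image field are illustrative assumptions; check the actual repo's data format before training:

```python
# Sketch: one LLaVA-style fine-tuning record that bakes a textual
# "map configuration" (camera layout) into the human turn, so the
# model can learn to reason about the environment as a whole.
# All names and the map text are hypothetical examples.

MAP_CONFIG = (
    "Environment: warehouse. Camera 0 faces the loading dock; "
    "camera 1 faces the same dock from the opposite corner; "
    "their fields of view overlap in the center aisle."
)

record = {
    "id": "sample-0001",
    "image": ["cam0_000123.jpg", "cam1_000123.jpg"],
    "conversations": [
        {"from": "human",
         "value": f"<image>\n<image>\n{MAP_CONFIG}\n"
                  "What is happening in the aisle?"},
        {"from": "gpt",
         "value": "(ground-truth answer for this scene goes here)"},
    ],
}
```

Whether the environment description belongs in every training sample or in a system-style preamble is a design choice worth experimenting with.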
