Whats your VLM Experience?

Curious about everyone’s VLM experience on the DGX Spark. Right now I am using the Live VLM Playbook to simply identify if I have safety glasses on or off. It seems to be super inconsistent… Ive tried a couple of different models but none seem to hit home. Is it worthwhile to go down the rabbit hole of fine tuning AI models?

I use Qwen3-VL-30B-A3B-Instruct in a YouTube shorts pipeline I have to analyze my shorts for editing mistakes before posting and it performs great. I also use it to check for features in the videos/streams before clipping to make sure it contains what I’m looking for and it performs worse at that but still gets it right like 80% of the time.

What UI are you using?

I don’t run it through a UI, I feed the image/video into it using the openai python library (model is hosted on vllm) and have qwen output the information I want in a json structure which it does sometimes mess up because it’s not very smart in that way but nothing a regen can’t fix. I’m sure you could vibe code up a nice looking UI to interact with it in this way.