Hey folks. How would you go about building something like this?
So here’s the idea. I’m building a tool that takes video content — lectures, tutorials, demonstrations, whatever — and creates a narrowly specialized agent for a specific task. Not a general-purpose chatbot, but a genuinely focused expert. Like: you feed it videos of welding masterclasses, describe the specialty you want, and you get a welding assistant agent that actually understands those specific techniques, those specific moves.
The architecture I’m looking at right now goes like this. First, a human marks up the video at the initial stage. Places markers: here’s a key concept, here’s the core of the technique, this part is secondary. This matters because the content is complex and without a human eye the machine won’t know what to actually pay attention to. Then an agent-tokenizer steps in. It looks at the video through those markers plus the specialty of the agent we want to build, and produces semantic tokens. Not just text, but task-specific units of meaning. Same video with a different specialty produces different tokens. For a tutor agent — one set. For a fact-checker agent — a completely different set. Then a builder-agent assembles the final specialized agent from those tokens.
And this is where I’m stuck on a few things, honestly not sure which way to go.
First: the marker system. How would you even organize it? Just timestamps with text labels or something more sophisticated? Maybe some kind of hierarchy? Or are there existing approaches from video annotation that I just don’t know about?
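To make the "hierarchy" option concrete, here is a rough sketch of what I have in mind: markers as time spans with a kind and a label, nested inside each other. Everything in it is hypothetical (the `Marker` class, the `kind` values, the sample welding data), so treat it as a strawman to criticize, not a proposal.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Marker:
    """One human annotation over a time span (hypothetical schema)."""
    start: float   # seconds from the start of the video
    end: float     # seconds, expected to be >= start
    kind: str      # assumed vocabulary: "key_concept", "technique_core", "secondary"
    label: str     # free-text note from the annotator
    children: List["Marker"] = field(default_factory=list)

    def walk(self, depth: int = 0) -> Iterator[Tuple[int, "Marker"]]:
        """Yield (depth, marker) for this marker and every nested marker."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


def markers_at(root: Marker, t: float) -> List[Marker]:
    """Return every marker whose span covers time t, outermost first."""
    return [m for _, m in root.walk() if m.start <= t <= m.end]


# Made-up sample data for a welding lesson.
lesson = Marker(0.0, 600.0, "key_concept", "Main weld technique", children=[
    Marker(30.0, 95.0, "technique_core", "Torch angle and travel speed"),
    Marker(100.0, 140.0, "secondary", "Instructor anecdote"),
])

active = markers_at(lesson, 60.0)
print([m.kind for m in active])
```

The flat alternative would be the same records without `children`, which is simpler to annotate but loses the "this detail belongs to that concept" relationship.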
Second: the tokenizer itself. Does it make sense to build it as a separate agent with a prompt that dynamically decides what to extract? Or is it more reliable to build a pipeline: CLIP for visuals, an LLM for text, and a custom layer on top that stitches everything together based on the specialty? Which would be more flexible and less likely to fall apart on real data?
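Here is a minimal sketch of the pipeline variant, so the question is less abstract. The extractors are stand-in functions, not real CLIP or LLM calls, and the per-specialty weight table is purely my assumption about what the "custom layer" might do. All names (`Segment`, `SemanticToken`, `SPECIALTY_WEIGHTS`, `tokenize`) are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Segment:
    start: float
    end: float
    kind: str         # marker kind assigned by the human annotator
    transcript: str   # speech in this span, assumed already transcribed


@dataclass
class SemanticToken:
    start: float
    end: float
    weight: float              # how much this span matters for the specialty
    features: Dict[str, str]   # one entry per extractor


# An extractor maps a segment to some description of it. Real versions would
# wrap a vision encoder or an LLM; these stubs only keep the sketch runnable.
Extractor = Callable[[Segment], str]


def fake_visual(seg: Segment) -> str:
    return f"frames[{seg.start:.0f}-{seg.end:.0f}s]"


def fake_text(seg: Segment) -> str:
    return seg.transcript.lower()


# Hypothetical weights: how much each specialty cares about each marker kind.
SPECIALTY_WEIGHTS: Dict[str, Dict[str, float]] = {
    "tutor": {"key_concept": 1.0, "technique_core": 0.8, "secondary": 0.3},
    "fact_checker": {"key_concept": 0.9, "technique_core": 0.2, "secondary": 0.0},
}


def tokenize(segments: List[Segment], specialty: str,
             extractors: Dict[str, Extractor],
             min_weight: float = 0.25) -> List[SemanticToken]:
    """Run every extractor on each segment the specialty weights highly enough."""
    weights = SPECIALTY_WEIGHTS[specialty]
    tokens = []
    for seg in segments:
        w = weights.get(seg.kind, 0.0)
        if w < min_weight:
            continue  # this specialty does not care about this span
        features = {name: fn(seg) for name, fn in extractors.items()}
        tokens.append(SemanticToken(seg.start, seg.end, w, features))
    return tokens


segments = [
    Segment(0.0, 30.0, "key_concept", "Why joint prep matters"),
    Segment(30.0, 95.0, "technique_core", "Hold the torch like this"),
    Segment(95.0, 140.0, "secondary", "A story from the shop"),
]
extractors = {"visual": fake_visual, "text": fake_text}

tutor_tokens = tokenize(segments, "tutor", extractors)
checker_tokens = tokenize(segments, "fact_checker", extractors)
print(len(tutor_tokens), len(checker_tokens))
```

The single-agent alternative would replace `SPECIALTY_WEIGHTS` and the threshold with a prompt that decides per segment what to keep, which is where my flexibility versus reliability question comes from.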
Third: how should I pass the target agent’s specialty to the tokenizer? Just a natural language description or something more formal? Maybe an ontology or a graph? Or is plain language good enough and I shouldn’t overcomplicate it?
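One middle ground I can imagine between free text and a full ontology is a small structured spec that still renders to plain language. A rough sketch, where the `SpecialtySpec` class and its fields are my own invention and not an existing format:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SpecialtySpec:
    """Hypothetical semi-formal description of the target agent."""
    name: str
    goal: str                                        # one-sentence purpose
    focus: List[str] = field(default_factory=list)   # what to extract
    ignore: List[str] = field(default_factory=list)  # what to skip

    def to_prompt(self) -> str:
        """Render the spec as plain text for a prompt-driven tokenizer."""
        lines = [f"Target agent: {self.name}", f"Goal: {self.goal}"]
        if self.focus:
            lines.append("Extract: " + ", ".join(self.focus))
        if self.ignore:
            lines.append("Skip: " + ", ".join(self.ignore))
        return "\n".join(lines)


spec = SpecialtySpec(
    name="welding_assistant",
    goal="Coach a beginner through the techniques shown in the videos",
    focus=["hand positions", "common mistakes"],
    ignore=["anecdotes"],
)
print(spec.to_prompt())
```

The fields could be validated and diffed between agents, while the rendered text is still something an LLM-based tokenizer can read directly.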
Fourth: is it worth baking in a two-agent loop from the start — a generator builds, a verifier checks and sends back for refinement? Or is that overengineering at this stage?
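For reference, the loop I mean is about this small. It is a sketch with toy stand-ins for both agents; `build_with_review`, `toy_generate`, and `toy_verify` are hypothetical names, and in the real thing both callables would be model calls.

```python
from typing import Callable, List, Tuple

# (tokens, feedback from the previous round) -> draft agent definition
Generator = Callable[[List[str], List[str]], str]
# draft -> list of problems; an empty list means the draft is accepted
Verifier = Callable[[str], List[str]]


def build_with_review(tokens: List[str], generate: Generator, verify: Verifier,
                      max_rounds: int = 3) -> Tuple[str, int, List[str]]:
    """Regenerate until the verifier is satisfied or the round budget runs out.

    Returns (last draft, rounds used, problems still open).
    """
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: List[str] = []
    draft = ""
    for round_no in range(1, max_rounds + 1):
        draft = generate(tokens, feedback)
        feedback = verify(draft)
        if not feedback:
            return draft, round_no, []
    return draft, max_rounds, feedback


# Toy agents: the verifier insists on a safety section, and the generator
# only adds one after being told it is missing.
def toy_generate(tokens: List[str], feedback: List[str]) -> str:
    parts = list(tokens)
    if any("safety" in problem for problem in feedback):
        parts.append("safety checklist")
    return "; ".join(parts)


def toy_verify(draft: str) -> List[str]:
    return [] if "safety" in draft else ["missing safety section"]


draft, rounds, problems = build_with_review(
    ["torch angle", "travel speed"], toy_generate, toy_verify)
print(rounds, problems)
```

The hard cap on rounds is the part I would keep even in a first version, since without it a strict verifier and a weak generator can loop forever.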
And the big one: where does this break first? I’ve got a gut feeling there are pitfalls I’m not seeing yet. Maybe some of you have already learned these lessons the hard way.
Anyway, would really appreciate any thoughts, especially from people who’ve touched similar architectures — multimodal pipelines, video tokenization, task-specific agent assembly. Thanks in advance.