Originally published at: Reimagining LLM Memory: Using Context as Training Data Unlocks Models That Learn at Test-Time | NVIDIA Technical Blog
We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste the earlier context back into the chat for…
Answering from context-window information in a production setting (the basic scenario: I'm conversing with a chatbot) requires speed. How fast is the TTT process you propose? Can it happen "in the background" during my conversation? Let's assume the chat grows by a few pages every minute during an hour-long session.
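The concurrency pattern behind this question can be sketched without any real model: a background worker consumes new chat chunks from a queue and runs "training steps" on them while the foreground conversation loop stays responsive. This is a toy illustration, not the article's implementation; `BackgroundTTT` and its `_train_step` stand-in are made-up names, and a real system would run a few gradient updates on fast weights instead of the placeholder below.

```python
import threading
import queue
import time

class BackgroundTTT:
    """Toy sketch: train on new chat chunks in the background while the
    foreground chat loop keeps serving. _train_step is a stand-in; a real
    system would run a few gradient steps on the model's fast weights."""

    def __init__(self):
        self.updates = 0                 # completed train steps
        self.pending = queue.Queue()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def add_context(self, chunk: str):
        # Called from the chat loop: non-blocking, just enqueue the chunk.
        self.pending.put(chunk)

    def _train_step(self, chunk: str):
        # Stand-in for a real gradient update on this chunk of context.
        time.sleep(0.001)                # simulate compute
        self.updates += 1

    def _run(self):
        # Drain the queue until close() signals stop and the queue is empty.
        while not self._stop.is_set() or not self.pending.empty():
            try:
                chunk = self.pending.get(timeout=0.05)
            except queue.Empty:
                continue
            self._train_step(chunk)
            self.pending.task_done()

    def close(self):
        self.pending.join()              # wait for all queued chunks
        self._stop.set()
        self._worker.join()

ttt = BackgroundTTT()
for turn in range(10):                   # the chat loop never blocks on training
    ttt.add_context(f"user/assistant turn {turn}")
ttt.close()
print(ttt.updates)                       # 10
```

Whether this keeps up with "a few pages per minute" then reduces to whether one train step on a chunk is cheaper than the time it took the user to produce it.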
This is great. I've repeatedly seen this in enterprise deployments: users keep asking the same question over and over, and we keep sending in the same context over and over. This would mean the model can actually answer questions without needing to retrieve, unless the question is new.
One angle you didn't write about but that would be crucial for enterprise deployments: would it be possible to "draw" tenant boundaries within the model, so that learnings and data from one customer don't leak into the answers we give to another customer, while still distilling common patterns across customers into the model's weights?
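The isolation-plus-distillation split asked about above can be sketched with a toy store: a shared base holds patterns explicitly distilled across customers, while each tenant gets a private overlay that lookups for other tenants never touch. All names here (`TenantScopedMemory`, `learn`, `answer`) are illustrative assumptions, not an API from the article; in a weights-based system the overlay would correspond to per-tenant adapter weights rather than a dict.

```python
from collections import defaultdict

class TenantScopedMemory:
    """Toy sketch of tenant boundaries: a shared base for patterns
    distilled across customers, plus one private overlay per tenant.
    Answering for one tenant never reads another tenant's overlay."""

    def __init__(self):
        self.base = {}                        # shared, distilled patterns
        self.overlays = defaultdict(dict)     # tenant_id -> private learnings

    def learn(self, tenant_id: str, key: str, value: str, shared: bool = False):
        # shared=True models deliberately distilling a common pattern into
        # the base; everything else stays inside the tenant's own overlay.
        if shared:
            self.base[key] = value
        else:
            self.overlays[tenant_id][key] = value

    def answer(self, tenant_id: str, key: str):
        # The tenant's overlay shadows the shared base; other tenants'
        # overlays are invisible, so nothing can leak across the boundary.
        return self.overlays[tenant_id].get(key, self.base.get(key))

mem = TenantScopedMemory()
mem.learn("acme", "billing cycle", "monthly")
mem.learn("globex", "billing cycle", "annual")
mem.learn("acme", "greeting style", "formal", shared=True)

print(mem.answer("acme", "billing cycle"))    # monthly (private to acme)
print(mem.answer("globex", "billing cycle"))  # annual (private to globex)
print(mem.answer("globex", "greeting style")) # formal (distilled, shared)
print(mem.answer("acme", "globex secret"))    # None (no cross-tenant leakage)
```

The hard part in a real deployment is the `shared=True` decision itself: something has to certify that a pattern is genuinely generic before it crosses the boundary into shared weights.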