Data flywheels for LLM-Deflate?

I’m currently looking into extracting training data from open LLMs, e.g. as described in “LLM-Deflate: Extracting LLMs Into Datasets”.

My questions:

  1. Are there any existing NVIDIA blueprints, frameworks, or tools that implement similar LLM knowledge extraction/decompression techniques?

  2. Does NVIDIA have any official guidance or best practices for systematically extracting training datasets from Nemotron models?

  3. Are there any NeMo framework components that could be leveraged for this type of hierarchical knowledge exploration and dataset generation?

  4. Has anyone in the community experimented with similar approaches using NVIDIA’s infrastructure (like NIM or TensorRT-LLM) for large-scale knowledge extraction?

Any insights, existing tools, or community projects in this direction would be greatly appreciated!

Hi @greg165,

We don’t have any content around carrying out extraction of training data from models. I’ve noted your interest here with the product team.

Best,

Sophie

Thanks for checking.