I’m currently looking into extracting training data from open LLMs, e.g. the approach described in LLM-Deflate: Extracting LLMs Into Datasets.
My questions:
- Are there any existing NVIDIA blueprints, frameworks, or tools that implement similar LLM knowledge extraction/decompression techniques?
- Does NVIDIA have any official guidance or best practices for systematically extracting training datasets from Nemotron models?
- Are there any NeMo framework components that could be leveraged for this type of hierarchical knowledge exploration and dataset generation?
- Has anyone in the community experimented with similar approaches using NVIDIA’s infrastructure (like NIM or TensorRT-LLM) for large-scale knowledge extraction?
Any insights, existing tools, or community projects in this direction would be greatly appreciated!