Oh – they are really good at reading multi-column legal documents with clauses, handwritten notes in the margin, and schedule tables. Quite amazing if you haven’t seen the output before. Qwen3.5 is excellent.
Thomas,
- have you tested other models, particularly those in the ~30b MoE range, such as Gemma 4 and Qwen 3.5/6?
- have you tried an approach involving post-OCR cleaning using even smaller models, e.g. G4 E4B?
- have you tried DFlash to see whether it brings any speedup?
Models in the 30b range work incredibly fast – according to SparkArena, q3.6-35-int4 achieves over 90 tps single node. Might they also deliver in terms of quality? I’ve also started wondering how effective a two-stage processing approach might be, e.g. first MinerU at <1 second per page, followed by a quick pass with G4 E4B?
Either way, legal docs are a nightmare :)
Short answer – yes for 3.6 35b, but not extensively. We really can only support one multi-modal model running on a single Spark, so 122b is the best all-rounder at the moment. For client deployment I plan to test the smaller models, but I need to get things working reliably first with the best case. Then we can explore alternatives from that quality baseline.
With MTP=3 we are getting 80% acceptance rates and pretty good performance so I am not unhappy with 122b for long running context and multi-modal.
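For a rough sense of what those numbers imply, here is a minimal sketch of the usual back-of-envelope model for speculative decoding. It assumes every draft token is accepted independently with the same probability and that drafting stops at the first rejection; the "80%" reported by a serving stack may be measured differently, and the function name is made up for illustration. It also ignores the cost of producing the drafts, so treat the result as an upper bound on the speedup, not a prediction.

```python
def expected_tokens_per_step(acceptance_rate: float, num_draft_tokens: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Simplifying assumption: each draft token is accepted independently
    with probability `acceptance_rate`, and the first rejection ends the
    run of accepted drafts.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0 and 1")
    if num_draft_tokens < 0:
        raise ValueError("num_draft_tokens must be non-negative")
    # The i = 0 term is the token the target model always contributes;
    # the i-th term is the probability that the first i drafts all pass.
    return sum(acceptance_rate ** i for i in range(num_draft_tokens + 1))


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Under these assumptions, 80% acceptance with 3 draft tokens gives just under 3 tokens per step, and each extra draft token adds less than the previous one.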
A few updates after trialing lots of combos to hone the pipeline using one difficult 29-page PDF (likely biasing my assessment).
- Qwen3.6-35b underperforms against 27b in consistency and output quality when making judgements on structure even though there were some speedups. This is for stitching pages together. I didn’t extensively test OCR conversion between the two.
- 27b int4 for C<=4 is faster, yes, but I was seeing higher throughput with C=10 → 20 using FP8. Getting up to 380 TPS (dual node setup).
- 27b int4 had more judgement output variability than FP8 for a fixed prompt.
- MTP > DFlash
- MTP=3 → 5 all have high acceptance rates. Still playing with this to see when the returns diminish.
- Should probably try this against 3.5-122b.
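One rough way to put a number on the "judgement output variability" point in the list above: send the same fixed prompt N times and measure how often the answers agree. This is only a sketch, not the method used in the tests described here; `agreement_rate` is a hypothetical helper, and exact string matching is a crude stand-in for comparing structured judgements.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(outputs: Sequence[str]) -> float:
    """Fraction of runs that match the most common output.

    1.0 means every run gave the same answer; lower values mean more
    run-to-run variability for the same prompt.
    """
    if not outputs:
        raise ValueError("need at least one output")
    # Ignore leading/trailing whitespace so trivial differences don't count.
    normalized = [o.strip() for o in outputs]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)
```

Running this per quantization (e.g. int4 vs FP8) over the same set of prompts would give comparable numbers, assuming the sampling settings are held constant.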
It might be worth looking into ParseBench to evaluate the results using a custom dataset:
I recently ran an interesting experiment using Hermes Agent and MinerU. In this setup, Hermes Agent runs on my PC and connects to Qwen3.5-122B, which is served by vLLM on a DGX Spark. MinerU runs separately on a PC equipped with an RTX 3060. Both the vLLM-Qwen service and the vLLM-MinerU service are deployed in Docker containers.
I asked Hermes Agent to convert PDF files into Markdown using MinerU. The workflow performed very well. Hermes automatically wrote a Python script and accessed MinerU through the Gradio API to complete the conversion.
Based on this experiment, I think a practical workflow for processing PDFs and images, then turning them into a knowledge-service solution, benefits from using two GPUs: one for the LLM/agent workload and another for document/image conversion.
I also tested Docling for PDF-to-Markdown conversion, but the output quality was not as good as MinerU in this use case.
One interesting side note: vllm-MinerU in Docker failed after 5 files. Hermes was smart enough to recognize this, install MinerU locally, and carry on.
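The recovery behaviour described here (switch to a local install once the containerised service stops responding) can be sketched as a simple fallback loop. This is an illustration of the pattern, not the script Hermes wrote; `convert_all`, `primary`, and `fallback` are hypothetical names, and the two converters are passed in as plain callables so no MinerU API is assumed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# A converter takes a PDF path and returns Markdown text.
Converter = Callable[[Path], str]


def convert_all(
    pdfs: Iterable[Path], primary: Converter, fallback: Converter
) -> Dict[Path, str]:
    """Convert each PDF, switching permanently to `fallback` after the
    first failure of `primary`."""
    results: Dict[Path, str] = {}
    use_fallback = False
    for pdf in pdfs:
        if not use_fallback:
            try:
                results[pdf] = primary(pdf)
                continue
            except Exception:
                # Assume the service is down for good and stop retrying it.
                use_fallback = True
        # The file that triggered the failure is retried here as well.
        results[pdf] = fallback(pdf)
    return results
```

Sticking with the fallback after the first failure avoids paying a timeout on every remaining file; a real pipeline might instead retry the primary service after a delay.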
One more observation: MinerU generates better Markdown results than ChatGPT.
I have a concurrent workflow running in OpenCode now that is reliable. Working on the final generic extraction part at the moment. Basically you give it a todo list of fields to extract, with markdown prompts for each and it runs them in parallel. The final result provides a static web page where you can compare sources and results side by side for evaluation. This has been mostly a research project to figure out what inference patterns work, but has turned out to do most of what we need it to. Will share when this version is done.
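Until that version is shared, the "todo list of fields, one prompt each, run in parallel" idea can be sketched roughly as follows. This is a guess at the general shape, not the actual OpenCode workflow: `extract_fields`, `ask_model`, and `echo_model` are hypothetical names, and the model call is a stub so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def extract_fields(
    document_md: str,
    field_prompts: Dict[str, str],
    ask_model: Callable[[str], str],
    max_workers: int = 8,
) -> Dict[str, str]:
    """Run one prompt per field concurrently and collect answers by field name."""

    def run_one(item: Tuple[str, str]) -> Tuple[str, str]:
        name, prompt = item
        # Each request carries the field-specific prompt plus the full document.
        return name, ask_model(f"{prompt}\n\n---\n\n{document_md}")

    # Threads suit this case because the work is waiting on network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, field_prompts.items()))


def echo_model(prompt: str) -> str:
    """Stand-in for a real inference call; returns the prompt's first line upper-cased."""
    return prompt.splitlines()[0].upper()


if __name__ == "__main__":
    todo = {"parties": "List the parties.", "term": "State the contract term."}
    print(extract_fields("# Agreement\n...", todo, echo_model))
```

In practice `ask_model` would wrap a request to the inference server, and `max_workers` would be tuned to the concurrency level the server handles best.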
