PaddleOCR anyone?

Oh – they are really good at reading multi-column legal documents with clauses, handwritten notes in the margin, and schedule tables. Quite amazing if you haven’t seen the output before. Qwen3.5 is excellent.

Thomas,

  1. have you tested other models, particularly those in the ~30b MoE range, such as Gemma 4 and Qwen 3.5/6?

  2. have you tried an approach involving post-OCR cleaning using even smaller models, e.g. G4 E4B?

  3. have you tried DFlash to see whether it brings any speedup?

Models in the 30b range work incredibly fast – according to SparkArena, q3.6-35-int4 achieves over 90 tps on a single node. Might they also deliver in terms of quality? I’ve also started wondering how effective a two-stage processing approach might be, e.g. first MinerU at <1 second per page, followed by a quick pass with G4 E4B?
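To make the two-stage idea concrete, here is a minimal sketch of the shape I have in mind. `fast_ocr` and `clean_page` are hypothetical placeholders (for a MinerU call and a small-model cleanup call respectively), not real APIs; the stand-in functions exist only so the snippet runs on its own.

```python
from typing import Callable, List


def two_stage(pages: List[bytes],
              fast_ocr: Callable[[bytes], str],
              clean_page: Callable[[str], str]) -> List[str]:
    """Stage 1: fast OCR on every page. Stage 2: cleanup pass on each result."""
    raw = [fast_ocr(page) for page in pages]
    return [clean_page(text) for text in raw]


# Hypothetical stand-ins for illustration only: a real setup would call
# the OCR service in fake_ocr and a small LLM in fake_clean.
def fake_ocr(page: bytes) -> str:
    return page.decode("utf-8")


def fake_clean(text: str) -> str:
    # Collapse runs of whitespace, as a trivial example of "cleaning".
    return " ".join(text.split())
```

Keeping the two stages as separate callables would make it easy to time each one and to swap the cleanup model without touching the OCR step.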

Either way, legal docs are a nightmare :)

Short answer – yes for 3.6 35b, but not extensively. We really can only support one multimodal model running on a single Spark, so 122b is the best all-rounder at the moment. For client deployment I plan to test the smaller models, but I need to get things working reliably first with the best case. Then we can explore alternatives from that quality basis.

With MTP=3 we are getting 80% acceptance rates and pretty good performance, so I am not unhappy with 122b for long-running context and multimodal work.

A few updates after trialing lots of combos to hone the pipeline using one difficult 29-page PDF (likely biasing my assessment).

  • Qwen3.6-35b underperforms against 27b in consistency and output quality when making judgements on structure even though there were some speedups. This is for stitching pages together. I didn’t extensively test OCR conversion between the two.
  • 27b int4 for C<=4 is faster, yes, but I was seeing higher throughput with C=10 → 20 using FP8. Getting up to 380 TPS (dual node setup).
  • 27b int4 had more judgement output variability than FP8 for a fixed prompt.
  • MTP > DFlash
  • MTP=3 → 5 all have high acceptance rates. Still playing with this to see when the returns diminish.
  • Should probably try this against 3.5-122b.
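On the diminishing-returns question, a rough back-of-envelope model may help. It assumes each draft token is accepted independently with the same probability and that drafting stops at the first rejection, so the expected number of tokens per verification step with k draft tokens is 1 + a + a^2 + ... + a^k. Both the independence assumption and reading the 80% figure as a per-token probability are my assumptions, not something measured here, and the model ignores the extra cost of drafting more tokens.

```python
def expected_tokens_per_step(accept_prob: float, num_draft: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes independent per-token acceptance with probability accept_prob
    and that the first rejection discards the rest of the draft.
    The i = 0 term is the token the target model always produces itself.
    """
    return sum(accept_prob ** i for i in range(num_draft + 1))


# Compare the draft lengths discussed above at an assumed 0.8 acceptance.
for k in (3, 4, 5):
    print(k, round(expected_tokens_per_step(0.8, k), 3))
```

Under these assumptions each extra draft token adds less than the previous one (the gain from k to k+1 is a^(k+1)), so where the returns stop being worth it depends on how much each extra draft token costs in practice.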

It might be worth looking into ParseBench to evaluate the results using a custom dataset.

I recently ran an interesting experiment using Hermes Agent and MinerU. In this setup, Hermes Agent runs on my PC and connects to Qwen3.5-122B, which is served by vLLM on a DGX Spark. MinerU runs separately on a PC equipped with an RTX 3060. Both the vLLM-Qwen service and the vLLM-MinerU service are deployed in Docker containers.

I asked Hermes Agent to convert PDF files into Markdown using MinerU. The workflow performed very well. Hermes automatically wrote a Python script and accessed MinerU through the Gradio API to complete the conversion.

Based on this experiment, I think a practical workflow for processing PDFs and images, then turning them into a knowledge-service solution, benefits from using two GPUs: one for the LLM/agent workload and another for document/image conversion.

I also tested Docling for PDF-to-Markdown conversion, but the output quality was not as good as MinerU in this use case.

One interesting side note: vLLM-MinerU in Docker failed after 5 files. Hermes was smart enough to recognize this, install MinerU locally, and then move on.
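I don't have the script Hermes wrote to hand, but the fallback behaviour it showed can be sketched roughly as follows. `primary` and `fallback` are hypothetical callables standing in for "convert via the remote MinerU service" and "convert via a local MinerU install"; nothing here is MinerU's or Hermes's actual API.

```python
from typing import Callable, Iterable, List, Tuple


def convert_with_fallback(files: Iterable[str],
                          primary: Callable[[str], str],
                          fallback: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Convert each file with primary; after its first failure, switch to
    fallback for that file and for every remaining file."""
    results: List[Tuple[str, str]] = []
    use_fallback = False
    for path in files:
        if not use_fallback:
            try:
                results.append((path, primary(path)))
                continue
            except Exception:
                # Treat any error as "service is down" and stop retrying it.
                use_fallback = True
        results.append((path, fallback(path)))
    return results
```

Switching permanently after the first failure is a design choice for this sketch; a retry with backoff before giving up on the primary service would be a reasonable alternative.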

One more observation: MinerU generates better Markdown results than ChatGPT.

I have a concurrent workflow running in OpenCode now that is reliable. Working on the final generic extraction part at the moment. Basically, you give it a to-do list of fields to extract, with a Markdown prompt for each, and it runs them in parallel. The final result provides a static web page where you can compare sources and results side by side for evaluation. This has been mostly a research project to figure out what inference patterns work, but it has turned out to do most of what we need it to. Will share when this version is done.
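This is not the actual OpenCode workflow, just a minimal sketch of the general pattern of running one prompt per field in parallel. `ask_model` is a hypothetical callable that takes a prompt and the document text and returns the model's answer; in practice it would wrap a request to the inference server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def extract_fields(document: str,
                   field_prompts: Dict[str, str],
                   ask_model: Callable[[str, str], str],
                   max_workers: int = 8) -> Dict[str, str]:
    """Run one extraction prompt per field concurrently and collect the answers."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every field up front so the requests overlap.
        futures = {
            name: pool.submit(ask_model, prompt, document)
            for name, prompt in field_prompts.items()
        }
        # result() blocks until that field is done and re-raises any error.
        return {name: future.result() for name, future in futures.items()}
```

Threads fit here because the work is mostly waiting on network calls; `max_workers` would roughly correspond to the request concurrency the server is tuned for.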