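Many new open-source document parsing projects have appeared recently, so here is an updated overview. Most of the models and technical approaches mentioned here are covered in the "Document Intelligence" column (《文档智能专栏》).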
OCR-pipeline document parsing (layout + reading order + small expert OCR models)
- MinerU1.x: https://github.com/opendatalab/MinerU
- PP-StructureV3: https://github.com/PaddlePaddle/PaddleOCR/blob/main/docs/version3.x/algorithm/PP-StructureV3/PP-StructureV3.md
- Docling: https://github.com/docling-project/docling
- Marker: https://github.com/VikParuchuri/marker
...
Summary: OCR pipelines are highly interpretable and closer to a production-ready solution, but their generalization ability is limited.
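The three pipeline stages named above (layout, reading order, expert recognizers) can be sketched as follows. This is only a minimal illustration under stated assumptions: `detect_layout`, `sort_reading_order`, `EXPERTS`, and `parse_page` are hypothetical names, the detector returns hard-coded regions, and the reading-order step is a naive single-column heuristic. None of it is taken from the projects listed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


@dataclass
class Region:
    bbox: BBox
    kind: str  # e.g. "title", "text", "table"


def detect_layout(page: object) -> List[Region]:
    """Hypothetical layout stage: a real one would run a detection model.

    Returns fixed regions in arbitrary order to mimic raw detector output.
    """
    return [
        Region((50, 300, 550, 500), "table"),
        Region((50, 40, 550, 80), "title"),
        Region((50, 100, 550, 280), "text"),
    ]


def sort_reading_order(regions: List[Region]) -> List[Region]:
    """Naive single-column heuristic: top-to-bottom, then left-to-right."""
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))


# One hypothetical expert recognizer per region type (stubs, not real models).
EXPERTS: Dict[str, Callable[[object, BBox], str]] = {
    "title": lambda page, bbox: "# <title text>",
    "text": lambda page, bbox: "<paragraph text>",
    "table": lambda page, bbox: "<table as Markdown>",
}


def parse_page(page: object) -> str:
    """Run layout -> reading order -> per-region expert, then join the results."""
    regions = sort_reading_order(detect_layout(page))
    parts = [EXPERTS[r.kind](page, r.bbox) for r in regions]
    return "\n\n".join(parts)
```

Because each stage is a separate function with an inspectable output, a wrong result can be traced to one stage, which is the interpretability advantage. The fixed `EXPERTS` table also shows the limitation: a region type with no expert cannot be handled.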
Layout + VLM
- MinerU2.5 (1.2B): https://github.com/opendatalab/MinerU
- MonkeyOCR (1.2B~3B): https://github.com/Yuliang-Liu/MonkeyOCR
- PaddleOCR-VL (0.9B): https://github.com/PaddlePaddle/PaddleOCR
- chandra (8B): https://github.com/datalab-to/chandra
Some of these pair a traditional object detection model with a VLM that parses the content of each region; others use a single model for both detection and recognition.
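The two variants described above can be contrasted in a short sketch. Everything here is an assumption made for illustration: the `Detector` and `VLM` callables, the prompt strings, and the JSON layout format are hypothetical and do not reflect the actual interfaces of the projects listed.

```python
import json
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]
Detector = Callable[[object], List[Tuple[BBox, str]]]  # page -> [(bbox, kind)]
VLM = Callable[[object, str], str]                     # (image, prompt) -> text


def crop(page: object, bbox: BBox) -> Tuple[object, BBox]:
    """Placeholder crop: a real version would slice the page image."""
    return (page, bbox)


def parse_with_detector(page: object, detect: Detector, vlm: VLM) -> List[str]:
    """Variant A: a separate detector finds regions, the VLM reads each crop."""
    return [
        vlm(crop(page, bbox), f"Transcribe this {kind} region.")
        for bbox, kind in detect(page)
    ]


def parse_single_model(page: object, vlm: VLM) -> List[str]:
    """Variant B: the same VLM is prompted for layout, then once per region.

    Assumes (hypothetically) that the model answers the layout prompt with a
    JSON list of {"bbox": [x0, y0, x1, y1], "kind": "..."} objects.
    """
    layout = json.loads(vlm(page, "Return layout regions as JSON."))
    return [
        vlm(crop(page, tuple(r["bbox"])), f"Transcribe this {r['kind']} region.")
        for r in layout
    ]
```

In both variants the recognition step runs on cropped regions rather than the whole page; the difference is only whether layout comes from a dedicated detector or from the VLM itself.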
End-to-end multimodal document parsing (fine-tuned)
- Dolphin: https://github.com/bytedance/Dolphin
- olmOCR: https://github.com/allenai/olmocr
- GOT-OCR: https://github.com/Ucas-HaoranWei/GOT-OCR2.0
- SmolDocling: https://huggingface.co/ds4sd/SmolDocling-256M-preview
- Unstructured: https://github.com/Unstructured-IO/unstructured
- OpenParse: https://github.com/Filimoa/open-parse
- Mistral-OCR: https://mistral.ai/news/mistral-ocr?utm_source=ai-bot.cn
- Nougat: https://github.com/facebookresearch/nougat
- DeepSeek-OCR: https://github.com/deepseek-ai/DeepSeek-OCR
...
Representative general-purpose multimodal large models
- GPT-4o
- Gemini
- Qwen2.5-VL-72B
...

