95% of People Still Extract Data by Hand: This Tool Makes It Structured in Seconds

机器学习AI算法工程

2026-03-18


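Every day you deal with all kinds of messy text: rambling emails from insurers, real-estate listings in a dozen different formats, crooked scans of doctors' handwritten prescriptions...

Pulling out the key facts (a policy number, a house price, a drug dosage) usually means copying and pasting by hand, or writing a pile of matching rules. Then the format changes and every rule breaks.

I have seen too many people sink hours into this. With the right tool, a few lines of code are enough.

This article introduces LangExtract, a library recently open-sourced by Google for LLM-based information extraction. It targets the last mile of turning unstructured text into structured data.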

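By the end of this article you will know:

- LangExtract's core strengths and the scenarios it fits
- How to set up the environment and run a basic extraction in 3 steps
- How to process long documents and generate an interactive visualization
- A hands-on case study: from messy text to clean JSON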

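I. What Is LangExtract?

[Figure: workflow]

In short, LangExtract is a Python library that uses an LLM (such as Gemini or GPT) to turn unstructured text into structured information, and every result can be located precisely in the source text.

How does it differ from traditional tools?

Traditional tools (template-based extractors, or plain OCR) assume a fixed document layout. But company A's insurance quote is a table while company B's is plain text; some prescriptions are printed, others are scrawled by hand. As soon as the format changes, the rules stop working and have to be rewritten.

OCR turns images into text, but the output is dirty: misrecognized characters, broken line wraps, values landing in the wrong column. Structuring it further usually takes another layer of regular expressions and keyword matching, and the maintenance cost balloons.

LangExtract instead lets a large model read the text for its meaning rather than relying on position and format. It has six core strengths: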

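1. Precise source grounding: every result maps back to the text

This is LangExtract's main selling point. For every extracted entity, relation, or structured field, it automatically records the start and end position in the source text (line number, character offsets) and links the surrounding context.

As a result, each extraction can be traced back to the source and verified, which addresses the black-box problem of LLM extraction. In my test on medical records, after extracting a patient's "past medical history" field, LangExtract pointed to line 12 of the record, "history of hypertension for 5 years, on regular medication", so a reviewer could check it quickly.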
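To make the idea of source grounding concrete, here is a minimal sketch of how a reviewer's script might check one extraction against the source using character offsets. The `Span` class and `verify_span` helper are hypothetical names for illustration and are not part of LangExtract; the sentence is the one used in the basic-usage example in this article.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one extraction with character offsets."""
    label: str
    text: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def verify_span(source: str, span: Span, context: int = 10) -> str:
    """Return the surrounding context if the span matches the source, else raise."""
    found = source[span.start:span.end]
    if found != span.text:
        raise ValueError(f"{span.label}: expected {span.text!r}, found {found!r}")
    left = max(0, span.start - context)
    right = min(len(source), span.end + context)
    return source[left:right]


source = "Juliet gazed at stars, her heart longing for Romeo."
span = Span("emotion", "longing for Romeo", 33, 50)
print(verify_span(source, span))
```

If the offsets and the extracted text disagree, the helper raises instead of silently accepting the result, which is the property that makes grounded extractions auditable.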

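2. Few-shot definition: 1-2 examples define the format

LangExtract lets you define the output format with a handful of examples (few-shot), with no elaborate prompt and no model fine-tuning. You provide 1-2 texts with their structured results, and the tool follows that format and adapts to the domain.

In a legal-case extraction, for example, I supplied a single structured example with "case number, plaintiff, defendant, verdict", and LangExtract extracted the same fields from other cases in that format. It also supports controlled generation (such as Gemini's structured output) to force results into formats such as JSON or CSV and avoid malformed LLM output.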

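3. Long-document optimization: smart chunking plus parallel processing

For long documents, LangExtract has a built-in chunking strategy: it splits the document along semantic units (paragraphs, sections) so that each chunk stays self-contained and extractions do not straddle a semantic boundary. The chunks are then processed in parallel, which substantially speeds up extraction on long documents.

In my test on a 50-page research paper (about 20,000 characters), LangExtract split the text into 12 semantic chunks and called Gemini in parallel to extract the key points. The whole run took about 8 minutes, more than 3 times faster than chunking by hand, with nothing missed.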
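The chunk-then-parallelize pattern can be sketched in a few lines of standard-library Python. This is an illustration of the general idea only, not LangExtract's actual chunker: `chunk_by_paragraph` and `fake_extract` are hypothetical names, and the per-chunk "model call" is a placeholder that counts words.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_by_paragraph(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars becomes its own oversized
    chunk instead of being cut in the middle.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def fake_extract(chunk: str) -> int:
    """Placeholder for a per-chunk LLM call; here it only counts words."""
    return len(chunk.split())


document = "First paragraph.\n\nSecond paragraph, a bit longer.\n\nThird."
chunks = chunk_by_paragraph(document, max_chars=40)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_extract, chunks))  # results keep chunk order
print(chunks)
print(results)
```

Threads fit here because a real per-chunk call would spend most of its time waiting on the network rather than computing.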

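4. Interactive visualization: HTML generated automatically

After extraction, LangExtract can generate an interactive HTML report containing the source text, the structured results, and the source location and context of each result. It supports highlighting extracted entities, filtering by field, and jumping to the position in the source, which suits review and debugging in a team.

This lets non-technical domain experts (doctors, lawyers) take part in validating results and lowers the barrier to cross-role collaboration.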

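5. Multi-model support: switch between cloud and local

LangExtract is not tied to one LLM. It can call cloud models such as Google Gemini and OpenAI GPT, and it can also use locally deployed models (for example Llama 3 or Mistral served through Ollama). This matters for sensitive data such as medical records or confidential legal documents: everything can run locally, so the data never leaves your machine.

6. No fine-tuning: usable in any domain out of the box

Thanks to few-shot learning and the general language understanding of LLMs, LangExtract needs no domain fine-tuning. Examples and a little configuration are enough to adapt it to medicine, law, finance, and other fields. For small teams and researchers with limited resources this lowers the barrier considerably.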

II. Installation and Setup in 3 Steps

Installation is simple. Python 3.10 is recommended.

    # Option 1: install from PyPI (recommended)
    pip install langextract

    # Option 2: Tsinghua mirror (faster for users in mainland China)
    pip install -i https://pypi.tuna.tsinghua.edu.cn/simple langextract

    # Option 3: development mode (editable source)
    git clone https://github.com/google/langextract.git
    cd langextract
    pip install -e .
 

Configure the API key

Google Gemini is the default. Get a key from Google AI Studio (free).

    # Option 1: environment variable (Linux/macOS)
    export LANGEXTRACT_API_KEY="your-api-key-here"

    # Option 2: .env file (recommended)
    cat >> .env << 'EOF'
    LANGEXTRACT_API_KEY=your-api-key-here
    EOF
    echo '.env' >> .gitignore  # keep the key out of version control
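Before starting a long job it can help to fail fast when the key is missing. The sketch below is one possible pre-flight check using only the standard library; `load_env_file` and `get_api_key` are hypothetical helpers, and the tiny `.env` parser handles only plain `KEY=value` lines. It does not replace whatever key lookup the library performs itself.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and '#' comments are ignored."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_api_key(env_file: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("LANGEXTRACT_API_KEY") or load_env_file(env_file).get(
        "LANGEXTRACT_API_KEY"
    )
    if not key:
        raise RuntimeError("LANGEXTRACT_API_KEY is not set")
    return key
```

Calling `get_api_key()` at the top of a script surfaces a missing key immediately instead of after the first request fails.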
 

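If you use a local model (for example through Ollama), no API key is needed; just make sure the Ollama service is running:

    # Install Ollama (macOS)
    brew install ollama

    # Start the service
    ollama serve

    # Pull a model (in another terminal)
    ollama pull gemma2:2b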
 

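III. Basic Usage: Extracting Information from Text

The core is the `lx.extract()` function. You need to:

1. Write a task description (the prompt)

2. Provide few-shot examples

3. Call the extraction API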

    import langextract as lx

    # 1. Describe the extraction task
    prompt = """
    Extract characters, emotions, and relationships in order of appearance.
    Use exact text for extractions. Do not paraphrase or overlap entities.
    """

    # 2. Provide few-shot examples
    examples = [
        lx.data.ExampleData(
            text="ROMEO: But soft! What light through yonder window breaks?",
            extractions=[
                lx.data.Extraction(
                    extraction_class="character",
                    extraction_text="ROMEO",
                    attributes={"emotional_state": "wonder"}
                ),
                lx.data.Extraction(
                    extraction_class="emotion",
                    extraction_text="But soft!",
                    attributes={"feeling": "gentle awe"}
                )
            ]
        )
    ]

    # 3. Call the extraction API
    input_text = "Juliet gazed at stars, her heart longing for Romeo."
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"  # recommended default model
    )

    # 4. Print the results
    print(f"Extracted {len(result.extractions)} entities:")
    for entity in result.extractions:
        print(f"• {entity.extraction_class}: {entity.extraction_text}")
        if entity.char_interval:
            pos = entity.char_interval
            print(f"  Position: {pos.start_pos}-{pos.end_pos}")
 

    Extracted 2 entities:
    • character: Juliet
      Position: 0-6
    • emotion: longing for Romeo
      Position: 33-50
 

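`result` is an `AnnotatedDocument` object holding every extracted entity with its class and attributes, and each extraction points back to its position in the source text.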
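For downstream use it is often convenient to flatten the extractions into plain dictionaries. The sketch below relies only on the attribute names that appear in the code above (`extraction_class`, `extraction_text`, `attributes`, `char_interval.start_pos`, `char_interval.end_pos`); `to_rows` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the snippet runs without calling a model.

```python
import json
from types import SimpleNamespace


def to_rows(extractions) -> list[dict]:
    """Flatten extraction-like objects into JSON-serializable dictionaries."""
    rows = []
    for e in extractions:
        interval = getattr(e, "char_interval", None)
        rows.append({
            "class": e.extraction_class,
            "text": e.extraction_text,
            "start": interval.start_pos if interval else None,
            "end": interval.end_pos if interval else None,
            "attributes": getattr(e, "attributes", None) or {},
        })
    return rows


# Stand-in objects that mimic the attribute names used above (not real results).
fake_extractions = [
    SimpleNamespace(
        extraction_class="character",
        extraction_text="Juliet",
        char_interval=SimpleNamespace(start_pos=0, end_pos=6),
        attributes={"emotional_state": "longing"},
    ),
    SimpleNamespace(
        extraction_class="emotion",
        extraction_text="longing for Romeo",
        char_interval=None,  # an extraction that could not be aligned
        attributes=None,
    ),
]

rows = to_rows(fake_extractions)
print(json.dumps(rows, indent=2))
```

With a real result, `to_rows(result.extractions)` would be the equivalent call, assuming the attributes behave as in the example above.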

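Key parameters

- `model_id`: which model to use
  - `gemini-2.5-flash`: recommended default; fast, cheap, good quality
  - `gemini-2.5-pro`: for complex tasks that need deeper reasoning
  - `gpt-4o`: an OpenAI model (needs extra configuration)
  - `gemma2:2b`: a local Ollama model (start Ollama first)
- `text_or_documents`: the input
  - a string passed directly
  - a URL (downloaded automatically)
  - a local file path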
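If passing a file path directly does not behave as expected in your installed version, a conservative alternative is to read the file yourself and hand the contents over as a string. The helper below is a hypothetical sketch (`load_text` is not a LangExtract function): it leaves URLs untouched, reads existing files, and otherwise treats the argument as literal text.

```python
from pathlib import Path


def load_text(source: str) -> str:
    """Return URLs unchanged, file contents for existing paths, else the string itself."""
    if source.startswith(("http://", "https://")):
        return source  # left as-is so the caller can pass it on for download
    try:
        path = Path(source)
        # Very long strings are almost certainly text, not paths.
        if len(source) < 1024 and path.is_file():
            return path.read_text(encoding="utf-8")
    except (OSError, ValueError):
        pass  # not a usable path; fall through and treat it as plain text
    return source
```

The result can then be passed as `text_or_documents=load_text("notes.txt")`, where `notes.txt` is a placeholder file name.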

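IV. Saving and Visualizing the Output

LangExtract can save results as `.jsonl` and generate an interactive HTML report from them:

    # Save the result as JSONL
    lx.io.save_annotated_documents(
        [result],
        output_name="results.jsonl",
        output_dir="."
    )

    # Generate the interactive HTML visualization
    html_content = lx.visualize("results.jsonl")
    with open("visualization.html", "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)  # Jupyter/Colab environments
        else:
            f.write(html_content)       # plain Python

Open `visualization.html` to see each entity highlighted at its position in the source text and explore the results interactively.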

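V. Long Documents: Parallel Processing and Multiple Passes

LangExtract is tuned for long documents and supports parallel processing and multi-pass extraction.

    # Process the full text of Romeo and Juliet (about 147,000 characters)
    result = lx.extract(
        text_or_documents="https://www.gutenberg.org/files/1513/1513-0.txt",
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash",
        # Long-document parameters
        extraction_passes=3,   # multiple passes to improve recall
        max_workers=20,        # parallel requests for speed
        max_char_buffer=1000   # characters per chunk; smaller is more precise
    )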
 

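Key parameters:

- `extraction_passes`: number of extraction passes
  - LLMs are non-deterministic, so a single pass may miss some entities
  - Several independent passes are run and merged, which improves recall
  - Merge strategy: the first pass wins; later passes only add new entities that do not overlap existing ones
- `max_workers`: degree of parallelism
  - Sends multiple LLM requests at the same time
  - Tune it to your API rate limit (Gemini Tier 2 allows higher concurrency)
- `max_char_buffer`: chunk size
  - Number of characters per chunk
  - Smaller chunks are more precise but need more requests
  - Recommended range: 500-2000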
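The "first pass wins" merge rule described above can be sketched as follows. This illustrates the stated strategy under simple assumptions, not the library's internal code: extractions are represented as plain dictionaries with half-open `[start, end)` character intervals, and `merge_passes` is a hypothetical name.

```python
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if half-open intervals [start, end) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge_passes(passes: list[list[dict]]) -> list[dict]:
    """Keep everything already accepted; add later items only if they do not overlap."""
    merged: list[dict] = []
    for extractions in passes:  # earlier passes are visited first, so they win
        for item in extractions:
            span = (item["start"], item["end"])
            if not any(overlaps(span, (m["start"], m["end"])) for m in merged):
                merged.append(item)
    return sorted(merged, key=lambda m: m["start"])


pass1 = [{"text": "Juliet", "start": 0, "end": 6}]
pass2 = [
    {"text": "Juliet gazed", "start": 0, "end": 12},        # overlaps pass 1: dropped
    {"text": "longing for Romeo", "start": 33, "end": 50},  # new region: kept
]
merged = merge_passes([pass1, pass2])
print([m["text"] for m in merged])
```

Because accepted items are never replaced, running more passes can only add extractions, which is why the extra passes raise recall rather than change earlier results.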

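Measured results:

- Input: about 147,000 characters (the full text of Romeo and Juliet)

- Output: 4,088 entities (characters, emotions, relationships)

- Time: about 17 seconds (20 workers)

- Recall: noticeably higher than a single pass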

VI. Using Other Models

LangExtract supports several model providers, so you can choose by cost, privacy, and performance.

Using OpenAI models

    import os
    import langextract as lx

    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gpt-4o",                # selects the OpenAI provider automatically
        api_key=os.environ.get("OPENAI_API_KEY"),
        fence_output=True,                # must be True for OpenAI
        use_schema_constraints=False      # must be False for OpenAI
    )

Note: OpenAI models need `fence_output=True` and `use_schema_constraints=False`, because LangExtract does not yet implement schema constraints for OpenAI.
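Without schema constraints, the model returns its JSON as text, typically wrapped in a code fence, and the caller has to unwrap it. The sketch below shows that general idea with the standard library; it is not LangExtract's parser, `parse_model_reply` is a hypothetical name, and the payload shape in the sample reply is invented for illustration.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the snippet stays readable
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def parse_model_reply(reply: str):
    """Parse JSON from a reply, unwrapping a fenced block if one is present."""
    match = FENCE_RE.search(reply)
    payload = match.group(1) if match else reply
    return json.loads(payload)


# A made-up reply: some chatter followed by a fenced JSON block.
reply = (
    "Here is the result:\n"
    + FENCE + "json\n"
    + '{"extractions": [{"character": "Juliet"}]}\n'
    + FENCE
)
print(parse_model_reply(reply))
```

The fence gives the parser an unambiguous boundary, so any prose the model adds around the JSON is ignored.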

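Using a local Ollama model

    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemma2:2b",  # selects the Ollama provider automatically
        model_url="http://localhost:11434",
        fence_output=False,
        use_schema_constraints=False
    )

This is completely free and the data never leaves your machine, which suits sensitive information (medical records, confidential documents). It does need hardware: an 8B model requires at least 16 GB of memory.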

Using Chinese LLMs

LangExtract works with any Chinese LLM that exposes an OpenAI-compatible API:

    # DeepSeek V3/R1
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="deepseek-chat",
        api_key="your-api-key",
        language_model_params={
            "base_url": "https://api.deepseek.com/v1"
        },
        fence_output=True,
        use_schema_constraints=False
    )

    # Alibaba Qwen (Tongyi Qianwen)
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="qwen-turbo",  # or qwen-plus, qwen-max
        api_key="sk-...",       # Alibaba Cloud DashScope API key
        language_model_params={
            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1"
        },
        fence_output=True,
        use_schema_constraints=False
    )

Tested and working: DeepSeek (V3, R1), ByteDance Doubao, Alibaba Qwen, Zhipu GLM-4, MiniMax, and others.

VII. Case Study: From Messy Text to Structured JSON

[Figure: code screenshot]

Let's walk through a realistic case and show the full path from messy text to clean JSON.

Scenario: extracting information from a medical record

Suppose we have a messy clinical note and need to extract structured data such as patient information, medications, and diagnoses.

Raw text:

    Patient: John Smith, Age: 45, Gender: Male
    Visit Date: 2025-03-15
    Chief Complaint: Persistent cough for 3 weeks
    History of Present Illness:
    Patient reports productive cough with yellow sputum for 3 weeks.
    No fever, no chest pain.
    History of hypertension for 5 years.
    Physical Exam:
    BP: 135/85 mmHg, HR: 78 bpm, Temp: 37.2°C
    Lungs: Clear to auscultation bilaterally
    Diagnosis:
    1. Acute bronchitis
    2. Hypertension, well-controlled
    Treatment Plan:
    Azithromycin 500mg PO once daily for 5 days
    Lisinopril 10mg PO daily (continue)
    Follow-up in 1 week if symptoms persist

Target output: structured JSON containing patient information, diagnoses, and medications.

Step 1: Define the extraction rules

    import langextract as lx

    prompt = """
    Extract patient demographics, diagnoses, and medications in order of appearance.
    Use exact text for extractions. Group related information using attributes.
    """

    # Provide a high-quality example
    examples = [
        lx.data.ExampleData(
            text="Patient: Jane Doe, Age: 32. Diagnosis: Diabetes Mellitus. Medication: Metformin 500mg twice daily.",
            extractions=[
                # Patient information
                lx.data.Extraction(
                    extraction_class="patient_name",
                    extraction_text="Jane Doe",
                    attributes={"info_type": "demographics"}
                ),
                lx.data.Extraction(
                    extraction_class="age",
                    extraction_text="32",
                    attributes={"info_type": "demographics"}
                ),
                # Diagnosis
                lx.data.Extraction(
                    extraction_class="diagnosis",
                    extraction_text="Diabetes Mellitus",
                    attributes={"info_type": "diagnosis"}
                ),
                # Medication
                lx.data.Extraction(
                    extraction_class="medication",
                    extraction_text="Metformin 500mg twice daily",
                    attributes={
                        "info_type": "medication",
                        "medication_name": "Metformin"
                    }
                )
            ]
        )
    ]
 

Step 2: Run the extraction

    input_text = """
    Patient: John Smith, Age: 45, Gender: Male
    Visit Date: 2025-03-15
    Chief Complaint: Persistent cough for 3 weeks
    History of Present Illness:
    Patient reports productive cough with yellow sputum for 3 weeks.
    No fever, no chest pain.
    History of hypertension for 5 years.
    Physical Exam:
    BP: 135/85 mmHg, HR: 78 bpm, Temp: 37.2°C
    Lungs: Clear to auscultation bilaterally
    Diagnosis:
    1. Acute bronchitis
    2. Hypertension, well-controlled
    Treatment Plan:
    Azithromycin 500mg PO once daily for 5 days
    Lisinopril 10mg PO daily (continue)
    Follow-up in 1 week if symptoms persist
    """

    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"
    )
 

Step 3: Process the results

    import json

    # Group the extractions by information type
    demographics = {}
    diagnoses = []
    medications = []

    for extraction in result.extractions:
        entity_class = extraction.extraction_class
        entity_text = extraction.extraction_text
        if entity_class == "patient_name":
            demographics["name"] = entity_text
        elif entity_class == "age":
            demographics["age"] = entity_text
        elif entity_class == "diagnosis":
            diagnoses.append(entity_text)
        elif entity_class == "medication":
            medications.append(entity_text)

    structured_data = {
        "demographics": demographics,
        "diagnoses": diagnoses,
        "medications": medications
    }

    # Print as JSON
    print(json.dumps(structured_data, indent=2, ensure_ascii=False))

 

    {
      "demographics": {
        "name": "John Smith",
        "age": "45"
      },
      "diagnoses": [
        "Acute bronchitis",
        "Hypertension, well-controlled"
      ],
      "medications": [
        "Azithromycin 500mg PO once daily for 5 days",
        "Lisinopril 10mg PO daily (continue)"
      ]
    }
 

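Step 4: Generate the interactive visualization

    # Save the result
    lx.io.save_annotated_documents(
        [result],
        output_name="medical_record_extraction.jsonl",
        output_dir="."
    )

    # Generate the visualization
    html_content = lx.visualize("medical_record_extraction.jsonl")
    with open("medical_viz.html", "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)

    print("✓ Visualization saved to medical_viz.html")

Open `medical_viz.html` and you can:

- See the exact position of each extracted field in the source text (highlighted)

- Click an entity to see its details (position, attributes, context)

- Filter by entity type (diagnoses only, medications only)

- Export to other formats (CSV, JSON)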

Complete code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    """
    LangExtract case study: extracting information from a medical record.
    A complete walkthrough from messy text to structured JSON.
    """
    import json

    import langextract as lx

    # ============================================
    # Step 1: Define the extraction rules
    # ============================================
    prompt = """
    Extract patient demographics, diagnoses, and medications in order of appearance.
    Use exact text for extractions. Group related information using attributes.
    """

    # Provide a high-quality example
    examples = [
        lx.data.ExampleData(
            text="Patient: Jane Doe, Age: 32. Diagnosis: Diabetes Mellitus. Medication: Metformin 500mg twice daily.",
            extractions=[
                # Patient information
                lx.data.Extraction(
                    extraction_class="patient_name",
                    extraction_text="Jane Doe",
                    attributes={"info_type": "demographics"}
                ),
                lx.data.Extraction(
                    extraction_class="age",
                    extraction_text="32",
                    attributes={"info_type": "demographics"}
                ),
                # Diagnosis
                lx.data.Extraction(
                    extraction_class="diagnosis",
                    extraction_text="Diabetes Mellitus",
                    attributes={"info_type": "diagnosis"}
                ),
                # Medication
                lx.data.Extraction(
                    extraction_class="medication",
                    extraction_text="Metformin 500mg twice daily",
                    attributes={
                        "info_type": "medication",
                        "medication_name": "Metformin"
                    }
                )
            ]
        )
    ]

    # ============================================
    # Step 2: Run the extraction
    # ============================================
    input_text = """
    Patient: John Smith, Age: 45, Gender: Male
    Visit Date: 2025-03-15
    Chief Complaint: Persistent cough for 3 weeks
    History of Present Illness:
    Patient reports productive cough with yellow sputum for 3 weeks.
    No fever, no chest pain.
    History of hypertension for 5 years.
    Physical Exam:
    BP: 135/85 mmHg, HR: 78 bpm, Temp: 37.2°C
    Lungs: Clear to auscultation bilaterally
    Diagnosis:
    1. Acute bronchitis
    2. Hypertension, well-controlled
    Treatment Plan:
    Azithromycin 500mg PO once daily for 5 days
    Lisinopril 10mg PO daily (continue)
    Follow-up in 1 week if symptoms persist
    """

    print("Extracting medical record information...")
    result = lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        model_id="gemini-2.5-flash"
    )
    print(f"✓ Extraction finished, found {len(result.extractions)} entities\n")

    # ============================================
    # Step 3: Process the results
    # ============================================
    # Group the extractions by information type
    demographics = {}
    diagnoses = []
    medications = []

    for extraction in result.extractions:
        entity_class = extraction.extraction_class
        entity_text = extraction.extraction_text
        if entity_class == "patient_name":
            demographics["name"] = entity_text
        elif entity_class == "age":
            demographics["age"] = entity_text
        elif entity_class == "diagnosis":
            diagnoses.append(entity_text)
        elif entity_class == "medication":
            medications.append(entity_text)

    structured_data = {
        "demographics": demographics,
        "diagnoses": diagnoses,
        "medications": medications
    }

    # Print as JSON
    print("=" * 50)
    print("Extraction result (JSON):")
    print("=" * 50)
    print(json.dumps(structured_data, indent=2, ensure_ascii=False))

    # ============================================
    # Step 4: Generate the interactive visualization
    # ============================================
    print("\n" + "=" * 50)
    print("Generating the interactive visualization report...")
    print("=" * 50)

    # Save the result
    lx.io.save_annotated_documents(
        [result],
        output_name="medical_record_extraction.jsonl",
        output_dir="."
    )

    # Generate the visualization
    html_content = lx.visualize("medical_record_extraction.jsonl")
    with open("medical_viz.html", "w", encoding="utf-8") as f:
        if hasattr(html_content, "data"):
            f.write(html_content.data)
        else:
            f.write(html_content)

    print("✓ Visualization report saved to medical_viz.html")
    print("\nOpen medical_viz.html to:")
    print("  - See the exact position of each extracted field in the source text (highlighted)")
    print("  - Click an entity to see its details (position, attributes, context)")
    print("  - Filter by entity type (diagnoses only, medications only)")
    print("  - Export to other formats (CSV, JSON)")

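FAQ and Pitfalls

Q1: Which model should I choose?

Cloud models (Gemini, GPT):

- Pros: high accuracy, suited to complex scenarios

- Cons: API calls cost money, and the data has to be uploaded to the cloud

- Use for: public data, complex reasoning, production

Local models (Ollama):

- Pros: completely free, data never leaves your machine

- Cons: needs hardware, accuracy is somewhat lower

- Use for: sensitive data (medical, confidential), testing and validation

Recommendations:

- Development and testing: `gemini-2.5-flash` (fast, cheap)

- Complex tasks: `gemini-2.5-pro` (stronger reasoning)

- Sensitive data: a local Ollama model (`gemma2:2b`)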
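The per-provider flags quoted earlier in this article can be collected in one small lookup so they are not retyped at every call site. This is a hypothetical convenience helper, not a LangExtract API; the prefix rules (`gemini`, `gpt-`, and a colon marking an Ollama tag such as `gemma2:2b`) are assumptions made for this sketch, while the flag values are the ones given in the OpenAI and Ollama examples.

```python
def provider_settings(model_id: str) -> dict:
    """Return the extra keyword arguments this article uses for each model family."""
    if model_id.startswith("gemini"):
        return {}  # the article's Gemini examples pass no extra flags
    if model_id.startswith("gpt-"):
        return {"fence_output": True, "use_schema_constraints": False}
    if ":" in model_id:  # assumption: "name:tag" identifies a local Ollama model
        return {
            "model_url": "http://localhost:11434",
            "fence_output": False,
            "use_schema_constraints": False,
        }
    raise ValueError(f"No settings known for model id {model_id!r}")


for model in ("gemini-2.5-flash", "gpt-4o", "gemma2:2b"):
    print(model, provider_settings(model))
```

A call site could then spread the result into the call, for example `lx.extract(..., model_id=model, **provider_settings(model))`; unknown IDs raise rather than guess.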

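Q2: How should few-shot examples be designed?

Key principles:

1. Examples must cover the core fields (do not leave out an important type)

2. The text style should match the text to be extracted as closely as possible

3. The extraction text must be an exact copy of the source (no paraphrasing)

4. Entities are listed in order of appearance

5. Attributes should be meaningful (they help convey context)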
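Principles 3 and 4 are mechanical enough to check before spending any API calls. The sketch below is a hypothetical linter (`check_example` is not part of LangExtract) that works on plain `(extraction_class, extraction_text)` pairs and reports snippets that are not verbatim or not in order; the sample text is the Jane Doe example from the case study.

```python
def check_example(text: str, extractions: list[tuple[str, str]]) -> list[str]:
    """Return a list of problems; an empty list means both rules hold."""
    problems = []
    cursor = 0  # position just past the previous match, which enforces ordering
    for cls, snippet in extractions:
        pos = text.find(snippet, cursor)
        if pos == -1:
            if snippet in text:
                problems.append(f"{cls}: {snippet!r} is out of order")
            else:
                problems.append(f"{cls}: {snippet!r} is not verbatim in the text")
        else:
            cursor = pos + len(snippet)
    return problems


text = ("Patient: Jane Doe, Age: 32. Diagnosis: Diabetes Mellitus. "
        "Medication: Metformin 500mg twice daily.")
good = [
    ("patient_name", "Jane Doe"),
    ("age", "32"),
    ("diagnosis", "Diabetes Mellitus"),
    ("medication", "Metformin 500mg twice daily"),
]
bad = [("age", "32"), ("patient_name", "Jane Doe"), ("medication", "Metformin 500 mg")]

print(check_example(text, good))
print(check_example(text, bad))
```

Because the cursor moves past each match, the same check also rejects examples whose extractions overlap one another.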

Number of examples:

- Simple tasks: 1-2 examples are enough

- Complex tasks: 2-3 examples are more stable

- Too many examples: cost goes up with little gain

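Q3: How can I improve extraction accuracy?

1. Improve the prompt

    # ❌ Too vague
    prompt = "Extract information from text."

    # ✅ Clear and specific
    prompt = """
    Extract patient demographics (name, age, gender), diagnoses, and medications in order of appearance.
    Use exact text from input for extraction_text.
    Group related medications using 'medication_group' attribute.
    """

 

2. Provide high-quality examples

- The example text should be representative

- The extractions should be accurate and complete

- The attributes should be meaningful

3. Use multiple passes

    result = lx.extract(
        ...,                  # other arguments as in the earlier examples
        extraction_passes=3   # improves recall
    )

 

4. Tune the chunk size

    result = lx.extract(
        ...,                   # other arguments as in the earlier examples
        max_char_buffer=1000   # too large may miss entities, too small raises cost
    )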
 

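Q4: Any performance tips?

Long documents:

- Set `extraction_passes` to 2 or 3 to improve recall

- Tune `max_workers` to your API rate limit

- Keep `max_char_buffer` in the 500-2000 range

Cost control:

- Use `gemini-2.5-flash` for development (cheap)

- Enable the Vertex AI Batch API in production (about 50% cheaper)

- Local models are free but need hardware

Speed:

- Increase parallelism (`max_workers` of 20 to 50)

- Use fewer, larger chunks (`max_char_buffer=2000`)

- Use `gemini-2.5-flash` (2-3 times faster than pro)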
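A quick back-of-the-envelope estimate shows how these three knobs interact. The sketch assumes that every pass re-processes every chunk and that chunks are cut at exactly `max_char_buffer` characters; real chunking respects text boundaries, so treat the numbers as a lower bound. Both function names are hypothetical.

```python
import math


def estimate_requests(total_chars: int, max_char_buffer: int, extraction_passes: int = 1) -> int:
    """Lower bound on LLM requests: chunks per pass times number of passes."""
    chunks = math.ceil(total_chars / max_char_buffer)
    return chunks * extraction_passes


def estimate_waves(requests: int, max_workers: int) -> int:
    """Number of sequential rounds needed when max_workers requests run at once."""
    return math.ceil(requests / max_workers)


# The Romeo and Juliet settings from this article: ~147,000 chars, 3 passes, 20 workers.
small_chunks = estimate_requests(147_000, 1000, extraction_passes=3)
large_chunks = estimate_requests(147_000, 2000, extraction_passes=3)
print(small_chunks, estimate_waves(small_chunks, 20))
print(large_chunks, estimate_waves(large_chunks, 20))
```

Doubling the chunk size roughly halves the request count, which is the cost and speed side of the precision trade-off described above.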

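Q5: How does it compare with other tools?

| Dimension | LangExtract | spaCy | LangChain | Docling |
|---|---|---|---|---|
| Core capability | LLM-driven extraction | Traditional NLP | LLM orchestration | Document parsing |
| Domain adaptation | No fine-tuning, example-driven | Requires training a model | Requires writing chains | N/A |
| Long text | Natively optimized | No optimization | Manual chunking | N/A |
| Source grounding | Native | None | Custom code needed | N/A |
| Visualization | Built-in HTML | displaCy | None | None |
| Learning curve | Low (example-driven) | Medium (needs NLP knowledge) | Medium (needs programming) | Low |
| Best fit | Domain-specific extraction | General NLP tasks | Multi-step pipelines | PDF to text |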