Since Microsoft's GraphRAG paper reports only vaguely defined gains, I measured them myself: GraphRAG improved faithfulness but no other RAGAS metric, so the ROI of knowledge graphs may not justify the hype.
Compared to vector-based RAG, GraphRAG (with both graph creation and retrieval done in Neo4j via Cypher) improves faithfulness (a precision-like RAGAS metric, i.e., whether the answer accurately reflects the information in the RAG documents) but leaves the other RAGAS metrics unchanged. Given the performance overhead, it may not deliver enough ROI to justify the hype around its accuracy advantage.
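For reference, RAGAS computes faithfulness as the share of claims in the generated answer that can be inferred from the retrieved context:
faithfulness = (number of claims in the answer supported by the retrieved context) / (total number of claims in the answer)
A faithfulness of 0.54, for example, means roughly half of the generated claims were grounded in the retrieved context.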
Implications (see the list of potential biases in this analysis at the bottom of the article):
Improved accuracy: GraphRAG may suit domains that demand high precision, such as medical or legal applications.
Complex relationships: it may excel in scenarios involving intricate entity relationships, such as analyzing social networks or supply chains.
Trade-off: the faithfulness gain comes at the cost of greater complexity in building and maintaining the knowledge graph, so the hype may not be warranted.

Introduction:
This post is a follow-up to Part 1 of my GraphRAG analysis, which ran RAG over the transcript of the U.S. presidential debate between Biden and Trump (a document that, as of this blog post, is not in any model's training data), comparing Neo4j's vector store (a graph database) against a FAISS vector store (a non-graph database). That gave a clean database-to-database comparison. In this post (Part 2), the comparison pairs knowledge graph creation and Cypher-based retrieval in Neo4j against the same FAISS baseline, to evaluate how the two approaches score on RAGAS metrics over the same document.
The code walkthrough follows below; the notebook is hosted on my GitHub.
Setting Up the Environment
First, let's set up the environment and import the necessary libraries:
import warnings
warnings.filterwarnings('ignore')
import os
import asyncio
import nest_asyncio
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dotenv import load_dotenv
from typing import List, Dict, Union
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Neo4jVector, FAISS
from langchain_core.retrievers import BaseRetriever
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain.chat_models import ChatOpenAI
from langchain.schema import Document
from neo4j import GraphDatabase
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_relevancy, context_recall
from datasets import Dataset
import random
import re
from tqdm.asyncio import tqdm
from concurrent.futures import ThreadPoolExecutor
# API keys
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")
neo4j_url = os.getenv("NEO4J_URL")
neo4j_user = os.getenv("NEO4J_USER")
neo4j_password = os.getenv("NEO4J_PASSWORD")
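For load_dotenv() to pick these up, a .env file should sit next to the notebook. A sketch with placeholder values (the variable names must match the os.getenv calls above; the URL shown is the default for a local instance):
OPENAI_API_KEY=sk-...
NEO4J_URL=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your-password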
Setting Up the Neo4j Connection
To use Neo4j as our graph database, let's set up the connection and create some utility functions:
# Connection strings
driver = GraphDatabase.driver(neo4j_url, auth=(neo4j_user, neo4j_password))

# Function to clear the Neo4j instance
def clear_neo4j_data(tx):
    tx.run("MATCH (n) DETACH DELETE n")

# Ensure vector index exists in Neo4j
def ensure_vector_index(recreate=False):
    with driver.session() as session:
        result = session.run("""
            SHOW INDEXES
            YIELD name, labelsOrTypes, properties
            WHERE name = 'entity_index'
              AND labelsOrTypes = ['Entity']
              AND properties = ['embedding']
            RETURN count(*) > 0 AS exists
        """).single()
        index_exists = result['exists'] if result else False
        if index_exists and recreate:
            session.run("DROP INDEX entity_index")
            print("Existing vector index 'entity_index' dropped.")
            index_exists = False
        if not index_exists:
            session.run("""
                CALL db.index.vector.createNodeIndex(
                    'entity_index',
                    'Entity',
                    'embedding',
                    1536,
                    'cosine'
                )
            """)
            print("Vector index 'entity_index' created successfully.")
        else:
            print("Vector index 'entity_index' already exists. Skipping creation.")

# Add embeddings to entities in Neo4j (up to 100 entities per call)
def add_embeddings_to_entities(tx, embeddings):
    # Match each entity by name so every node receives the embedding of
    # its own name (rather than one embedding overwriting a whole batch)
    query = """
    MATCH (e:Entity {name: $name})
    WHERE e.embedding IS NULL
    SET e.embedding = $embedding
    """
    entities = tx.run("MATCH (e:Entity) WHERE e.embedding IS NULL RETURN e.name AS name LIMIT 100").data()
    for entity in tqdm(entities, desc="Adding embeddings"):
        embedding = embeddings.embed_query(entity['name'])
        tx.run(query, name=entity['name'], embedding=embedding)
These functions help us manage the Neo4j database, ensuring a clean slate on each run and that our vector index is set up correctly.
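In the notebook these helpers run before ingestion; a minimal usage sketch consistent with the definitions above:
# Wipe data from any previous run, then (re)create the vector index
with driver.session() as session:
    session.execute_write(clear_neo4j_data)
ensure_vector_index(recreate=True)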
Data Processing and Graph Creation
Now, let's load the data and create the knowledge graph:
# Load and process the PDF
pdf_path = "debate_transcript.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
texts = text_splitter.split_documents(documents)

# Function to create graph structure
def create_graph_structure(tx, texts):
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    for text in tqdm(texts, desc="Creating graph structure"):
        prompt = ChatPromptTemplate.from_template(
            "Given the following text, identify key entities and their relationships. "
            "Format the output as a list of tuples, each on a new line: (entity1, relationship, entity2)\n\n"
            "Text: {text}\n\n"
            "Entities and Relationships:"
        )
        response = llm(prompt.format_messages(text=text.page_content))
        # Process the response and create nodes and relationships
        lines = response.content.strip().split('\n')
        for line in lines:
            if line.startswith('(') and line.endswith(')'):
                parts = line[1:-1].split(',')
                if len(parts) == 3:
                    entity1, relationship, entity2 = [part.strip() for part in parts]
                    # Create nodes and relationship
                    query = (
                        "MERGE (e1:Entity {name: $entity1}) "
                        "MERGE (e2:Entity {name: $entity2}) "
                        "MERGE (e1)-[:RELATED {type: $relationship}]->(e2)"
                    )
                    tx.run(query, entity1=entity1, entity2=entity2, relationship=relationship)
This approach uses GPT-3.5-Turbo to extract entities and relationships from the text, building a dynamic knowledge graph from the document's content.
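To make the parsing contract concrete, here is a hypothetical example of the kind of output the prompt asks for and how the loop above interprets it (the sample lines are invented for illustration):
sample_response = """(Biden, debated, Trump)
(CNN, hosted, the debate)
Commentary lines like this one are skipped."""
for line in sample_response.strip().split('\n'):
    if line.startswith('(') and line.endswith(')'):
        parts = line[1:-1].split(',')
        if len(parts) == 3:
            entity1, relationship, entity2 = [p.strip() for p in parts]
            print(f"{entity1} -[:RELATED {{type: {relationship}}}]-> {entity2}")
# Biden -[:RELATED {type: debated}]-> Trump
# CNN -[:RELATED {type: hosted}]-> the debate
One caveat of this simple parser: tuples whose entity names themselves contain commas split into more than three parts and are silently dropped.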
Setting Up the Retrievers
We'll set up two types of retrievers: one using FAISS for vector-based retrieval and one using Neo4j for graph-based retrieval.
# Embeddings model
embeddings = OpenAIEmbeddings()

# Create FAISS retriever
faiss_vector_store = FAISS.from_documents(texts, embeddings)
faiss_retriever = faiss_vector_store.as_retriever(search_kwargs={"k": 2})

# Neo4j retriever
def create_neo4j_retriever():
    # Clear existing data
    with driver.session() as session:
        session.run("MATCH (n) DETACH DELETE n")  # equivalent to the clear_neo4j_data function defined earlier
    # Create graph structure
    with driver.session() as session:
        session.execute_write(create_graph_structure, texts)
    # Add embeddings to entities
    with driver.session() as session:
        max_attempts = 10
        attempt = 0
        while attempt < max_attempts:
            count = session.execute_read(lambda tx: tx.run("MATCH (e:Entity) WHERE e.embedding IS NULL RETURN COUNT(e) AS count").single()['count'])
            if count == 0:
                break
            session.execute_write(add_embeddings_to_entities, embeddings)
            attempt += 1
        if attempt == max_attempts:
            print("Warning: Not all entities have embeddings after maximum attempts.")
    # Create Neo4j retriever
    neo4j_vector_store = Neo4jVector.from_existing_index(
        embeddings,
        url=neo4j_url,
        username=neo4j_user,
        password=neo4j_password,
        index_name="entity_index",
        node_label="Entity",
        text_node_property="name",
        embedding_node_property="embedding"
    )
    return neo4j_vector_store.as_retriever(search_kwargs={"k": 2})

# Cypher-based retriever
def cypher_retriever(search_term: str) -> List[Document]:
    with driver.session() as session:
        result = session.run(
            """
            MATCH (e:Entity)
            WHERE e.name CONTAINS $search_term
            RETURN e.name AS name,
                   [(e)-[r:RELATED]->(related) | related.name + ' (' + r.type + ')'] AS related
            LIMIT 2
            """,
            search_term=search_term
        )
        documents = []
        for record in result:
            content = f"Entity: {record['name']}\nRelated: {', '.join(record['related'])}"
            documents.append(Document(page_content=content))
        return documents
The FAISS retriever uses vector similarity to find relevant information, while the Cypher-based Neo4j retriever leverages the graph structure to find related entities and their relationships.
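As a quick sanity check, the two retrieval styles can be exercised side by side (the queries here are illustrative, not from the notebook):
# Vector retrieval: top-2 chunks by embedding similarity
faiss_docs = faiss_retriever.get_relevant_documents("What did the candidates say about the economy?")
# Graph retrieval: entities whose names contain the search term, plus their relationships
graph_docs = cypher_retriever("economy")
for doc in faiss_docs + graph_docs:
    print(doc.page_content[:120])
Note that CONTAINS matching in the Cypher retriever is case-sensitive and keyword-oriented, so it behaves very differently from embedding similarity when handed a full natural-language question.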
Creating the RAG Chains
Now, let's create our RAG chains:
def create_rag_chain(retriever):
    llm = ChatOpenAI(model_name="gpt-3.5-turbo")
    template = """Answer the question based on the following context:
{context}
Question: {question}
Answer:"""
    prompt = PromptTemplate.from_template(template)
    if callable(retriever):
        # For the Cypher retriever (a plain function taking the question string)
        retriever_func = lambda q: retriever(q)
    else:
        # For the FAISS/Neo4j vector retrievers (Runnable objects)
        retriever_func = retriever
    return (
        {"context": retriever_func, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

# Create RAG chains
faiss_rag_chain = create_rag_chain(faiss_retriever)
cypher_rag_chain = create_rag_chain(cypher_retriever)
These chains pair each retriever with the language model to generate answers grounded in the retrieved context.
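Since both chains end in StrOutputParser, they are invoked identically (the question below is a hypothetical example):
question = "Who were the moderators of the debate?"
print(faiss_rag_chain.invoke(question))   # context from vector similarity over chunks
print(cypher_rag_chain.invoke(question))  # context from entity/relationship matches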
Evaluation Setup
To evaluate our RAG systems, we'll create a ground truth dataset and use the RAGAS framework:
def create_ground_truth(texts: List[Union[str, Document]], num_questions: int = 100) -> List[Dict]:
    llm_ground_truth = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0.2)

    def get_text(item):
        return item.page_content if isinstance(item, Document) else item

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    all_splits = text_splitter.split_text(' '.join(get_text(doc) for doc in texts))
    ground_truth = []
    question_prompt = ChatPromptTemplate.from_template(
        "Given the following text, generate {num_questions} diverse and specific questions that can be answered based on the information in the text. "
        "Provide the questions as a numbered list.\n\nText: {text}\n\nQuestions:"
    )
    all_questions = []
    for split in tqdm(all_splits, desc="Generating questions"):
        response = llm_ground_truth(question_prompt.format_messages(num_questions=3, text=split))
        questions = response.content.strip().split('\n')
        all_questions.extend([q.split('. ', 1)[1] if '. ' in q else q for q in questions])
    random.shuffle(all_questions)
    selected_questions = all_questions[:num_questions]
    llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
    for question in tqdm(selected_questions, desc="Generating ground truth"):
        answer_prompt = ChatPromptTemplate.from_template(
            "Given the following question, provide a concise and accurate answer based on the information available. "
            "If the answer is not directly available, respond with 'Information not available in the given context.'\n\nQuestion: {question}\n\nAnswer:"
        )
        answer_response = llm(answer_prompt.format_messages(question=question))
        answer = answer_response.content.strip()
        context_prompt = ChatPromptTemplate.from_template(
            "Given the following question and answer, provide a brief, relevant context that supports this answer. "
            "If no relevant context is available, respond with 'No relevant context available.'\n\n"
            "Question: {question}\nAnswer: {answer}\n\nRelevant context:"
        )
        context_response = llm(context_prompt.format_messages(question=question, answer=answer))
        context = context_response.content.strip()
        ground_truth.append({
            "question": question,
            "answer": answer,
            "context": context,
        })
    return ground_truth
async def evaluate_rag_async(rag_chain, ground_truth, name):
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    generated_answers = []
    for item in tqdm(ground_truth, desc=f"Evaluating {name}"):
        question = splitter.split_text(item["question"])[0]
        try:
            answer = await rag_chain.ainvoke(question)
        except AttributeError:
            answer = rag_chain.invoke(question)
        truncated_answer = splitter.split_text(str(answer))[0]
        truncated_context = splitter.split_text(item["context"])[0]
        truncated_ground_truth = splitter.split_text(item["answer"])[0]
        generated_answers.append({
            "question": question,
            "answer": truncated_answer,
            "contexts": [truncated_context],
            "ground_truth": truncated_ground_truth
        })
    dataset = Dataset.from_pandas(pd.DataFrame(generated_answers))
    result = evaluate(
        dataset,
        metrics=[
            context_relevancy,
            faithfulness,
            answer_relevancy,
            context_recall,
        ]
    )
    return {name: result}

async def run_evaluations(rag_chains, ground_truth):
    results = {}
    for name, chain in rag_chains.items():
        result = await evaluate_rag_async(chain, ground_truth, name)
        results.update(result)
    return results

# Main execution function
async def main():
    # Ensure vector index
    ensure_vector_index(recreate=True)
    # Create retrievers
    neo4j_retriever = create_neo4j_retriever()
    # Create RAG chains
    faiss_rag_chain = create_rag_chain(faiss_retriever)
    neo4j_rag_chain = create_rag_chain(neo4j_retriever)
    # Generate ground truth
    ground_truth = create_ground_truth(texts)
    # Run evaluations
    rag_chains = {
        "FAISS": faiss_rag_chain,
        "Neo4j": neo4j_rag_chain
    }
    results = await run_evaluations(rag_chains, ground_truth)
    return results

# Run the main function
if __name__ == "__main__":
    nest_asyncio.apply()
    try:
        results = asyncio.run(asyncio.wait_for(main(), timeout=7200))  # 2 hour timeout
        plot_results(results)
        # Print detailed results
        for name, result in results.items():
            print(f"Results for {name}:")
            print(result)
            print()
    except asyncio.TimeoutError:
        print("Evaluation timed out after 2 hours.")
    finally:
        # Close the Neo4j driver
        driver.close()
This setup creates a ground truth dataset, evaluates our RAG chains with RAGAS metrics, and visualizes the results.
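One gap in the listing: the main block calls plot_results, which isn't defined above and would need to exist before the main block runs. A minimal sketch, assuming each RAGAS result behaves like a dict of metric scores (as in the older ragas releases these imports come from):
def plot_results(results):
    # One bar group per RAGAS metric, one bar per RAG system
    df = pd.DataFrame({name: dict(result) for name, result in results.items()})
    df.plot(kind="bar", figsize=(10, 6))
    plt.title("RAGAS metrics: FAISS vs. Neo4j GraphRAG")
    plt.ylabel("Score")
    plt.ylim(0, 1)
    plt.tight_layout()
    plt.show()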

Results and Analysis
The analysis revealed surprisingly similar performance between GraphRAG and vector-based RAG on most metrics, with one key difference:
Faithfulness:
Neo4j GraphRAG significantly outperformed FAISS (0.54 vs. 0.18).
The graph-based approach excels at faithfulness, likely because it preserves the relational context of the information. When retrieving, it can follow explicit relationships between entities, keeping the retrieved context more closely aligned with how the information was originally structured in the document.
Implications and Use Cases
While the overall similarity in performance suggests that for many applications the choice between graph-based and vector-based RAG may not significantly affect results, GraphRAG's faithfulness advantage could be decisive in some specific situations:
Faithfulness-critical applications: in domains where preserving exact relationships and context is essential (e.g., legal or medical), GraphRAG can offer significant benefits.
Complex relational queries: for scenarios involving intricate connections between entities (e.g., investigating financial networks or analyzing social relationships), GraphRAG's ability to traverse relationships could be advantageous.
Maintenance and updates: vector-based systems like FAISS may be easier to maintain and update, especially for frequently changing datasets.
Computational resources: the similar performance on most metrics suggests that the added complexity of setting up and maintaining a graph database may not always be justified, depending on the specific use case and available resources.
A Note on Potential Biases:
Knowledge graph creation: the graph structure was created with GPT-3.5-Turbo, which may introduce its own biases or inconsistencies in how entities and relationships are extracted.
Retrieval methods: the FAISS retriever uses vector similarity search while the Neo4j retriever uses Cypher queries. These fundamentally different approaches may favor certain types of queries or information structures, but that difference is exactly what is being evaluated.
Context window limitations: both approaches use a fixed context window size, which may not capture the full complexity of the knowledge graph structure where more would be needed.
Dataset specificity: as with any analysis of AI tools, results are specific to the data used; this analysis was performed on a single document (a debate transcript), which may not represent all potential use cases.

