[Artificial Intelligence] How to Use ChatGPT and Python to Build a Knowledge Graph in Neo4j from Your Own Articles - 大数跨境


学汇百川教育

2023-09-22

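Summary: In this article, I show how to use graph technology and a bit of programming to structure and explore the content of my own articles.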


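Using NLP techniques to structure unstructured data is not a new idea. However, recent advances in LLMs (large language models) have opened up countless opportunities to do it. ChatGPT has put this booming technology within reach of hobbyists, which has drawn broad attention to LLMs and generative models.

In fact, generative AI is already on the agenda at many companies!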

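Getting Started

The plan is as follows.

1. Get the API working and access it from Python.
2. Do prompt engineering with a sample text to make sure the GPT-4 model understands what you want from it.
3. Download the articles you have chosen and preprocess the data.
4. Extract and collect the output from ChatGPT.
5. Post-process the output from ChatGPT.
6. Write code in the Cypher query language to structure the data further into a graph.
7. Play with your new best friend and explore your articles.

Without further ado, let's start by quickly setting up the basic technology.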

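Setup

We need the Python programming language and the Neo4j graph database installed on the local machine.

First, make sure you have a Plus account at OpenAI so that you can use GPT-4. Second, make sure you are registered to use the API. Once that is in place, generate an API key, then run pip install openai.

Before connecting to ChatGPT, let's go to the browser and work out the right way to ask for this task. This is called prompt engineering, and getting it right is very important. By trying different phrasings, with one of my articles picked at random as the example, I found that the right approach is to give detailed instructions before supplying the actual text.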

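I ended up with the following prompt:

As an example, I gave it a snippet from an article I wrote long ago about the Gamma function:

The result was as follows:

Although it did not truly understand the task, it did reasonably well, especially with the format. However, it sometimes creates duplicates, and note that it hallucinates some entities and relationships even though we asked it not to. We will deal with that later.

For future use, we store this instruction in a Python file called prompt_input.py.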

# Entity tags the model is allowed to use.
entities = [
    "Mathematical entity", "Person", "Location", "Animal", "Activity",
    "Programming language", "Equation", "Date", "Shape", "Property",
    "Mathematical expression", "Profession", "Time period",
    "Mathematical subject", "Mathematical concept", "Discipline",
    "Mathematical theorem", "Physical entity", "Physics subject", "Physics",
]

# Relationship types the model is allowed to use (duplicate "HAS_PROPERTY" removed).
relationships = [
    "IS", "ARE", "WAS", "EQUIVALENT_TO", "CONTAINS", "PROPOSED",
    "PARTICIPATED_IN", "SOLVED", "RELATED_TO", "CORRESPONDS_TO",
    "HAS_PROPERTY", "REPRESENTS", "IS_USED_IN", "DISCOVERED", "FOUND",
    "IS_SOLUTION_TO", "PROVED", "LIVED_IN", "LIKED", "BORN_IN",
    "CONTRIBUTED_TO", "IMPLIES", "DESCRIBES", "DEVELOPED", "USED_FOR",
]

prompt = f"""You are a mathematician and a scientist helping us extract relevant information from articles about mathematics. The task is to extract as many relevant relationships between entities to mathematics, physics, or history and science in general as possible.

The entities should include all persons, mathematical entities, locations etc. Specifically, the only entity tags you may use are:
{', '.join(entities)}.

The only relationships you may use are:
{', '.join(relationships)}

As an example, if the text is "Euler was located in Sankt Petersburg in the 17 hundreds", the output should have the following format:
Euler: Person, LIVED_IN, Skt. Petersburg: Location

If we have "In 1859, Riemann proved Theorem A", then as an output you should return
Riemann: Person, PROVED, Theorem A: Mathematical theorem

I am only interested in the relationships in the above format and you can only use what you find in the text provided. Also, you should not provide relationships already found and you should choose less than 100 relationships and the most important ones.
You should only take the most important relationships as the aim is to build a knowledge graph. Rather a few but contextual meaningful than many nonsensical. Moreover, you should only tag entities with one of the allowed tags if it truly fits that category and I am only interested in general entities such as "Shape HAS Area" rather than "Shape HAS Area 1".

The input text is the following:
"""
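
The prompt asks for one relationship per line in the form "Euler: Person, LIVED_IN, Skt. Petersburg: Location". As a minimal sketch of how such a line can be turned back into structured data, the helper below splits it into its five parts. The name parse_triple is hypothetical and not part of the original code; the stripping of a leading list number mirrors what the graph-building code later in the article does.

```python
import re


def parse_triple(line):
    """Parse 'Euler: Person, LIVED_IN, Skt. Petersburg: Location'.

    Returns (entity1, label1, relation, entity2, label2), or None when the
    line does not follow the format requested in the prompt.
    """
    # Drop a leading list number such as "1." in front of the triple.
    line = re.sub(r"^\d+\.", "", line).strip()
    parts = line.split(",")
    if len(parts) != 3:
        return None
    left, relation, right = (p.strip() for p in parts)
    if ":" not in left or ":" not in right:
        return None
    entity1, label1 = (s.strip() for s in left.split(":", 1))
    entity2, label2 = (s.strip() for s in right.split(":", 1))
    return entity1, label1, relation, entity2, label2


print(parse_triple("1. Euler: Person, LIVED_IN, Skt. Petersburg: Location"))
# → ('Euler', 'Person', 'LIVED_IN', 'Skt. Petersburg', 'Location')
```

Lines that do not split into exactly three comma-separated parts return None, so free-form sentences in the model output are simply skipped.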

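With the basic setup in place, let's check that it actually works.

If the code is only for you and stays on your local machine, you can hard-code the API key in the Python file. Otherwise, set it as an environment variable or put it in a configuration file that you never push anywhere!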
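
For the environment-variable route, a small sketch could look like the following. The function name load_api_key is hypothetical, and the variable name OPENAI_API_KEY is an assumption here (it is the name conventionally used with the openai package); adapt both to your own setup.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    # `env` can be any mapping; it defaults to the real process environment.
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Example with an explicit mapping instead of the real environment:
print(load_api_key({"OPENAI_API_KEY": "sk-example"}))
# → sk-example
```

Failing early with a clear message is usually easier to debug than an authentication error returned by the API later on.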

To test the connection, we create a file called connect.py containing a basic connection from Python to ChatGPT.

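import os

# Written against openai<1.0; openai.ChatCompletion was removed in version 1.0.
import openai

from prompt_input import prompt

# Prefer the environment variable; the placeholder is only a local fallback.
openai.api_key = os.getenv("OPENAI_API_KEY", "<Your API key goes here>")


def process_gpt4(text):
    """This function prompts the gpt-4 model and returns the output"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # keep the output as deterministic as possible
        messages=[
            {"role": "user", "content": prompt + text},
        ],
    )
    result = response['choices'][0]['message']['content']
    return result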

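We verify that this works!

Data

I need to get the article content from different platforms. These may well be HTML files! That is fine if you want to read them in a browser, but not if you want to work with plain text.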

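Find the content online and save or download the pages directly. The downloaded files have rather messy file names; a typical name looks like "xxxxxxxxxxxxxx.html".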

Let's store these files in a folder called raw.

We write a small module called extract_text_from_html.py containing some functionality for extracting the text from these files:

from bs4 import BeautifulSoup


def extract_text_from_html(html_content):
    """This function extracts the text from the articles"""
    soup = BeautifulSoup(html_content, 'html.parser')
    # Remove script and style elements so their contents are not treated as text.
    for script in soup(["script", "style"]):
        script.extract()

    article_tag = soup.find('article')
    if article_tag:
        return " ".join(article_tag.stripped_strings)
    # No <article> element found: return an empty string rather than None.
    return ""
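
To make it easier to see what this extraction step does, here is a dependency-free sketch of the same idea using only Python's standard library html.parser. This is an illustrative alternative, not the code used in the article: the class ArticleTextParser and the function extract_article_text are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect text inside <article>, ignoring <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0  # > 0 while we are inside an <article>
        self.skip_depth = 0     # > 0 while we are inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth:
            self.article_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.article_depth and not self.skip_depth:
            self.chunks.append(text)


def extract_article_text(html_content):
    parser = ArticleTextParser()
    parser.feed(html_content)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><body><nav>Menu</nav><article><h1>Gamma</h1>"
    "<script>var x = 1;</script><p>Euler studied it.</p></article></body></html>"
)
print(extract_article_text(sample))
# → Gamma Euler studied it.
```

Navigation text outside the article element and the script contents are both dropped, which is the behaviour we want before sending text to the model.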

Before we can use this to get results from ChatGPT, we need to be able to split the text into batches, because GPT-4 has a token limit. In a file called preprocess.py, we write:

def text_to_batches(s, batch_size=2000):
    """Split text into batches of at most batch_size whitespace-separated words."""
    words = s.split()
    batches = []
    for i in range(0, len(words), batch_size):
        batch = ' '.join(words[i:i+batch_size])
        batches.append(batch)
    return batches
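
To see what the batching does, here is a small illustration on a toy sentence. The sample text and the tiny batch_size are chosen purely for illustration. Note that the function counts whitespace-separated words rather than model tokens, so the batch size is only a rough proxy for the token limit.

```python
def text_to_batches(s, batch_size=2000):
    """Split text into batches of at most batch_size whitespace-separated words."""
    words = s.split()
    return [
        " ".join(words[i:i + batch_size])
        for i in range(0, len(words), batch_size)
    ]


sample = "the gamma function extends the factorial to complex numbers"
print(text_to_batches(sample, batch_size=4))
# → ['the gamma function extends', 'the factorial to complex', 'numbers']
```

The last batch simply holds whatever words are left over, so no text is lost.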

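Now we are ready to actually get some data out of ChatGPT.

We write a file called process_articels.py that loops over the articles, recovers the title from the dreadful file name, extracts the actual text from the HTML file, runs each batch of text through ChatGPT, collects the results, and saves the model's output to a new file in a folder called data. We also save the actual text in a folder called cleaned for later use.

In practice, though, the code is simple, because we have already done part of the work in the other files.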

import os

from tqdm import tqdm

from connect import process_gpt4
from extract_text_from_html import extract_text_from_html
from preprocess import text_to_batches

base_path = 'raw'
# Make sure the output folders exist before listing or writing to them.
os.makedirs('data', exist_ok=True)
os.makedirs('cleaned', exist_ok=True)
processed_articles = os.listdir('data')

for file_name in tqdm(os.listdir(base_path)):
    # Recover the title from the file name.
    title = ' '.join(file_name.split('_')[-1].split('-')[:-1])
    # Skip articles that already have results (simple caching).
    if f'results_{title}.txt' in processed_articles:
        continue

    results = ''
    with open(os.path.join(base_path, file_name), 'r', encoding='utf-8') as f:
        content = f.read()
    extraction = extract_text_from_html(content)
    if not extraction:
        # Nothing could be extracted from this file.
        continue
    batches = text_to_batches(extraction)
    for batch in batches:
        gpt_results = process_gpt4(batch)
        # Keep batches on separate lines so relationships do not run together.
        results += gpt_results + '\n'

    with open(f'data/results_{title}.txt', 'w', encoding='utf-8') as results_file:
        results_file.write(results)

    with open(f'cleaned/cleaned_{title}.txt', 'w', encoding='utf-8') as cleaned_file:
        cleaned_file.write(extraction)

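The code above can take a while to run, because the GPT-4 model is relatively slow compared with other available, weaker models. We make sure to use a caching setup, so that if the program crashes we do not start from scratch but pick up where we left off.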
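
The caching boils down to deriving a title from each raw file name and skipping files whose results file already exists. The sketch below isolates that logic so it can be tried without calling the API. The helper names and the example file names are hypothetical; the real file names are not shown in the article.

```python
def title_from_file_name(file_name):
    """Same title rule as in the processing loop: text after the last '_',
    split on '-', with the final piece (ID and extension) dropped."""
    return " ".join(file_name.split("_")[-1].split("-")[:-1])


def pending_files(raw_files, processed_files):
    """Return the raw files that do not yet have a results file."""
    done = set(processed_files)
    return [
        f for f in raw_files
        if f"results_{title_from_file_name(f)}.txt" not in done
    ]


raw = [
    "2021-05-01_The-Gamma-Function-1a2b3c.html",  # hypothetical names
    "2021-06-01_Group-Theory-9f8e7d.html",
]
already_done = ["results_The Gamma Function.txt"]

print(title_from_file_name(raw[0]))
# → The Gamma Function
print(pending_files(raw, already_done))
# → ['2021-06-01_Group-Theory-9f8e7d.html']
```

Because results are written once per article, a crash in the middle of an article means that article is processed again from its first batch on the next run.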

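Now (after a few painful hours) we have a structured dataset of GPT-4 results. Perfect. Now we "only" need to build a graph from it.

Building the Knowledge Graph

We will combine the preprocessing and the graph creation into one function. This is usually not advisable (separation of concerns and all that), but because the preprocessing has to work at the level of individual relationships and entities, we may as well create the nodes and relationships in the graph while we have the data in hand.

Let's create a small API containing the driver, so that we can interact with our graph.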

from neo4j import GraphDatabase


class LoadGraphData:
    """Small wrapper around the Neo4j driver."""

    def __init__(self, username, password, uri):
        self.username = username
        self.password = password
        self.uri = uri
        self.driver = GraphDatabase.driver(self.uri, auth=(self.username, self.password))

    def create(self, query):
        """Run a write query and wait for it to finish."""
        with self.driver.session() as graphDB_Session:
            return graphDB_Session.run(query).consume()

    def set_max_nodes(self, number):
        # NOTE: ":config" is a Neo4j Browser command, not Cypher, so sending it
        # through the driver is expected to fail; run it in the Browser instead.
        query = f":config initialNodeDisplay: {number}"
        with self.driver.session() as graphDB_Session:
            return graphDB_Session.run(query)

    def delete_graph(self):
        """Delete every node and relationship in the database."""
        delete = "MATCH (n) DETACH DELETE n"
        with self.driver.session() as graphDB_Session:
            graphDB_Session.run(delete)

    @staticmethod
    def do_cypher_tx(tx, cypher):
        result = tx.run(cypher)
        values = []
        for record in result:
            values.append(record.values())
        return values

    def work_with_data(self, query):
        with self.driver.session() as session:
            # read_transaction is deprecated in the 5.x driver in favour of execute_read.
            values = session.read_transaction(self.do_cypher_tx, query)
        return values

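We need to loop over the results, make sure entities are not too long, clean the results, and define nodes and relationships from the GPT model's output. We do not want to call the graph with the same query more than once, we want each entity to be connected to the article it came from, and we need to make sure that every entity and relationship from ChatGPT actually occurs in the text!

The last of these requirements is there to raise the probability that we can trust the graph. If you think about it, it is not foolproof.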
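
The check itself is a case-insensitive substring test on the entity names. As a minimal sketch (the function name is_grounded and the sample sentence are made up for illustration), it can be written as:

```python
def is_grounded(entity1, entity2, article_text):
    """True when both entity names occur somewhere in the article text."""
    text = article_text.lower()
    return entity1.lower() in text and entity2.lower() in text


article = "Euler studied the Gamma function in Sankt Petersburg."

print(is_grounded("Euler", "Gamma function", article))
# → True
print(is_grounded("Euler", "Riemann", article))
# → False
print(is_grounded("Gamma function", "Sankt Petersburg", article))
# → True
```

The last call shows why the check is not foolproof: both names occur in the text, so a relationship between them would pass even though the sentence states no such relationship.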

import os
import re

from tqdm import tqdm

from prompt_input import entities, relationships
from loader import LoadGraphData  # assumes the class above is saved as loader.py


def create_relationships(loader, title, e1, l1, e2, l2, R):
    # First make sure the article node and both entity nodes exist.
    query = (
        f'MERGE (:Article {{name: "{title}"}}) '
        f'MERGE (:{l1} {{name: "{e1}"}}) '
        f'MERGE (:{l2} {{name: "{e2}"}})'
    )
    loader.create(query)

    # Then connect the entities to each other and to the article.
    query = (
        f'MATCH (t:Article {{name: "{title}"}}) '
        f'MATCH (a:{l1} {{name: "{e1}"}}) '
        f'MATCH (b:{l2} {{name: "{e2}"}}) '
        f'MERGE (a)-[:{R}]->(b) '
        f'MERGE (a)-[:IN_ARTICLE]->(t) '
        f'MERGE (b)-[:IN_ARTICLE]->(t)'
    )
    loader.create(query)


def make_graph(source, cleaned):
    loader = LoadGraphData("neo4j", "<password>", "bolt://localhost:7687")
    loader.delete_graph()

    history = []
    for results in tqdm(os.listdir(source)):
        with open(os.path.join(source, results), encoding='utf-8') as r:
            content = r.read()
            lines = content.split('\n')
        # Skip articles with too little model output.
        if len(lines) < 10:
            continue

        cleaned_name = 'cleaned_' + '_'.join(results.split('_')[1:])
        with open(os.path.join(cleaned, cleaned_name), encoding='utf-8') as c:
            cleaned_content = c.read()

        for line in lines:
            # Strip a leading list number such as "1." from the model output.
            line = re.sub(r'^\d+\.', '', line).strip()
            splitted = line.split(',')
            if len(splitted) == 3:
                A = splitted[0]
                R = splitted[1].strip()
                B = splitted[2]

                if ':' not in A or ':' not in B:
                    continue

                e1, l1 = A.split(':')[0], A.split(':')[1]
                e2, l2 = B.split(':')[0], B.split(':')[1]
                e1, e2, l1, l2 = e1.strip(), e2.strip(), l1.strip(), l2.strip()

                # Drop relationships whose entities do not occur in the article text.
                if e1.lower() not in cleaned_content.lower() or e2.lower() not in cleaned_content.lower():
                    continue

                # Reduce person names to their last capitalised word (usually the surname).
                if l1 == 'Person':
                    for subname in e1.split()[::-1]:
                        if subname[0].upper() == subname[0]:
                            e1 = subname
                            break

                if l2 == 'Person':
                    for subname in e2.split()[::-1]:
                        if subname[0].upper() == subname[0]:
                            e2 = subname
                            break

                # Keep only allowed labels and relationships, and short, distinct entity names.
                if R == R.upper() and R in relationships and l1 in entities and l2 in entities and len(e1.split()) < 5 and len(e1) > 1 and len(e2.split()) < 5 and len(e2) > 1 and e1 != e2:
                    if line not in history:
                        history.append(line)

                        # Labels cannot contain spaces; names must not contain quotes.
                        l1 = l1.replace(" ", "_")
                        l2 = l2.replace(" ", "_")
                        e1 = e1.replace('"', '')
                        e2 = e2.replace('"', '')
                        title = results.split('.')[0].replace(' ', '_')
                        title = '_'.join(title.split('_')[1:])

                        create_relationships(loader=loader, title=title, e1=e1, l1=l1, e2=e2, l2=l2, R=R)
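
The queries in create_relationships interpolate entity names directly into the Cypher text, which is why double quotes have to be stripped from the names first. One possible alternative, sketched here under the assumption that you are free to change the loader, is to pass the names as query parameters. Labels and relationship types generally cannot be passed as parameters in Cypher, so they are still interpolated, after a simple validity check. The function build_merge_query is a hypothetical helper, not part of the original code.

```python
import re

# Labels and relationship types: letters, digits and underscores only.
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_merge_query(title, e1, l1, e2, l2, rel):
    """Return (query, params) that create both entities and link them."""
    for name in (l1, l2, rel):
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"Unsafe label or relationship type: {name!r}")
    query = (
        "MERGE (t:Article {name: $title}) "
        f"MERGE (a:{l1} {{name: $e1}}) "
        f"MERGE (b:{l2} {{name: $e2}}) "
        f"MERGE (a)-[:{rel}]->(b) "
        "MERGE (a)-[:IN_ARTICLE]->(t) "
        "MERGE (b)-[:IN_ARTICLE]->(t)"
    )
    params = {"title": title, "e1": e1, "e2": e2}
    return query, params


query, params = build_merge_query(
    "Gamma_Function", "Euler", "Person",
    "Gamma function", "Mathematical_concept", "DISCOVERED",
)
print(params)
# → {'title': 'Gamma_Function', 'e1': 'Euler', 'e2': 'Gamma function'}
```

With the Neo4j Python driver, the pair can then be executed with session.run(query, params), so the entity names never become part of the query text.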

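Of course, there are many ways to design a knowledge graph schema. Working out what should be a node and what should be a relationship is not easy, but since we do not want relationships between relationships, we went with the approach above.

Also, I chose a minimalist approach for this article. Normally, we would enrich the nodes and relationships with more properties.

Now we just need a main entry point.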

from make_graph import make_graph
make_graph(source='data', cleaned='cleaned')

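That's it. We now have a knowledge graph; in fact, we have about 2,000 nodes and 4,500 relationships.

Exploring the Graph

So what can we do with this thing? What should we ask it?

Let's try to find out in how many articles each person appears. We get the following: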
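
The query behind this result is not reproduced in the text. As a rough sketch of what such a query might look like, assuming the schema created by make_graph above (Person nodes linked to Article nodes through IN_ARTICLE), one could write the following. The constant and function names are hypothetical, and the limit of 10 is arbitrary.

```python
# Count, per person, the number of distinct articles the person is linked to.
PERSON_ARTICLE_COUNT = """
MATCH (p:Person)-[:IN_ARTICLE]->(a:Article)
RETURN p.name AS person, count(DISTINCT a) AS articles
ORDER BY articles DESC
LIMIT 10
"""


def top_persons(loader, query=PERSON_ARTICLE_COUNT):
    """Run the query through any object exposing work_with_data(query),
    such as the LoadGraphData class defined earlier."""
    return loader.work_with_data(query)
```

Each returned row would hold a person's name followed by the article count.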

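Not surprisingly, Euler tops the list. I could of course look up the titles of the articles if I wanted to, but let's move on.

Let's try something else. Let's see how many articles mention both Riemann and Euler.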
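
A possible way to express this question, again assuming the IN_ARTICLE schema built earlier, is to match an article that both persons point to. The helper co_mention_query is hypothetical. Person nodes are stored under the last capitalised word of the name, so plain surnames are used as parameter values.

```python
def co_mention_query(first, second):
    """Return (query, params) counting articles linked to both persons."""
    query = (
        "MATCH (:Person {name: $first})-[:IN_ARTICLE]->(a:Article)"
        "<-[:IN_ARTICLE]-(:Person {name: $second}) "
        "RETURN count(DISTINCT a) AS articles"
    )
    return query, {"first": first, "second": second}


query, params = co_mention_query("Riemann", "Euler")
print(params)
# → {'first': 'Riemann', 'second': 'Euler'}
```

The pair can be executed with the Neo4j Python driver through session.run(query, params).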

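Let's see how many of Euler's discoveries my articles mention.

Let's see how many articles share some mathematical keywords with the group theory article.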
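
One way such a query could be written, under the same schema assumptions, is to walk from the source article to its entities and on to other articles. The sketch below treats any node whose label starts with "Mathematical" as a mathematical keyword, which is an assumption made here for illustration. The function name is hypothetical, and the exact article name in the graph (the title with spaces replaced by underscores) has to be supplied by the caller.

```python
def shared_keyword_query(article_name):
    """Return (query, params) listing articles that share mathematical
    keyword nodes with the given article."""
    query = (
        "MATCH (src:Article {name: $name})<-[:IN_ARTICLE]-(k)"
        "-[:IN_ARTICLE]->(other:Article) "
        "WHERE other <> src "
        "AND any(l IN labels(k) WHERE l STARTS WITH 'Mathematical') "
        "RETURN other.name AS article, count(DISTINCT k) AS shared "
        "ORDER BY shared DESC"
    )
    return query, {"name": article_name}


query, params = shared_keyword_query("<name of the group theory article>")
print(params)
# → {'name': '<name of the group theory article>'}
```

Counting the returned rows gives the number of articles that share at least one keyword with the source article.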

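The result is shown here as 27 other articles, connected through the non-orange nodes in the figure above. Although this is only a demo, one can imagine how the same approach could show how business documents are related through various sensitive keywords that matter in areas such as GDPR or auditing.

Takeaways

This work should be seen as what we call a proof of concept. We cannot really use my articles for anything, but if this were text from a company, containing information about its customers and employees (from emails to Word files, PDFs, and so on), it could be used to map the relationships between customers and to see which employees work closely together.

That in turn would give us a 360-degree view of how data flows through the organization, who the most important people are in a particular kind of information flow, whom to contact if you want to know about a specific topic or document, whether a customer our department has contacted was previously contacted by another department, and so on.

Extremely valuable information. Of course, we cannot use ChatGPT for this, because we do not know what happens to the data we send. So asking it about sensitive or business-critical information is not a good idea. What we need to do is download another LLM (large language model) that lives only on our own laptop: a local LLM. We can even fine-tune it on our own data. As we speak, many companies are already doing this to build chatbots, assistants, and so on.

But if you ask me, using it to build knowledge graphs from unstructured data is the next level, and I think I have shown that it is entirely feasible!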