Building a Retrieval-Augmented Generation (RAG) Pipeline with vLLM, LangChain, and Chroma




AMD Developer Center

2025-11-21

Original authors: Rasmus Larsson, Emelie Wahlstrom, Saroosh Shabbir

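In this blog post from the AMD Silo AI project, we build a simple RAG (Retrieval-Augmented Generation) pipeline from scratch. Pretrained large language models are powerful, but they cannot directly access internal or private enterprise knowledge. RAG addresses this by retrieving content from an external knowledge base when the user asks a question and injecting it into the prompt, so that the model answers from up-to-date, private context. A typical RAG pipeline looks roughly like this (see Figure 1):

1. Submit: The user sends a question (the query).

2. Retrieve: Based on the question, text chunks that are semantically related to it (the context) are retrieved from a vector database.

3. Augment: The retrieved context is combined with the user's question to form the complete prompt.

4. Generate: The model produces a context-aware answer from "question + context".

Figure 1: Overview of the RAG pipeline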
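The four steps can be sketched in a few lines of plain Python before any real components are introduced. This is a toy illustration only: the word-overlap retriever and the placeholder generate function are assumptions made for this sketch and stand in for the vector database and the LLM used later in the post.

```python
def retrieve(query, documents, n_results=1):
    """Toy retriever: rank documents by how many words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:n_results]


def augment(query, context_chunks):
    """Combine the retrieved context and the question into one prompt."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Placeholder for the LLM call; it only reports the prompt size."""
    return f"[LLM would answer here, given a {len(prompt)}-character prompt]"


documents = [
    "Product ABC has ID 12345.",
    "Segment B contains price-sensitive customers.",
]
query = "What is the ID of product ABC?"   # 1. Submit
chunks = retrieve(query, documents)         # 2. Retrieve
prompt = augment(query, chunks)             # 3. Augment
print(generate(prompt))                     # 4. Generate
```

The rest of the post replaces each placeholder with a real component: Chroma for retrieval and a vLLM-served model for generation.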

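This post uses vLLM, LangChain, and Chroma to build an entry-level RAG pipeline step by step. To explore other framework combinations (such as LlamaIndex [1], FAISS [2], Haystack [3], and LangGraph [4]), see the link to the original post at the end of this article.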

PART.01

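Prerequisites

To run this tutorial, your system needs to meet the following requirements:

GPU: Make sure you are using an AMD GPU or another compatible GPU that supports ROCm.

Host system: Meets the ROCm system requirements [5].

ROCm 6.4: Install and verify ROCm by following the official ROCm installation guide [6]. Run the following command in a terminal to confirm the installation: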

rocm-smi

Docker: Make sure Docker is installed and configured correctly. See the official Docker installation guide [7].

PART.02

Prepare the Environment

Pull the Docker image

docker pull rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909

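vLLM is an open-source library for high-throughput, low-latency large language model inference. It introduces PagedAttention, a memory management mechanism designed for LLM inference that manages the KV cache more efficiently, handles many concurrent requests better, and maximizes GPU utilization.

When used with AMD ROCm, vLLM automatically takes advantage of the high-performance kernels and runtime that ROCm provides to accelerate inference on AMD GPUs. The image and the vllm serve arguments used here follow the official guide to using vLLM with ROCm [8] (search the AMD ROCm documentation site for the relevant section).

Start the Docker container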

docker run -it --rm \
    -p 8888:8888 \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --shm-size 16G \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    -v $(pwd):/workspace \
    -w /workspace/notebooks \
    --env HUGGINGFACE_HUB_CACHE=/workspace \
    --name test \
    rocm/vllm:rocm6.4.1_vllm_0.10.1_20250909

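Notes:

1. -v $(pwd):/workspace mounts the current directory at /workspace inside the container.

2. Passing --device /dev/dri exposes every GPU on the system to the container by default. If you only want to expose some of the GPUs, pass only the corresponding device nodes. For how to restrict a container to specific GPUs, see the section on restricting GPU access in the ROCm documentation.

Install dependencies

Inside the container, install the Python dependencies this tutorial needs: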

pip install jupyter chromadb sentence_transformers langchain

Confirm the installed versions:

pip list | grep -E 'chromadb|jupyterlab|sentence-transformers|langchain'

Start the JupyterLab server:

jupyter-lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root

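Notes:

Make sure port 8888 on your machine is not already in use. If it is, use a different port, for example --port=8890 (and adjust the Docker port mapping to match).

If you prefer a different interpreter (for example IPython), you can run the code that way instead. This tutorial uses JupyterLab as the example.

If you are running the container on a remote machine, open a new terminal on your local machine and use SSH port forwarding as follows to access Jupyter: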

ssh -L 8888:localhost:8888 username@remote-hostname

PART.03

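Start the vLLM Server

This tutorial uses the Qwen3-30B-A3B model. You can swap in a different model, but note that some models require you to accept a license agreement on Hugging Face and be granted access before you can use them.

vllm serve supports a large number of startup arguments, such as context length and dtype; the full list is in the official vLLM documentation. The combination of arguments used here follows the configuration recommended for this model in the ROCm documentation [9].

In a Jupyter terminal window, run the following command to start the vLLM server: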

model="Qwen/Qwen3-30B-A3B"
tp=1
dtype="auto"
kv_cache_dtype="auto"
max_num_seqs=256
max_seq_len_to_capture=32768
max_num_batched_tokens=32768
max_model_len=8192

vllm serve $model \
    -tp $tp \
    --dtype $dtype \
    --kv-cache-dtype $kv_cache_dtype \
    --max-num-seqs $max_num_seqs \
    --max-seq-len-to-capture $max_seq_len_to_capture \
    --max-num-batched-tokens $max_num_batched_tokens \
    --max-model-len $max_model_len \
    --no-enable-prefix-caching \
    --host 0.0.0.0 \
    --port 3000 \
    --swap-space 16 \
    --disable-log-requests \
    --trust-remote-code \
    --gpu-memory-utilization 0.9

Wait for the model to load and the server to start. When you see a line like INFO: Application startup complete, the server is ready to accept requests.
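Instead of watching the log for the startup message, a notebook cell can poll the server until it responds. The following is a minimal sketch, assuming the server listens on port 3000 as configured above and that its OpenAI-compatible /v1/models route answers with HTTP 200 once the model is loaded. The helper names server_is_up and wait_until_ready are hypothetical and not part of vLLM.

```python
import time
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:3000/v1/models", timeout_s=5):
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: treat as "not ready yet"
        return False


def wait_until_ready(probe=server_is_up, max_wait_s=600, interval_s=5, sleep=time.sleep):
    """Call `probe` every `interval_s` seconds until it succeeds or time runs out."""
    waited = 0
    while waited <= max_wait_s:
        if probe():
            return True
        sleep(interval_s)
        waited += interval_s
    return False
```

Calling wait_until_ready() in the notebook returns True once the server answers, or False after roughly max_wait_s seconds. The default of 600 seconds is an assumed value; loading a large model can take longer on some systems.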

PART.04

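Test Server Connectivity

In the notebook, run the following code to send a test request to vLLM through the /v1/chat/completions endpoint and verify that the server is working. We disable "thinking mode" here (enable_thinking: False) because this example does not need complex reasoning.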

# Import libraries
import requests
import chromadb
from chromadb.utils import embedding_functions
from langchain_text_splitters import RecursiveCharacterTextSplitter

def query_vllm(system_prompt, user_query):
    """
    Query the vLLM server using the provided user question.
    """
    # Define URL
    vllm_url = "http://localhost:3000/v1/chat/completions"

    # Prepare the payload
    payload = {
        "model": "Qwen/Qwen3-30B-A3B",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        # Disable reasoning
        "chat_template_kwargs": {"enable_thinking": False}
    }

    # Send the request to the vLLM server
    response = requests.post(vllm_url, json=payload)

    # Parse the response
    response_data = response.json()
    return response_data["choices"][0]["message"]["content"]

Note:
Make sure the --port 3000 given to vllm serve matches the port in vllm_url above (http://localhost:3000). If that port is in use, pick another one and change it in both places.

# vLLM setup
system_prompt = "You are a helpful assistant. Answer the user's question. If you don't know the answer, say you don't know."
test_query = "What is the capital of France?"

# Query
test_result = query_vllm(system_prompt=system_prompt, user_query=test_query)
print(test_result)
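If the test request fails, query_vllm surfaces it as a KeyError on "choices", which says little about the cause. The sketch below separates building the request body from reading the response so that each part can be checked on its own. The helper names build_chat_payload and extract_answer are hypothetical, and the sketch assumes that a failed request returns JSON without a "choices" list.

```python
def build_chat_payload(system_prompt, user_query,
                       model="Qwen/Qwen3-30B-A3B", enable_thinking=False):
    """Build the JSON body for a /v1/chat/completions request."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        # Same switch as in query_vllm: turn the model's thinking mode on or off
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def extract_answer(response_data):
    """Return the assistant message, or raise a readable error if it is missing."""
    choices = response_data.get("choices")
    if not choices:
        # Assumption: error responses are JSON objects without a "choices" list
        raise ValueError(f"Unexpected response from server: {response_data}")
    return choices[0]["message"]["content"]
```

Inside query_vllm, these could replace the inline payload and the final return. Passing timeout= to requests.post and calling response.raise_for_status() before parsing are further options for catching connection and HTTP errors early.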

PART.05

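Why Do We Need RAG?

Let's first call the model directly, without RAG, to see how it behaves when it lacks external knowledge. Below we ask the model about a fictional product, information that almost certainly does not exist in its training data.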

# Prepare the system and user prompts
system_prompt = "You are a helpful assistant. Answer the user's question. If you don't know the answer, say you don't know."
test_query = "What is the manufacturing ID of product ABC made by the fictional company XYZ Corp?"

# Query
test_result = query_vllm(system_prompt=system_prompt, user_query=test_query)
print(test_result)

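In this case the model either tells you it does not know or hallucinates. This is expected, because the model has no prior information about the fictional company XYZ Corp.

Note:
When a model produces confident-sounding but incorrect content on a topic outside its training data, we generally call it a "hallucination".

To improve answer accuracy, we need to give the model additional context. Next, we define a helper function to test whether, given "context + question", the model can produce a correct answer that is grounded in the context. Note that in the code below:

The system prompt is modified to stress that the model must answer only from the context;

The context is passed in through the user message;

The true answer, answer, is used only for comparison and is not passed to the model.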

def test_rag_scenario(context, query, answer=None):
    """
    Query the vLLM server using the provided user question and prompt
    """
    print("\n--- Testing RAG Scenario ---")
    print(f"Question: {query}")
    if answer:
        print(f"Expected Answer: {answer}\n")

    system_prompt = (
        "You are a helpful AI assistant. Answer the user's question based only on the provided context. "
        "If the answer is not in the context, say you don't know."
    )
    user_message = (
        f"Here is the context:\n---\n{context}\n---\n"
        f"Based on only the context above, answer the following question: \nQuestion: {query}\nAnswer:"
    )

    print("\n--- Query (user query sent to model) ---")
    print(system_prompt)
    print(user_message)

    response = query_vllm(system_prompt=system_prompt, user_query=user_message)
    print("\n--- Generated response ---")
    print(f"{response}")

Now we pass a "product specification" that contains the answer to the model as context and ask the same question again:

# Example context
context = """
Product specification - XYZ Corp:
- Product ABC - ID: 12345
- Product ABC - Name: Widget
- Product ABC - Description: A useful widget
- Product ABC - Price: $19.99
"""

# Query
test_query = "What is the manufacturing ID of product ABC made by the fictional company XYZ Corp?"

# Expected answer
true_answer = "12345"

# Test the RAG scenario
test_rag_scenario(context=context, query=test_query, answer=true_answer)

Because we supplied context containing the answer this time, the model can answer correctly.

PART.06

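Build a Vector Database with Chroma

In real-world scenarios, documents are often very long; tens or hundreds of pages are common. Stuffing an entire document into the prompt is inefficient and is also limited by the context length. So we typically store documents in a vector database. At query time, semantic search first finds the most relevant chunks, and those chunks are then passed to the model so that it can generate an answer from this "external knowledge". In this tutorial we use Chroma [10], an open-source vector database, to store and retrieve text embeddings.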
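To make "semantic search" concrete: each chunk and each query is turned into a vector, and the chunks whose vectors lie closest to the query vector are returned. The sketch below illustrates the idea with cosine similarity on made-up three-dimensional vectors. Both the vectors and the choice of cosine similarity are assumptions for illustration; real embeddings have hundreds of dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings", for illustration only
chunk_vectors = {
    "product specification": [0.9, 0.1, 0.0],
    "annual report": [0.1, 0.9, 0.2],
}
query_vector = [0.8, 0.2, 0.1]

# Pick the chunk whose vector points in the direction most similar to the query
best = max(
    chunk_vectors,
    key=lambda name: cosine_similarity(query_vector, chunk_vectors[name]),
)
print(best)
```

A vector database performs the same kind of comparison, but over many stored vectors and with index structures that avoid comparing the query against every entry.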

Configure the embedding model

Chroma supports multiple embedding functions [11]. Here we use the default SentenceTransformers model, all-MiniLM-L6-v2, a fast, compact, general-purpose text embedding model.

# Initialize the embedding function using SentenceTransformer
embedding_model_name = "all-MiniLM-L6-v2"
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=embedding_model_name
)

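Create the vector database

Below, chroma_data_path is set to a local storage path, and Chroma writes its own files under that directory. After running the code, you will see a folder named chroma_db_storage in your working directory.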

chroma_data_path = "chroma_db_storage"  # Local path to store the ChromaDB database
collection_name = "rag_documents"  # Name of the collection

# Initialize a persistent ChromaDB client
# This will save data to the chroma_data_path directory
client = chromadb.PersistentClient(path=chroma_data_path)

# Get or create the collection with the specified embedding function
# The embedding function will be used to convert text to vectors
collection = client.get_or_create_collection(
    name=collection_name,
    embedding_function=sentence_transformer_ef
)
print(f"ChromaDB client initialized. Collection '{collection_name}' loaded/created.")
print(f"Number of items in collection before adding: {collection.count()}")

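Add documents

Next, we define a sample text, split it, and write it to the vector database. LangChain provides a large number of document loaders [12] for different data sources. This tutorial simply uses a single text string with several sections as the example.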

example_text = """
Marketing Strategy Overview - 2022:
- Objective: Increase brand awareness and customer engagement
- Target Audience: Segment A
- Key Channels: Social media, influencer partnerships, and content marketing
- Metrics for Success: Website traffic, social media engagement, and lead generation

Customer Segmentation Report - Q1 2023:
- Segment A: Loyal customers, low frequency buyers
- Segment B: Price-sensitive customers
- Segment C: Occasional buyers, with a focus on seasonal promotions
- Segment D: Loyal customers, engaged through our rewards program

Annual report XYZ - Q1 2023:
- Revenue: $200M
- EBIT: $50M
- Profit: $10M

Annual report XYZ - Q1 2022:
- Revenue: $180M
- EBIT: $30M
- Profit: $5M

Product specification - XYZ Corp:
- Product ABC - ID: 12345
- Product ABC - Name: Widget
- Product ABC - Description: A useful widget
- Product ABC - Price: $19.99
"""

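We use RecursiveCharacterTextSplitter [13] from LangChain, which splits long text hierarchically into suitably sized chunks (by default it tries paragraph breaks first, then line breaks, then spaces, then individual characters). We then write these chunks to the Chroma collection.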
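The splitting and writing step described above can be sketched as follows. This is one possible version, not necessarily the code used in the original post: the chunk_size and chunk_overlap values, the id scheme, and the helper names split_into_chunks and add_chunks_to_collection are all assumptions made for this sketch.

```python
def split_into_chunks(text, chunk_size=300, chunk_overlap=0):
    """Split text with LangChain's RecursiveCharacterTextSplitter."""
    # Imported inside the function so the helper below also works
    # in environments where LangChain is not installed
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,        # assumed value, tune for your documents
        chunk_overlap=chunk_overlap,  # assumed value
    )
    return splitter.split_text(text)


def add_chunks_to_collection(collection, chunks, id_prefix="chunk"):
    """Write chunks to a Chroma collection under deterministic ids."""
    ids = [f"{id_prefix}_{i}" for i in range(len(chunks))]
    # upsert overwrites entries that already have the same id,
    # so re-running the cell does not pile up duplicate chunks
    collection.upsert(documents=chunks, ids=ids)
    return ids


# Usage in the notebook, with example_text and collection defined earlier:
#   chunks = split_into_chunks(example_text)
#   add_chunks_to_collection(collection, chunks)
#   print(f"Number of items in collection after adding: {collection.count()}")
```

Chunk size is a trade-off: smaller chunks make retrieval more precise but may cut a section in half, while larger chunks keep sections together at the cost of passing more unrelated text to the model.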

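Query the vector database

To give the model suitable context, the first step is to find the chunks in the vector database that are semantically closest to the user's question, and then join them into a single context string for the model.

To simplify the rest of the workflow, we write one more small utility function that queries the Chroma collection and returns the joined context string: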

def query_chroma(query_text, n_results=2):
    """
    Queries the ChromaDB collection and returns the most relevant documents.
    """
    print(f"\n--- Querying ChromaDB ---")
    results = collection.query(query_texts=[query_text], n_results=n_results, include=['documents'])

    retrieved_docs_text = []
    if results and results.get('documents') and results['documents'][0]:
        for i, doc_text in enumerate(results['documents'][0]):
            print(f"\nRetrieved Chunk #{i+1}:")
            print(f"Text: {doc_text[:200]}...")  # Print snippet
            retrieved_docs_text.append(doc_text)
    else:
        print("No relevant documents found in ChromaDB for this query.")
    return "\n\n".join(retrieved_docs_text)  # Join retrieved docs into a single context string

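Tip:
n_results is a very important hyperparameter:

Too small, and necessary information may be missed;

Too large, and irrelevant content may be passed to the model, adding noise and making the answer less focused.

The right value usually has to be chosen through experiments on your specific use case.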
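One way to soften the choice of n_results is to retrieve a few more chunks than needed and then drop those that are too far from the query. The sketch below assumes the query was made with include=["documents", "distances"] and a single query text, so that the result holds one list of documents and one matching list of distances. The helper name filter_by_distance and the threshold value are hypothetical; a useful threshold depends on the embedding model and the collection's distance metric and has to be found by experiment.

```python
def filter_by_distance(results, max_distance):
    """Keep only documents whose distance to the query is at most max_distance.

    `results` is the dict returned by collection.query(...) when called with
    include=["documents", "distances"] and a single query text.
    """
    documents = results["documents"][0]
    distances = results["distances"][0]
    return [doc for doc, dist in zip(documents, distances) if dist <= max_distance]


# Hand-written stand-in for a query result, for illustration only
example_results = {
    "documents": [["Product ABC - ID: 12345", "Segment B: Price-sensitive customers"]],
    "distances": [[0.4, 1.6]],
}
print(filter_by_distance(example_results, max_distance=1.0))
```

With the stand-in data, only the first chunk passes the threshold, so the unrelated segment description would not be passed to the model.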

PART.07

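Build the RAG Pipeline

We now have:

query_vllm, a function that sends requests to vLLM;

query_chroma, a function that retrieves context from the Chroma vector database;

test_rag_scenario, a function that tests how the model performs given "context + question".

All that remains is to combine these pieces into a complete RAG pipeline:

Example: Query the product manufacturing ID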

# User Query
user_query = "What is the manufacturing ID of product ABC made by the fictional company XYZ Corp?"
true_answer = "12345"

# Retrieve context from Chroma
retrieved_context_for_llm = query_chroma(query_text=user_query, n_results=2)

# Test the RAG scenario
test_rag_scenario(context=retrieved_context_for_llm, query=user_query, answer=true_answer)

Because the model now receives context that contains the product specification, it can give the correct answer:

Output truncated for readability.....

--- Generated response ---
The manufacturing ID of product ABC made by the fictional company XYZ Corp is 12345.
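In this example the expected answer is only printed next to the response for visual comparison. When trying several questions or several n_results values, a simple automated check can help. The sketch below is a deliberately crude heuristic, and the helper name answer_contains is hypothetical: it only tests whether the expected string appears in the response and cannot judge paraphrased or partially correct answers.

```python
def answer_contains(response, expected):
    """Return True if the expected answer appears in the response, ignoring case."""
    return expected.strip().lower() in response.lower()


response = "The manufacturing ID of product ABC made by the fictional company XYZ Corp is 12345."
print(answer_contains(response, "12345"))
```

To use a check like this with the pipeline, test_rag_scenario would need to return the model's response in addition to printing it.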

PART.08

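Summary

In this blog post, we:

Set up and configured an LLM inference server with vLLM;

Created a vector database with Chroma, then split documents and wrote them to it;

Used retrieval augmentation (RAG) to let the model correctly answer questions about internal enterprise knowledge.

A pipeline like this can be extended to real-world use cases such as enterprise documents, FAQs, and reports. If you want to explore RAG systems with different styles or architectures, refer back to the AMD tutorials listed at the beginning of this article. For more details and advanced features, see each library's official documentation: the vLLM documentation [14], the Chroma documentation [15], and the LangChain website [16].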

PART.09

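Disclaimer

Third-party content mentioned in this article (including but not limited to linked documentation, code, and models) is licensed to you directly by the respective third-party rights holders, not by AMD. All linked third-party content is provided "AS IS" without warranty of any kind, express or implied. Your use of such third-party content is at your own discretion and risk. In no event shall AMD be liable for any damages arising from your use of third-party content.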

PART.10

References

[1] RAG with LlamaIndex: https://rocm.blogs.amd.com/artificial-intelligence/rag-llamaindex/README.html

[2] RAG with LangChain and FAISS: https://rocm.blogs.amd.com/artificial-intelligence/langchain-chatbot/README.html

[3] From Ingestion to Inference: RAG Pipelines on AMD GPUs: https://rocm.blogs.amd.com/artificial-intelligence/rag-agent/README.html

[4] ROCm RAG repository: https://github.com/ROCm/rocm-rag

[5] ROCm system requirements: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html

[6] ROCm installation guide: https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html

[7] Docker installation guide: https://docs.docker.com/get-started/get-docker/

[8] vLLM with ROCm usage guide: https://rocm.docs.amd.com/en/docs-6.4.3/how-to/rocm-for-ai/inference/benchmark-docker/vllm.html?model=pyt_vllm_qwen3-30b-a3b

[9] ROCm documentation: https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/benchmark-docker/vllm.html?model=pyt_vllm_qwen3-30b-a3b

[10] Chroma: https://docs.trychroma.com/docs/overview/introduction

[11] Embedding functions: https://docs.trychroma.com/docs/embeddings/embedding-functions

[12] Document loaders: https://python.langchain.com/docs/integrations/document_loaders/

[13] RecursiveCharacterTextSplitter: https://python.langchain.com/api_reference/text_splitters/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html

[14] vLLM documentation: https://docs.vllm.ai/en/latest/

[15] Chroma documentation: https://docs.trychroma.com/docs/overview/getting-started

[16] LangChain website: https://www.langchain.com/

[17] Original post, Retrieval Augmented Generation (RAG) with vLLM, LangChain and Chroma: https://www.amd.com/en/developer/resources/technical-articles.html#sortCriteria=@amd_release_date%20descending&f-amd_blog_hardware_platforms=Instinct%20GPUs,Radeon%20Graphics&f-amd_blog_development_tools=ROCm%20Software
