

Data Analysis in Literary Works: When Classic Novels Meet Modern Algorithms

数据分析艺术

2025-11-26

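At the intersection of traditional literary scholarship and modern data science, a new way of reading is emerging: we no longer experience a story only through its words, we can also use data to "decode literary structure". Character relationships, linguistic style, plot design, and even military strategy in the classic novels can all be seen from a fresh angle through data analysis. This article takes four classics, Dream of the Red Chamber (《红楼梦》), Journey to the West (《西游记》), Water Margin (《水浒传》), and Romance of the Three Kingdoms (《三国演义》), and shows how statistics, machine learning, and text analysis can be applied to literary study, with code and an interpretation of the results for each.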

I. Word-frequency analysis of Dream of the Red Chamber (《红楼梦》): exploring authorial style and textual features

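Data and methods

Using the full text of Dream of the Red Chamber, we apply Chinese word segmentation and word-frequency statistics to compare the language of the first eighty chapters with that of the last forty. The method consists of:

- Text preprocessing: segment the text with the jieba library and remove stop words

- Feature extraction: compute TF-IDF weights to identify the key distinguishing words

- Style comparison: analyze differences in authorial style from high-frequency words and derived-word (affixation) patterns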
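The TF-IDF step in the list above can be sketched without extra dependencies. The snippet below is a minimal illustration, not the article's actual pipeline: it assumes each part of the novel has already been segmented (for example with `jieba.lcut`), uses the smoothed IDF `ln((1 + N) / (1 + df)) + 1` so that words shared by both parts keep a nonzero weight, and runs on hypothetical toy token lists (`first_80`, `last_40`) rather than the real text.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            # term frequency * smoothed inverse document frequency
            term: (count / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy tokens standing in for the segmented text of each part
first_80 = ["宝玉", "丫头", "宝玉", "罢了", "姑娘"]
last_40 = ["宝玉", "姑娘", "因此", "因此", "那里"]

w_first, w_last = tfidf([first_80, last_40])
for label, w in (("first 80", w_first), ("last 40", w_last)):
    ranked = sorted(w.items(), key=lambda kv: kv[1], reverse=True)
    print(label, [(term, round(score, 3)) for term, score in ranked])
```

Among words of equal frequency, those that occur in only one part receive a higher weight than those shared by both, which is what makes them candidates for distinguishing words.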

Code implementation (Python)

    from collections import Counter

    import jieba
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    # 1. Load the text
    with open('hongloumeng.txt', 'r', encoding='utf-8') as f:
        text = f.read()

    # 2. Segment the Chinese text into words
    words = jieba.lcut(text)

    # 3. Filter stop words and single-character tokens
    stopwords = {'的', '了', '在', '是', '我', '有', '和', '就'}
    filtered_words = [w for w in words if w not in stopwords and len(w) > 1]

    # 4. Count the 20 most frequent words
    top_words = Counter(filtered_words).most_common(20)

    # 5. Generate a word cloud (font_path must point to a font with CJK glyphs)
    wordcloud = WordCloud(font_path='SimHei.ttf', background_color='white').generate(' '.join(filtered_words))
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Dream of the Red Chamber: word cloud')
    plt.show()

    # 6. Print the high-frequency words
    print("Top 20 words in Dream of the Red Chamber:")
    for word, freq in top_words:
        print(f"{word}: {freq}")

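Results and findings

According to research on derived words in Dream of the Red Chamber, the first eighty chapters contain 898 suffix-derived words, of which 414 take the suffix "子" (-zi) and 403 take the suffix "儿" (-er), together the overwhelming majority. These suffix patterns can serve as a fingerprint of authorial style:

- The first eighty chapters use erhua (the "儿" suffix) more frequently, keeping the language closer to the spoken tradition

- The last forty chapters show different habits in the use of "子"-suffixed words, which hints at a possible change of author

- Shannon entropy can be used to quantify textual complexity; by this measure, the linguistic inventiveness of Cao Xueqin's original is markedly higher than that of the continuation.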
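The two measurements mentioned above, suffix counts and Shannon entropy, are both simple to compute once the text is segmented. The following is a rough sketch under stated assumptions: `count_suffix_words` is a hypothetical helper that treats any multi-character token ending in "子" or "儿" as suffix-derived (a real study would filter out words such as 儿子, where the final character is not a suffix), and `shannon_entropy` computes H = -Σ p(w) log2 p(w) over the empirical word distribution. The token list is a toy example, not data from the novel.

```python
import math
from collections import Counter


def count_suffix_words(tokens, suffixes=("子", "儿")):
    """Count multi-character tokens by their final character (crude suffix heuristic)."""
    counts = Counter()
    for tok in tokens:
        if len(tok) > 1 and tok[-1] in suffixes:
            counts[tok[-1]] += 1
    return counts


def shannon_entropy(tokens):
    """Entropy, in bits, of the empirical word distribution."""
    total = len(tokens)
    freq = Counter(tokens)
    return -sum((n / total) * math.log2(n / total) for n in freq.values())


# Hypothetical segmented tokens
tokens = ["屋子", "今儿", "帘子", "宝玉", "今儿", "明儿"]
suffix_counts = count_suffix_words(tokens)
entropy_bits = shannon_entropy(tokens)
print(dict(suffix_counts), round(entropy_bits, 3))
```

Running both functions separately on the first eighty and the last forty chapters would give directly comparable numbers; entropy grows with sample size, so the two parts should be compared on samples of equal length.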

II. Clustering the demons of Journey to the West (《西游记》): a taxonomy of the world of gods and demons

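Data and methods

We collect information on the 76 major demons that appear in Journey to the West, including:

- Prototype: animal, plant, inanimate object, humanoid

- Behavior: eating humans, stealing treasures, seeking immortality, blocking the road

- Social relations: mount of an immortal, self-cultivated in the wild, descended from the heavenly realm

The K-means clustering algorithm then groups the demons automatically by feature similarity.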
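K-means relies on Euclidean distance, so the way categorical features are encoded matters. If the prototype is stored as a single integer code (1 = animal, 2 = plant, 3 = inanimate, 4 = humanoid), an animal demon ends up three times as far from a humanoid demon as from a plant demon, an ordering the categories do not actually have. One common remedy, shown here as a sketch rather than as the article's method, is one-hot encoding; the helper name `one_hot` and the `CATEGORIES` list are illustrative assumptions.

```python
import math

CATEGORIES = ["animal", "plant", "inanimate", "humanoid"]


def one_hot(category):
    """Return a 0/1 vector with a single 1 at the category's position."""
    return [1.0 if c == category else 0.0 for c in CATEGORIES]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Integer codes impose an artificial ordering on the categories
code = {c: float(i + 1) for i, c in enumerate(CATEGORIES)}
d_code_plant = abs(code["animal"] - code["plant"])        # 1.0
d_code_humanoid = abs(code["animal"] - code["humanoid"])  # 3.0

# One-hot vectors put every pair of distinct categories at the same distance
d_hot_plant = euclidean(one_hot("animal"), one_hot("plant"))
d_hot_humanoid = euclidean(one_hot("animal"), one_hot("humanoid"))

print(d_code_plant, d_code_humanoid, d_hot_plant, d_hot_humanoid)
```

When one-hot columns are mixed with continuous scores in the 0 to 1 range, it is also worth checking that no single group of columns dominates the distance.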

Code implementation (Python)

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # 1. Build the demon dataset (six demons shown here)
    monster_data = pd.DataFrame({
        'name': ['牛魔王', '白骨精', '金角大王', '蜘蛛精', '黄袍怪', '铁扇公主'],
        'type': [1, 3, 1, 2, 1, 4],  # 1: animal, 2: plant, 3: inanimate, 4: humanoid
        'behavior_attack': [0.9, 0.8, 0.7, 0.6, 0.5, 0.3],    # aggressiveness
        'behavior_treasure': [0.2, 0.1, 0.9, 0.1, 0.2, 0.1],  # pursuit of treasures
        'social_connection': [0.8, 0.1, 1.0, 0.2, 0.7, 0.9],  # social connections
    })

    # 2. Feature matrix
    features = monster_data[['type', 'behavior_attack', 'behavior_treasure', 'social_connection']]

    # 3. K-means clustering
    kmeans = KMeans(n_clusters=3, random_state=42)
    clusters = kmeans.fit_predict(features)
    monster_data['cluster'] = clusters

    # 4. Reduce to two dimensions with PCA for plotting
    pca = PCA(n_components=2)
    features_2d = pca.fit_transform(features)

    # 5. Plot the clusters
    plt.figure(figsize=(10, 6))
    colors = ['red', 'blue', 'green']
    for i in range(3):
        cluster_points = features_2d[clusters == i]
        plt.scatter(cluster_points[:, 0], cluster_points[:, 1], c=colors[i], label=f'Cluster {i}')
    plt.title('Journey to the West: demon clustering')
    plt.xlabel('Principal component 1')
    plt.ylabel('Principal component 2')
    plt.legend()
    plt.grid(True)
    plt.show()

    # 6. Print the cluster assignments
    print("Demon clustering result:")
    for _, row in monster_data.iterrows():
        print(f"{row['name']}: cluster {row['cluster']}")

Output:

    Demon clustering result:
    牛魔王: cluster 0
    白骨精: cluster 2
    金角大王: cluster 0
    蜘蛛精: cluster 2
    黄袍怪: cluster 0
    铁扇公主: cluster 1

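Results and findings

The cluster analysis sorts the demons of Journey to the West into three types:

1. "Divine-lineage demons": such as 金角大王 (Golden Horn King) and 铁扇公主 (Princess Iron Fan), closely tied to the upper social strata and mostly in possession of magic treasures

2. "Wild self-cultivated demons": such as 牛魔王 (Bull Demon King) and 黄袍怪 (Yellow Robe Demon), who became demons through their own cultivation and are more aggressive

3. "Obsession-type demons": such as 白骨精 (White Bone Demon), formed from resentment or obsession, with a single behavior pattern

The six-demon demo output above does not reproduce this grouping exactly: it places 金角大王 in the same cluster as 牛魔王 and 黄袍怪, and 铁扇公主 in a cluster of her own.

This classification can be read as a metaphor for Ming-dynasty social structure, "the system of gods and demons mirrors the power structure of the human world": wild demons stand for forces outside the establishment, while divine-lineage demons symbolize networks of power.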

III. Skill profiles of the Water Margin (《水浒传》) heroes: quantifying heroic ability

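Data and methods

We extract skill features for the 108 heroes of Liangshan, including:

- "Martial arts type": unarmed combat, archery, mounted combat, water skills

- "Area of expertise": strategy, leadership, technical craft, intelligence gathering

- "Fighting style": strength-based, technique-based, tactics-based

Cosine similarity measures how alike two characters' abilities are, and network analysis reveals the community structure among the heroes.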
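Cosine similarity compares the direction of two skill vectors and ignores their overall magnitude: cos(a, b) = a·b / (‖a‖‖b‖). As a small worked sketch in plain Python, the function below is applied to two pairs of skill vectors in the order (unarmed, archery, mounted, water, strategy). The scores are the illustrative values used in this section's code implementation, not measurements from the novel.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Skill vectors: (unarmed, archery, mounted, water, strategy)
lin_chong = [0.9, 0.7, 0.9, 0.2, 0.6]
li_kui = [0.8, 0.3, 0.7, 0.1, 0.2]
wu_yong = [0.1, 0.2, 0.1, 0.0, 1.0]
zhang_shun = [0.6, 0.4, 0.3, 1.0, 0.3]

sim_fighters = cosine_similarity(lin_chong, li_kui)  # about 0.95
sim_mixed = cosine_similarity(wu_yong, zhang_shun)   # about 0.35
print(round(sim_fighters, 2), round(sim_mixed, 2))
```

With the 0.7 threshold used in this section, the first pair would be linked in the network and the second would not.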

Code implementation (Python)

    import matplotlib.pyplot as plt
    import networkx as nx
    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity

    # 1. Build the character skill matrix (six heroes shown here)
    skills_data = {
        'name': ['林冲', '李逵', '张顺', '宋江', '吴用', '阮小二'],
        'fist': [0.9, 0.8, 0.6, 0.4, 0.1, 0.5],      # unarmed combat
        'arrow': [0.7, 0.3, 0.4, 0.5, 0.2, 0.3],     # archery
        'horse': [0.9, 0.7, 0.3, 0.6, 0.1, 0.2],     # mounted combat
        'water': [0.2, 0.1, 1.0, 0.1, 0.0, 0.9],     # water skills
        'strategy': [0.6, 0.2, 0.3, 0.8, 1.0, 0.4],  # strategy
    }
    df = pd.DataFrame(skills_data)
    skill_cols = ['fist', 'arrow', 'horse', 'water', 'strategy']
    skills_matrix = df[skill_cols]

    # 2. Compute pairwise similarity between characters
    similarity = cosine_similarity(skills_matrix)
    similarity_df = pd.DataFrame(similarity, index=df['name'], columns=df['name'])

    # 3. Build the network: link pairs whose similarity exceeds the threshold
    G = nx.Graph()
    G.add_nodes_from(df['name'])
    for i in range(len(df)):
        for j in range(i + 1, len(df)):
            if similarity[i][j] > 0.7:  # similarity threshold
                G.add_edge(df['name'][i], df['name'][j], weight=similarity[i][j])

    # 4. Draw the network
    plt.rcParams['font.sans-serif'] = ['SimHei']  # font with CJK glyphs for the node labels
    plt.rcParams['axes.unicode_minus'] = False    # keep the minus sign from rendering as a box
    pos = nx.spring_layout(G)
    nx.draw_networkx_nodes(G, pos, node_color='lightblue', node_size=500)
    nx.draw_networkx_edges(G, pos, edge_color='gray')
    nx.draw_networkx_labels(G, pos, font_family='SimHei')
    plt.title('Water Margin: skill-similarity network')
    plt.axis('off')
    plt.show()

    # 5. Print each character's strongest skill
    print("Character skill profile:")
    for _, row in df.iterrows():
        scores = row[skill_cols].astype(float)  # numeric skill columns only
        main_skill = scores.idxmax()
        print(f"{row['name']}: core skill-{main_skill}, strength-{scores[main_skill]:.2f}")

Output:

    Character skill profile:
    林冲: core skill-fist, strength-0.90
    李逵: core skill-fist, strength-0.80
    张顺: core skill-water, strength-1.00
    宋江: core skill-strategy, strength-0.80
    吴用: core skill-strategy, strength-1.00
    阮小二: core skill-water, strength-0.90

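Results and findings

The skill analysis of Water Margin shows:

- The "water-skills specialists" (张顺 Zhang Shun, the three Ruan brothers, and others) form a tight sub-network, reflecting how the geography of the Liangshan marsh shaped the heroes' skill set

- "Multi-skilled characters" (such as 林冲 Lin Chong and 宋江 Song Jiang) sit at the center of the network, consistent with their leadership status

- Lin Chong's "ruthlessness" (狠) and "forbearance" (忍) form the two dimensions of his character, and data analysis can quantify this complexity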
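Whether a character sits "at the center" of the similarity network can be checked with a simple degree count: the number of other characters whose similarity to them exceeds the threshold. The sketch below recomputes this in plain Python for the six illustrative skill vectors from this section's code (order: unarmed, archery, mounted, water, strategy) with the same 0.7 threshold. Degree is only one of several possible centrality measures and is chosen here, as an assumption, for simplicity.

```python
import math
from itertools import combinations

# Illustrative skill vectors: (unarmed, archery, mounted, water, strategy)
skills = {
    "林冲": [0.9, 0.7, 0.9, 0.2, 0.6],
    "李逵": [0.8, 0.3, 0.7, 0.1, 0.2],
    "张顺": [0.6, 0.4, 0.3, 1.0, 0.3],
    "宋江": [0.4, 0.5, 0.6, 0.1, 0.8],
    "吴用": [0.1, 0.2, 0.1, 0.0, 1.0],
    "阮小二": [0.5, 0.3, 0.2, 0.9, 0.4],
}


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


THRESHOLD = 0.7
degree = {name: 0 for name in skills}
for u, v in combinations(skills, 2):
    # An edge exists when the pair's similarity exceeds the threshold
    if cosine(skills[u], skills[v]) > THRESHOLD:
        degree[u] += 1
        degree[v] += 1

for name, d in sorted(degree.items(), key=lambda kv: -kv[1]):
    print(name, d)
```

On these six vectors 林冲 and 宋江 have the highest degree, with three links each, which is in line with the claim above. A six-node example is too small to say anything about all 108 heroes.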

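IV. Decision-tree analysis of military strategy in Romance of the Three Kingdoms (《三国演义》): an algorithmic reconstruction of stratagem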

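Data and methods

We collect data on the classic battles in Romance of the Three Kingdoms, including:

- "Battlefield situation": relative troop strength, terrain advantage, supply status

- "Decision factors": timing, orthodox versus unorthodox tactics (奇正), feint versus substance (虚实), psychology

- "Stratagems": fire attack, ambush, feigned surrender, surprise assault

A decision-tree algorithm builds a model of strategic decisions that retraces the reasoning of the ancient strategists.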

Code implementation (Python)

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

    # 1. Build the battle dataset
    battle_data = {
        'force_ratio': [0.3, 0.8, 0.5, 0.7, 0.2, 0.9, 0.4, 0.6],  # troop ratio (own / enemy)
        'terrain_advantage': [1, 0, 1, 0, 1, 0, 1, 0],            # terrain advantage (1: yes, 0: no)
        'supply_superiority': [0, 1, 0, 1, 0, 1, 0, 1],           # supply advantage (1: yes, 0: no)
        'strategy': ['fire', 'direct', 'ambush', 'direct', 'fire', 'direct', 'ambush', 'surprise'],  # stratagem
    }
    df_battle = pd.DataFrame(battle_data)

    # 2. Prepare features and labels
    X = df_battle[['force_ratio', 'terrain_advantage', 'supply_superiority']]
    y = df_battle['strategy']

    # 3. Train the decision tree
    dt_classifier = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=42)
    dt_classifier.fit(X, y)

    # 4. Visualize the tree
    feature_labels = ['force ratio', 'terrain advantage', 'supply advantage']
    plt.figure(figsize=(12, 8))
    plot_tree(dt_classifier,
              feature_names=feature_labels,
              class_names=list(dt_classifier.classes_),  # pass a plain list of class names
              filled=True,
              rounded=True,
              fontsize=10)
    plt.title('Romance of the Three Kingdoms: stratagem decision tree')
    plt.show()

    # 5. Print the decision rules
    rules = export_text(dt_classifier, feature_names=feature_labels)
    print("Decision rules:")
    print(rules)

    # 6. Predict the stratagem for a new battle
    new_battle = pd.DataFrame({
        'force_ratio': [0.4],
        'terrain_advantage': [1],
        'supply_superiority': [0],
    })
    prediction = dt_classifier.predict(new_battle)
    print(f"Predicted stratagem: {prediction[0]}")

Output:

    Decision rules:
    |--- terrain advantage <= 0.50
    |   |--- force ratio <= 0.65
    |   |   |--- class: surprise
    |   |--- force ratio >  0.65
    |   |   |--- class: direct
    |--- terrain advantage >  0.50
    |   |--- force ratio <= 0.35
    |   |   |--- class: fire
    |   |--- force ratio >  0.35
    |   |   |--- class: ambush

    Predicted stratagem: ambush
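With `criterion='entropy'`, the tree chooses at each node the split with the largest information gain, that is, the largest drop in Shannon entropy of the class labels. As a hand check on the root split shown in the rules above, the sketch below recomputes the gain of splitting the eight illustrative battles on terrain advantage. The helper names `entropy` and `information_gain` are assumptions made for this example; scikit-learn's own implementation differs in detail, since it searches candidate thresholds across all features.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(labels, mask):
    """Entropy reduction from splitting `labels` into two groups by a boolean mask."""
    left = [y for y, m in zip(labels, mask) if m]
    right = [y for y, m in zip(labels, mask) if not m]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted


# The eight illustrative battles from the code above
strategy = ['fire', 'direct', 'ambush', 'direct', 'fire', 'direct', 'ambush', 'surprise']
terrain = [1, 0, 1, 0, 1, 0, 1, 0]

gain_terrain = information_gain(strategy, [t > 0.5 for t in terrain])
print(f"information gain of the terrain split: {gain_terrain:.3f} bits")
```

The split removes one bit of uncertainty: it separates the fire and ambush battles from the direct and surprise ones, after which the force ratio resolves the rest. Because supply advantage is the exact complement of terrain advantage in this toy dataset, a split on it would yield the same gain.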