Unbelievable: Ruozhiba makes it into a serious AI paper, and turns out to be the best Chinese training data?

Ruozhiba, which literally translates to “Idiot Sub-forum”, is a bizarre corner of the Chinese internet. This sub-forum on the Reddit-like Baidu Tieba is filled with absurd, pun-filled, logically challenging threads that will twist your brain into a pretzel.


The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning.
Existing Chinese instruction-tuning datasets have inherent problems: they do not align with natural Chinese communication patterns, lack genuine Chinese linguistic data, contain numerous problematic data points, and are small in scale.
The research team therefore collected a high-quality, human-written corpus from various sources on the Chinese internet, including Q&A communities, wikis, and examinations.
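Specifically, large models trained on Ruozhiba data outscored those trained on data from encyclopedia sites, Zhihu, Douban, Xiaohongshu, and other platforms, and even the dataset carefully curated by the research team. They achieved the highest scores across 8 tests, including question answering, brainstorming, classification, generation, summarization, and extraction.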
This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset.
Ruozhiba ranks second on average across all subsets. We conjecture this is because it may enhance the model’s logical reasoning ability, thereby benefiting most of the instruction-following tasks.



