
Yao Shunyu: AI’s First Half Is Over, and the Real Game Is Only Now Beginning

知更小筑 · 2026-01-21
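Lead: AI has entered its “second half”: the focus is no longer “can AI solve the problem?” but “which problems should we have AI solve, and how do we measure progress that is genuinely useful?”

For the past few decades, the “first half” of artificial intelligence was mostly a contest over who could train stronger models and post higher leaderboard scores: from AlphaGo to GPT-4, from the SAT and the bar exam to the IMO and IOI, AI kept surpassing humans on exams and competitions. Yao Shunyu argues that this phase is drawing to a close.

What is really changing the rules now is a “general recipe”: large-scale language pre-training + reasoning + reinforcement learning. With this recipe, AI no longer just optimizes for a single task; it starts to generalize across domains. The result: climbing leaderboards keeps getting easier, but real-world value has not grown in step.

So AI has entered its “second half”: the focus is no longer “can AI solve the problem?” but “which problems should we have AI solve, and how do we measure progress that is genuinely useful?” The core of the future is not harder exams and benchmarks but evaluation that is closer to the real world: collaborating with people, long-term memory, continual improvement, and feedback from real users.

In the second half, the winners are not only the people who build models but also those who can turn intelligence into “products” and “value.” Welcome to AI’s second era.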




The original text follows:

The Second Half

tldr: We’re at AI’s halftime.

For decades, AI has largely been about developing new training methods and models. And it worked: from beating world champions at chess and Go, surpassing most humans on the SAT and bar exams, to earning IMO and IOI gold medals. Behind these milestones in the history books (Deep Blue, AlphaGo, GPT-4, and the o-series) are fundamental innovations in AI methods: search, deep RL, scaling, and reasoning. Things just get better over time.

So what’s suddenly different now?

In three words: RL finally works. More precisely: RL finally generalizes. After several major detours and a culmination of milestones, we’ve landed on a working recipe to solve a wide range of RL tasks using language and reasoning. Even a year ago, if you told most AI researchers that a single recipe could tackle software engineering, creative writing, IMO-level math, mouse-and-keyboard manipulation, and long-form question answering — they’d laugh at your hallucinations. Each of these tasks is incredibly difficult and many researchers spend their entire PhDs focused on just one narrow slice.

Yet it happened.

So what comes next? The second half of AI — starting now — will shift focus from solving problems to defining problems. In this new era, evaluation becomes more important than training. Instead of just asking, “Can we train a model to solve X?”, we’re asking, “What should we be training AI to do, and how do we measure real progress?” To thrive in this second half, we’ll need a timely shift in mindset and skill set, ones perhaps closer to a product manager.

The first half

To make sense of the first half, look at its winners. What do you consider to be the most impactful AI papers so far?

I tried the quiz in Stanford 224N, and the answers were not surprising: Transformer, AlexNet, GPT-3, etc. What’s common about these papers? They propose some fundamental breakthroughs to train better models. But also, they managed to publish their papers by showing some (significant) improvements on some benchmarks.

There is a latent commonality though: these “winners” are all training methods or models, not benchmarks or tasks. Even ImageNet, arguably the most impactful benchmark of all, has fewer than one third of the citations of AlexNet. The contrast between method and benchmark is even more drastic elsewhere: for example, the main benchmark of Transformer is WMT’14, whose workshop report has ~1,300 citations, while Transformer has >160,000.

That illustrates the game of the first half: the focus is on building new models and methods, and evaluation and benchmarks are secondary (although necessary to make the paper system work).

Why? A big reason is that, in the first half of AI, methods were harder and more exciting than tasks. Creating a new algorithm or model architecture from scratch – think of breakthroughs like the backpropagation algorithm, convolutional networks (AlexNet), or the Transformer used in GPT-3 – required remarkable insight and engineering. In contrast, defining tasks for AI often felt more straightforward: we simply took tasks humans already do (like translation, image recognition, or chess) and turned them into benchmarks. Not much insight or even engineering.

Methods also tended to be more general and widely applicable than individual tasks, making them especially valuable. For example, the Transformer architecture ended up powering progress in CV, NLP, RL, and many other domains – far beyond the single dataset (WMT’14 translation) where it first proved itself. A great new method can hillclimb many different benchmarks because it’s simple and general, thus the impact tends to go beyond an individual task.

This game has worked for decades and sparked world-changing ideas and breakthroughs, which manifested themselves in ever-increasing benchmark performance across domains. Why would the game change at all? Because the accumulation of these ideas and breakthroughs has made a qualitative difference: it has produced a working recipe for solving tasks.

The recipe

What’s the recipe? Its ingredients, not surprisingly, include massive language pre-training, scale (in data and compute), and the idea of reasoning and acting. These might sound like buzzwords that you hear daily in SF, but why call them a recipe?

We can understand this by looking through the lens of reinforcement learning (RL), which is often thought of as the “end game” of AI — after all, RL is theoretically guaranteed to win games, and empirically it’s hard to imagine any superhuman systems (e.g. AlphaGo) without RL.

In RL, there are three key components: algorithm, environment, and priors. For a long time, RL researchers focused mostly on the algorithm (e.g. REINFORCE, DQN, TD-learning, actor-critic, PPO, TRPO…) – the intellectual core of how an agent learns – while treating the environment and priors as fixed or minimal. For example, Sutton and Barto’s classical textbook is all about algorithms and almost nothing about environments or priors.

However, in the era of deep RL, it became clear that environments matter a lot empirically: an algorithm’s performance is often highly specific to the environment it was developed and tested in. If you ignore the environment, you risk building an “optimal” algorithm that only excels in toy settings. So why don’t we first figure out the environment we actually want to solve, then find the algorithm best suited for it?

That was exactly OpenAI’s initial plan. It built gym, a standard RL environment for various games, then the World of Bits and Universe projects, trying to turn the Internet or computer into a game. A good plan, isn’t it? Once we turn all digital worlds into an environment and solve it with smart RL algorithms, we have digital AGI.

A good plan, but not one that entirely worked. OpenAI made tremendous progress down the path, using RL to solve Dota, robotic hands, etc. But it never came close to solving computer use or web navigation, and the RL agents working in one domain did not transfer to another. Something was missing.

Only after GPT-2 or GPT-3 did it turn out that the missing piece was priors. You need powerful language pre-training to distill general commonsense and language knowledge into models, which then can be fine-tuned to become web (WebGPT) or chat (ChatGPT) agents (and change the world). It turned out the most important part of RL might not even be the RL algorithm or environment, but the priors, which can be obtained in a way totally unrelated to RL.

Language pre-training created good priors for chatting, but not equally good priors for controlling computers or playing video games. Why? These domains are further from the distribution of Internet text, and naively doing SFT / RL on these domains generalizes poorly. I noticed the problem in 2019, when GPT-2 had just come out and I did SFT / RL on top of it to solve text-based games - CALM was the first agent in the world built via pre-trained language models. But it took millions of RL steps for the agent to hillclimb a single game, and it didn’t transfer to new games. Though that’s exactly the characteristic of RL and nothing strange to RL researchers, I found it weird because we humans can easily play a new game and be significantly better zero-shot. Then I hit one of the first eureka moments in my life - we generalize because we can choose to do more than “go to cabinet 2” or “open chest 3 with key 1” or “kill dungeon with sword”; we can also choose to think about things like “The dungeon is dangerous and I need a weapon to fight with it. There is no visible weapon so maybe I need to find one in locked boxes or chests. Chest 3 is in Cabinet 2, let me first go there and unlock it”.

Thinking, or reasoning, is a strange kind of action - it does not directly affect the external world, yet the space of reasoning is open-ended and combinatorially infinite: you can think about a word, a sentence, a whole passage, or 10000 random English words, but the world around you doesn’t immediately change. In classical RL theory, it is a terrible deal and makes decision-making impossible. Imagine you need to choose one out of two boxes, and there’s only one box with $1M and the other one is empty. You’re expected to earn $500k. Now imagine I add infinite empty boxes. You’re expected to earn nothing. But by adding reasoning into the action space of any RL environment, we make use of the language pre-training priors to generalize, and we can afford flexible test-time compute for different decisions. It is a really magical thing and I apologize for not fully making sense of it here; I might need to write another blog post just for it. You’re welcome to read ReAct for the original story of reasoning for agents and read my vibes at the time. For now, my intuitive explanation is: even though you add infinite empty boxes, you have seen them throughout your life in all kinds of games, and choosing these boxes prepares you to better choose the box with money for any given game. My abstract explanation would be: language generalizes through reasoning in agents.
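The box thought experiment is plain arithmetic, and a small sketch makes the limit visible. It assumes one box is picked uniformly at random and exactly one box holds the prize; the function name `expected_payoff` is made up for this illustration and does not come from the original post.

```python
def expected_payoff(num_boxes: int, prize: float = 1_000_000.0) -> float:
    """Expected winnings when one box is picked uniformly at random
    and exactly one of the boxes holds the prize (assumed setup)."""
    if num_boxes < 1:
        raise ValueError("need at least one box")
    return prize / num_boxes


if __name__ == "__main__":
    # Adding empty boxes drives the expectation toward zero.
    for n in (2, 10, 1_000, 1_000_000):
        print(n, expected_payoff(n))
```

With two boxes the expectation is $500k, as in the text; every added empty box shrinks it, which is why an unbounded action space looks hopeless without priors that make the choice non-uniform.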

Once we have the right RL priors (language pre-training) and RL environment (adding language reasoning as actions), it turns out the RL algorithm might be the most trivial part. Thus we have the o-series, R1, deep research, computer-using agent, and so much more to come. What an ironic turn of events! For so long RL researchers cared about algorithms way more than environments, and no one paid any attention to priors: all RL experiments essentially started from scratch. But it took us decades of detours to realize maybe our prioritization should have been completely reversed.
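One way to picture “adding language reasoning as actions” is a thin wrapper around an environment. The sketch below is only an illustration under assumptions: `ToyTextEnv`, `ReasoningEnv`, the `think:` prefix, and the placeholder observation `"OK."` are all hypothetical names chosen here, not the interface of any real system mentioned in the post.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyTextEnv:
    """Hypothetical text game: opening chest 3 ends the episode with reward 1."""
    done: bool = False

    def step(self, action: str) -> tuple[str, float, bool]:
        if action == "open chest 3":
            self.done = True
            return "You found a sword.", 1.0, True
        return "Nothing happens.", 0.0, False


@dataclass
class ReasoningEnv:
    """Wraps an environment so that 'think: ...' becomes a legal action.

    A thought is never forwarded to the wrapped environment; it is only
    appended to the agent's context, so the external world stays unchanged.
    """
    env: ToyTextEnv
    context: list[str] = field(default_factory=list)

    def step(self, action: str) -> tuple[str, float, bool]:
        self.context.append(action)
        if action.startswith("think:"):
            # Assumed convention: a thought yields a placeholder observation.
            return "OK.", 0.0, False
        obs, reward, done = self.env.step(action)
        self.context.append(obs)
        return obs, reward, done


if __name__ == "__main__":
    env = ReasoningEnv(ToyTextEnv())
    print(env.step("think: the dungeon is dangerous, I need a weapon"))
    print(env.step("open chest 3"))
```

The design point is that the thought costs nothing in the external world but changes what the agent conditions on next, which is where a pre-trained language prior can help.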

But just like Steve Jobs said: You can’t connect the dots looking forward; you can only connect them looking backward.

The second half

This recipe is completely changing the game. To recap the game of the first half:

  • We develop novel training methods or models that hillclimb benchmarks.
  • We create harder benchmarks and continue the loop.

This game is being ruined because:

  • The recipe has essentially standardized and industrialized benchmark hillclimbing without requiring many new ideas. As the recipe scales and generalizes well, your novel method for a particular task might improve it by 5%, while the next o-series model improves it by 30% without explicitly targeting it.
  • Even if we create harder benchmarks, pretty soon (and increasingly soon) they get solved by the recipe. My colleague Jason Wei made a beautiful figure to visualize the trend well:

Then what’s left to play in the second half? If novel methods are no longer needed and harder benchmarks will just get solved increasingly soon, what should we do?

I think we should fundamentally re-think evaluation. It means not just to create new and harder benchmarks, but to fundamentally question existing evaluation setups and create new ones, so that we are forced to invent new methods beyond the working recipe. It is hard because humans have inertia and seldom question basic assumptions - you just take them for granted without realizing they are assumptions, not laws.

To explain inertia, suppose you invented one of the most successful evals in history based on human exams. It was an extremely bold idea in 2021, but 3 years later it’s saturated. What would you do? Most likely create a much harder exam. Or suppose you solved simple coding tasks. What would you do? Most likely find harder coding tasks to solve until you have reached IOI gold level.

Inertia is natural, but here is the problem. AI has beaten world champions at chess and Go, surpassed most humans on the SAT and bar exams, and reached gold medal level on IOI and IMO. But the world hasn’t changed much, at least judged by economics and GDP.

I call this the utility problem, and deem it the most important problem for AI.

Perhaps we will solve the utility problem pretty soon, perhaps not. Either way, the root cause of this problem might be deceptively simple: our evaluation setups are different from real-world setups in many basic ways. To name two examples:

  • Evaluation “should” run automatically, so typically an agent receives a task input, does things autonomously, then receives a task reward. But in reality, an agent has to engage with a human throughout the task: you don’t just text customer service a super long message, wait for 10 minutes, then expect a detailed response to settle everything. By questioning this setup, new benchmarks are invented to put either real humans (e.g. Chatbot Arena) or user simulation (e.g. tau-bench) in the loop.
  • Evaluation “should” run i.i.d. If you have a test set with 500 tasks, you run each task independently, average the task metrics, and get an overall metric. But in reality, you solve tasks sequentially rather than in parallel. A Google SWE solves google3 issues increasingly better as she gets more familiar with the repo, but a SWE agent solves many issues in the same repo without gaining such familiarity. We obviously need long-term memory methods (and they exist), but academia does not have the proper benchmarks to justify the need, or even the proper courage to question the i.i.d. assumption that has been the foundation of machine learning.
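The i.i.d. example above can be made concrete with a toy comparison. This is a sketch under stated assumptions, not a real benchmark harness: `evaluate_iid`, `evaluate_sequential`, and `toy_agent` are hypothetical names, and the agent’s scoring rule (more notes in memory, higher score) is invented purely to show how the two setups diverge.

```python
from statistics import mean
from typing import Callable

Task = str
Memory = list  # notes an agent may carry from one task to the next
Agent = Callable[[Task, Memory], float]  # returns a score in [0, 1]


def evaluate_iid(agent: Agent, tasks: list) -> float:
    """Each task gets a fresh, empty memory; per-task scores are averaged."""
    return mean(agent(task, []) for task in tasks)


def evaluate_sequential(agent: Agent, tasks: list) -> float:
    """Tasks run in order and share one memory, so earlier tasks can help later ones."""
    memory: Memory = []
    return mean(agent(task, memory) for task in tasks)


def toy_agent(task: Task, memory: Memory) -> float:
    """Hypothetical agent whose score grows with accumulated notes, capped at 1."""
    score = min(1.0, 0.5 + 0.1 * len(memory))
    memory.append(f"notes on {task}")
    return score


if __name__ == "__main__":
    tasks = [f"issue-{i}" for i in range(5)]
    print("i.i.d.:    ", evaluate_iid(toy_agent, tasks))
    print("sequential:", evaluate_sequential(toy_agent, tasks))
```

For this toy agent the i.i.d. score stays at about 0.5 while the sequential score rises to about 0.7. An i.i.d. benchmark cannot reward the memory at all, which is the gap the text points at.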

These assumptions have “always” been like this, and developing benchmarks under these assumptions was fine in the first half of AI, because when the intelligence is low, improving intelligence generally improves utility. But now, the general recipe is guaranteed to work under these assumptions. So the way to play the new game of the second half is

  • We develop novel evaluation setups or tasks for real-world utility.
  • We solve them with the recipe or augment the recipe with novel components. Continue the loop.

This game is hard because it is unfamiliar. But it is exciting. While players in the first half solve video games and exams, players in the second half get to build billion- or trillion-dollar companies by building useful products out of intelligence. While the first half is filled with incremental methods and models, the second half filters them to some degree. The general recipe would just crush your incremental methods, unless you create new assumptions that break the recipe. Then you get to do truly game-changing research.

Welcome to the second half!

Acknowledgements

This blog post is based on my talk given at Stanford 224N and Columbia. I used OpenAI deep research to read my slides and write a draft.

Written on April 10, 2025



This article is reproduced from Yao Shunyu's blog (https://ysymyth.github.io/The-Second-Half) and is intended solely for knowledge exchange purposes, not for commercial use. If any infringement issues arise, please feel free to contact us for removal.










