Recommended by
Shi Yunfeng, Executive Director, 5Y Capital
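Scaling laws have let us burn through the internet's stock of free data. That data is enough to crack classic NLP tasks, but not enough to turn models into general, reliable agents. Suppose we trained GPT-4o on all the text that existed before the internet: even with sufficient compute, the data would fall far short.
Looking back at the "bitter lesson" of the past 20 years, architectural improvements have only ever advanced in small steps, while data-driven innovation has had an enormous impact.
I expect this trend to continue. Today's reinforcement learning (RL) is still in something like a "pre-GPT-3" paradigm: RL datasets are tiny. For example, DeepSeek-R1 used only about 600,000 math problems for RL training. If each problem takes a human 5 minutes to complete, that is about 6 years of continuous human effort. By contrast, recreating the 300-billion-token training corpus of GPT-3 would take on the order of tens of thousands of years of writing at a normal human pace.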
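The two estimates above can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope calculation: the problem count, minutes per problem, and token count come from the text, while the words-per-token ratio and the writing speed are assumptions introduced here for illustration.

```python
# Back-of-the-envelope check of the human-time estimates in the text.
MINUTES_PER_YEAR = 60 * 24 * 365  # continuous, round-the-clock effort


def continuous_years(total_minutes: float) -> float:
    """Convert minutes of work into years of nonstop effort."""
    return total_minutes / MINUTES_PER_YEAR


# Figures from the text: about 600,000 problems at 5 minutes each.
rl_minutes = 600_000 * 5
rl_years = continuous_years(rl_minutes)

# Figure from the text: 300 billion tokens.
# ASSUMPTIONS (not from the text): 0.75 words per token, 20 words per minute.
GPT3_TOKENS = 300e9
WORDS_PER_TOKEN = 0.75
WORDS_PER_MINUTE = 20
gpt3_minutes = GPT3_TOKENS * WORDS_PER_TOKEN / WORDS_PER_MINUTE
gpt3_years = continuous_years(gpt3_minutes)

print(f"RL problem set: about {rl_years:.1f} years of continuous work")
print(f"GPT-3 corpus: about {gpt3_years:,.0f} years of continuous writing")
```

Under these assumptions the RL problem set comes out a little under 6 years and the GPT-3 corpus a little over 20,000 years. A faster or slower assumed writing speed shifts the second figure proportionally, but the gap of several orders of magnitude remains.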
Building evals and RL environments is the highest-leverage and most durable use of human time.
Welcome to the Era of Evals
Author: Brendan Foody
Original link:
blog/welcometotheeraofevals/
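Reinforcement learning (RL) is driving the most exciting breakthroughs in AI. RL is becoming so effective that models will soon saturate every existing eval. This means that the primary barrier to deploying agents across the entire economy is building evals for everything.
However, AI labs are facing an "eval famine": the academic evaluations they treat as benchmarks are badly out of step with the tasks consumers and enterprises actually need.
Evals will become the new PRD (product requirements document). Progress in accelerating knowledge work will converge in one direction: building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is far more data-efficient than pretraining, supervised fine-tuning (SFT), or RLHF. Most knowledge work consists of recurring workflows, which are a variable cost; packaging a workflow as an environment or evaluation turns it into a one-time fixed cost.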
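Training on Verifiable Rewards
RL environments allow both final outcomes and intermediate steps to be scored. A model takes many attempts at the same problem, using test-time compute to think before it answers. Human-written autograders reward the trajectories whose reasoning was good. Repeatedly reinforcing those good trajectories teaches the model to solve many kinds of problems with the right chain of thought, while researchers keep hill-climbing on evals. These environments fall roughly along a spectrum of verifiability: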
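·Objective domains
Games such as Pac-Man, chess, and Go have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can also be verified in a nearly game-like way. Objective domains are where RL has already won big: AlphaProof, AlphaFold, DeepSeek-R1, and many code generation models.
·Subjective domains
In the real world, accuracy is hard to quantify for some tasks, such as writing investment memos, preparing legal documents, or providing therapy. For such tasks it is hard to verify whether the model achieved the desired outcome. In addition, experts often hold multiple valid views about the ideal process and outcome. In these settings, rubric-based rewards offer a way to learn from the messiness of expert human opinion. Building environments from rubrics and training on them is a promising research direction, with early roots in Anthropic's Constitutional AI and RLAIF work.
·Computer-use agents: somewhere in between
For most tasks humans do on computers, the goal looks ambiguous at first, but once it is clearly defined, both the actions and the outcomes can be verified programmatically. These tasks include planning trips, replying to emails, shopping online, and posting on social media. Containerized environments allow online learning from thousands of interactions in parallel, with almost no ceiling on horizontal scaling.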
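The loop described in this section (several attempts at one task, an autograder scoring each attempt, and the good trajectories kept for reinforcement) can be sketched with a rubric-based reward. This is a minimal illustration under stated assumptions, not the method used by any lab or by Mercor: the `Criterion` class, the `rubric_reward` and `select_good_attempts` functions, the memo rubric, and the sample attempts are all hypothetical, and the keyword checks stand in for the expert or model-based grading a real rubric would use. The policy update that would consume the selected attempts is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    """One rubric line: a description, a weight, and a programmatic check."""
    description: str
    weight: float
    check: Callable[[str], bool]


def rubric_reward(response: str, rubric: List[Criterion]) -> float:
    """Return the weighted fraction of rubric criteria the response meets."""
    total = sum(c.weight for c in rubric)
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


def select_good_attempts(
    attempts: List[str], rubric: List[Criterion], threshold: float
) -> List[Tuple[float, str]]:
    """Score every attempt and keep (reward, attempt) pairs at or above threshold."""
    scored = [(rubric_reward(a, rubric), a) for a in attempts]
    return [(r, a) for r, a in scored if r >= threshold]


# Hypothetical rubric for a short investment memo (illustrative only).
MEMO_RUBRIC = [
    Criterion("Discusses risk", 2.0, lambda r: "risk" in r.lower()),
    Criterion("Mentions valuation", 1.0, lambda r: "valuation" in r.lower()),
    Criterion("Stays under 50 words", 1.0, lambda r: len(r.split()) < 50),
]

# Stand-ins for several model attempts at the same task.
ATTEMPTS = [
    "Buy now, it will go up.",
    "Key risk is churn; valuation looks fair at 8x revenue.",
    "The main risk is regulatory, so we recommend waiting.",
]

# Attempts that clear the threshold are the ones a trainer would reinforce.
good = select_good_attempts(ATTEMPTS, MEMO_RUBRIC, threshold=0.75)
for reward, attempt in good:
    print(f"{reward:.2f}  {attempt}")
```

Weighting the criteria lets a rubric express that some expert expectations matter more than others, and returning a fraction rather than pass/fail gives partial credit, which is one way to score intermediate quality in domains without a single correct answer.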
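Environments Create Experience
Eventually, our AI systems will learn automatically from real-world signals, such as students' test scores rising, sales closing, or even bridges being built. Even then, intermediate rewards will remain indispensable. Just as humans learn from other people, models will need guidance on which teaching styles or sales techniques work best. Humans will remain an integral part of the environments models learn from.
We will never escape the "era of data"; it must follow us to the frontier. That frontier is human-built environments that provide a durable supply of experiential data. These environments serve both to train and to evaluate models.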
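The Path Forward
Meeting today's data hunger requires rethinking how we extract signal from human effort. Building evals and RL environments is the highest-leverage and most durable use of human time. Mercor has helped pioneer building environments with autograders, and continues to push the boundaries of RL data along dimensions such as simulated workspaces, multi-turn interaction, and multimodality.
Knowledge work will quickly converge on one thing: building RL environments and evals in which agents learn and improve. As AI truly enters the workforce, works with proprietary information, and operates in unique professional contexts, these environments codify both knowledge and goals for agents. Once every step of a workflow is sufficiently reliable, the only remaining task will be to make human-set goals the north star of RL training.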
The original text follows:
Reinforcement Learning (RL) is driving the most exciting advancements in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means that the primary barrier to applying agents to the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations. Academic evaluations that labs goal on don’t reflect what consumers and enterprises demand in the economy.
Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is vastly more data efficient than pretraining, SFT, or RLHF. Most knowledge work includes recurring workflows as variable costs, but creating an environment or evaluation can transform that into a one-time fixed cost.
Training on Verifiable Rewards
RL environments allow for rewarding outcomes and intermediate steps in an evaluation. Models take many attempts at a problem, using test-time compute to think before they answer. Human-created autograders reward the attempts which were good. Reinforcing on those good trajectories upweights the chains of thought that were used to get to the answer. This teaches models to think correctly about different types of problems as researchers iteratively hill climb evals.
These environments can be thought of as existing on a spectrum of rigidity between two categories:
Objective domains: Games, like Pac-Man, chess, and Go, have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has already achieved massive early success, notably AlphaProof, AlphaFold, DeepSeek R1, and the many code generation models on the market today.
Subjective domains: It’s more difficult to measure accuracy in many real-world tasks such as generating investment memos, making legal briefs, or providing therapy. This makes it difficult to verify that a model achieved desired outcomes. Additionally, experts often support multiple valid opinions about desired processes and outcomes. Rubric-based rewards serve as a way to learn from the messiness of expert human opinions. How to evaluate and train with rubrics as environments is an exciting area of research with roots laid as early as constitutional AI and RLAIF work from Anthropic.
Computer-use agents sit somewhere in the middle. For most of the tasks humans do on computers, goals start out ambiguous and multifaceted, but once defined, the actions and outcomes are programmatic and verifiable. These could include planning trips, responding to emails, shopping, or posting on social media. In all of these cases, containerized environments allow for horizontal scaling to learn online from thousands of interactions in parallel.
Environments Create Experience
Eventually, our AI systems will learn automatically from signals in the real world like pupils’ test scores increasing, sales closing, maybe even bridges being built. However, intermediate rewards will always remain critical. Similar to how humans learn from other people, models will need guidance on which styles of teaching and sales techniques are most effective. Humans will remain an integral part of the environments models learn from.
We will never escape the era of data; it must follow us to the frontier. That frontier is human-created environments that provide durable sources of experiential data. These environments can serve to train and evaluate models.
The Path Forward
Meeting today’s data demand requires rethinking the way we generate signal from human efforts. Creating evals and RL environments is the highest leverage and most durable use of people’s time. Mercor has helped pioneer environment generation using autograders and continues to push the boundaries of RL data with simulated workspaces, multi-turn support, and multimodality.
Knowledge work will quickly converge on building RL environments and evaluations for agents to learn from. As AI enters the workforce and operates over proprietary information and under unique professional contexts, these environments codify knowledge and goals for agents. Once individual steps of agentic workflows reach sufficient reliability, all that will be left will be RL training on the goals laid out by humankind.
(Reposted from: 5Y Capital)