
[Reinforcement learning] What it is | Applications | Andrew Barto and Richard Sutton

news  Source: original  2025/9/3 3:55:56

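What is reinforcement learning?

Applications of reinforcement learning

Advertising and recommendation

Dialogue systems

Mainstream reinforcement learning algorithms

The New York Times: Turing Award Goes to 2 Pioneers of Artificial Intelligence

wiki

Sources combined from: YouTube, wiki, GitHub

I came across quite a lot of this during class this afternoon, so I'm putting it all together here; when I have time I'll learn a little more each day (´･ω･`)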

The 2024 Turing Award (announced in March 2025) went to two pioneers of reinforcement learning: Andrew Barto and Richard Sutton

For developing the conceptual and algorithmic foundations of reinforcement learning
What will the future look like?

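What is reinforcement learning?

Reinforcement learning is not one specific algorithm but a family of algorithms.
Like supervised learning and unsupervised learning, it is a general learning paradigm rather than a single method.

The idea behind reinforcement learning is simple. Take a game as an example: if a certain strategy earns a high score, that strategy is further "reinforced" in the hope that it keeps producing good results. This closely resembles the performance bonuses of everyday life, and it is also how we usually get better at games ourselves.
In Flappy Bird, we control the bird with simple taps so that it dodges the pipes and flies as far as possible, because the farther it flies, the higher the score.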

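This is a typical reinforcement learning setup:

The machine controls a well-defined character, the bird: the agent
The bird has to fly as far as possible: the goal
Throughout the game it has to dodge pipes: the environment
It dodges a pipe by flapping once: the action
The farther it flies, the more points it earns: the reward

The biggest difference between reinforcement learning and supervised or unsupervised learning is that it does not need a large, pre-collected dataset to be "fed" with. Instead, the agent generates its own experience and picks up a skill through repeated trial and error (which can still take a very large number of attempts).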
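The agent, environment, action, and reward roles listed above can be made concrete with a small interaction loop. The sketch below is only an illustration: `ToyFlappy`, `run_episode`, and the two policies are hypothetical names invented here, and the "game" is a drastically simplified stand-in for Flappy Bird (a bird on heights 0 to 4 that crashes at the floor or ceiling), not the real game or any library's API.

```python
import random

class ToyFlappy:
    """Hypothetical toy environment: the bird sits at a height 0..4.
    Each step it either flaps (up one cell) or falls (down one cell).
    Touching height 0 (floor) or 4 (ceiling) is a crash."""

    def __init__(self, max_steps=50):
        self.max_steps = max_steps

    def reset(self):
        self.height = 2
        self.steps = 0
        return self.height  # the observation the agent sees

    def step(self, action):
        # action: 1 = flap, 0 = do nothing
        self.height += 1 if action == 1 else -1
        self.steps += 1
        crashed = self.height <= 0 or self.height >= 4
        done = crashed or self.steps >= self.max_steps
        reward = 0 if crashed else 1  # one point per step survived
        return self.height, reward, done


def run_episode(env, policy):
    """Agent-environment loop: observe, act, collect reward, repeat."""
    obs = env.reset()
    total, done = 0, False
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


def random_policy(obs):
    # Pure trial and error: ignores the observation entirely.
    return random.choice([0, 1])


def centering_policy(obs):
    # Hand-written rule: flap when at or below the middle, otherwise fall.
    return 1 if obs <= 2 else 0


if __name__ == "__main__":
    print("random policy score:   ", run_episode(ToyFlappy(), random_policy))
    print("centering policy score:", run_episode(ToyFlappy(), centering_policy))
```

A learning algorithm would sit where the fixed policies are: it would use the rewards returned by `step` to shift from something like `random_policy` toward something like `centering_policy`.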

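Applications of reinforcement learning

Reinforcement learning is still not very mature, and its applications are fairly limited. The largest one so far is games.

Games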

2016: AlphaGo defeated Lee Sedol. In 2017, AlphaGo Zero, trained purely through self-play reinforcement learning, needed only 40 days to surpass its predecessor AlphaGo Master.

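"What Does AlphaGo Zero, Hailed by Scientists as a 'World Feat', Mean for Ordinary People?"

January 25, 2019: AlphaStar defeated top human professional players 10:1 in StarCraft II.

"Humans Lose 1:10 to AI in StarCraft II! DeepMind's 'AlphaStar' Evolves at Lightning Speed"

April 13, 2019: OpenAI (with its OpenAI Five system) defeated the human world champions in a Dota 2 match.

"2:0! Dota 2 World Champion OG Gets Crushed by OpenAI"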

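Robotics

A robot closely resembles the "agent" in reinforcement learning, so reinforcement learning can also play a major role in robotics.

"Through Reinforcement Learning, Robots Can Achieve Human-Like Balance Control"

"Combining Deep Learning and Reinforcement Learning, Google Trains Long-Horizon Reasoning in Robot Arms"

"New Berkeley Reinforcement Learning Research: Robots Learn Trajectory Tracking from Just Minutes of Random Data"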

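Other

Reinforcement learning also has some applications in recommender systems, dialogue systems, education and training, advertising, finance, and other fields:

"The Powerful Combination of Reinforcement Learning and Recommender Systems"

"Policy Adaptation in Dialogue Management Based on Deep Reinforcement Learning"

"Practical Applications of Reinforcement Learning in Industry"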

Advertising and recommendation

Image source: A Reinforcement Learning Framework for Explainable Recommendation

Dialogue systems

Image source: End-to-End Task-Completion Neural Dialogue Systems

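Mainstream reinforcement learning algorithms

Model-free vs. model-based learning

Before going into specific algorithms, it helps to know the two broad classes of reinforcement learning algorithms. The key difference between them is whether the agent has access to, or learns, a model of its environment.

Model-based learning works from a model of the environment, so the agent can think ahead and plan. The drawback is that if the model does not match the real world, performance in actual use will suffer.

Model-free learning gives up on modeling the environment. It is less sample-efficient than the model-based approach, but it is easier to implement and easier to tune to a good state in real scenarios. As a result, model-free methods are more popular and have been developed and tested more extensively.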
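To make the contrast concrete, here is a minimal sketch on a made-up five-state corridor (walk right to reach the goal). The model-based side is illustrated with value iteration, which reads the transition rules directly; the model-free side is illustrated with tabular Q-learning, which only sees sampled transitions. The corridor, the function names, and all hyperparameters are assumptions chosen for illustration, and these two algorithms are just one representative of each class.

```python
import random

N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = (0, 1)  # 0 = left, 1 = right

def model(state, action):
    """True dynamics of the toy corridor: (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def backup(v, state, action):
    """One-step lookahead using the model."""
    nxt, reward, done = model(state, action)
    return reward + (0.0 if done else GAMMA * v[nxt])

def plan_with_model(sweeps=50):
    """Model-based: value iteration plans from the known dynamics,
    without ever 'playing' an episode."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(GOAL):
            v[s] = max(backup(v, s, a) for a in ACTIONS)
    return [max(ACTIONS, key=lambda a: backup(v, s, a)) for s in range(GOAL)]

def learn_without_model(episodes=500, alpha=0.5, epsilon=0.2, seed=0):
    """Model-free: Q-learning treats `model` as a black box to sample from
    and learns action values purely from observed (s, a, r, s') tuples."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)  # explore
            else:
                # Exploit, breaking ties between equal values at random.
                a = max(ACTIONS, key=lambda x: (q[s][x], rng.random()))
            nxt, reward, done = model(s, a)
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[s][a] += alpha * (target - q[s][a])
            s = nxt
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]

if __name__ == "__main__":
    print("planned policy:", plan_with_model())
    print("learned policy:", learn_without_model())
```

Both functions return the greedy action for each non-terminal state. The planner gets there in a few sweeps over the model, while the learner needs hundreds of sampled episodes, which is the efficiency gap described above.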

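Besides the model-free / model-based split, reinforcement learning algorithms can be classified in several other ways:

Policy-based (outputs action probabilities) vs. value-based
Episodic (Monte Carlo) updates vs. single-step (temporal-difference) updates
Online learning vs. offline learning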
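As an illustration of the episodic versus single-step distinction, the sketch below estimates state values on a small random walk in two ways. The Monte Carlo version waits until an episode ends and updates each visited state toward the full return; the TD(0) version updates each state toward the reward plus the current estimate of the next state. The five-state random walk, the function names, and the step sizes are assumptions picked for the example.

```python
import random

def random_walk_episode(rng):
    """Walk left/right at random, starting in the middle of states 1..5.
    States 0 and 6 are terminal; only stepping into 6 pays reward 1.
    Returns a list of (state, reward, next_state) transitions."""
    state, transitions = 3, []
    while 1 <= state <= 5:
        nxt = state + rng.choice((-1, 1))
        transitions.append((state, 1.0 if nxt == 6 else 0.0, nxt))
        state = nxt
    return transitions

def monte_carlo(episodes=20000, alpha=0.002, seed=0):
    """Episodic update: needs the finished episode to know the return."""
    rng = random.Random(seed)
    v = [0.0] * 7  # v[0] and v[6] are terminal and stay 0
    for _ in range(episodes):
        g = 0.0  # undiscounted return, accumulated from the end backwards
        for state, reward, _ in reversed(random_walk_episode(rng)):
            g += reward
            v[state] += alpha * (g - v[state])
    return v[1:6]

def td_zero(episodes=20000, alpha=0.002, seed=0):
    """Single-step update: each transition is enough on its own, so the
    same update could run while the episode is still in progress."""
    rng = random.Random(seed)
    v = [0.0] * 7
    for _ in range(episodes):
        for state, reward, nxt in random_walk_episode(rng):
            v[state] += alpha * (reward + v[nxt] - v[state])
    return v[1:6]

# For this walk the exact values are 1/6, 2/6, ..., 5/6.
TRUE_VALUES = [i / 6 for i in range(1, 6)]

if __name__ == "__main__":
    print("true:       ", [round(x, 2) for x in TRUE_VALUES])
    print("Monte Carlo:", [round(x, 2) for x in monte_carlo()])
    print("TD(0):      ", [round(x, 2) for x in td_zero()])
```

Both estimators approach the same values; they differ in when the update can happen and in how noisy each individual update is.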

The New York Times: Turing Award Goes to 2 Pioneers of Artificial Intelligence

Andrew Barto and Richard Sutton developed reinforcement learning, a technique vital to chatbots like ChatGPT.

In 1977, Andrew Barto, as a researcher at the University of Massachusetts, Amherst, began exploring a new theory that neurons behaved like hedonists. The basic idea was that the human brain was driven by billions of nerve cells that were each trying to maximize pleasure and minimize pain.

A year later, he was joined by another young researcher, Richard Sutton. Together, they worked to explain human intelligence using this simple concept and applied it to artificial intelligence. The result was “reinforcement learning,” a way for A.I. systems to learn from the digital equivalent of pleasure and pain.

On Wednesday, the Association for Computing Machinery, the world’s largest society of computing professionals, announced that Dr. Barto and Dr. Sutton had won this year’s Turing Award for their work on reinforcement learning. The Turing Award, which was introduced in 1966, is often called the Nobel Prize of computing. The two scientists will share the $1 million prize that comes with the award.

Over the past decade, reinforcement learning has played a vital role in the rise of artificial intelligence, including breakthrough technologies such as Google’s AlphaGo and OpenAI’s ChatGPT. The techniques that powered these systems were rooted in the work of Dr. Barto and Dr. Sutton.

“They are the undisputed pioneers of reinforcement learning,” said Oren Etzioni, a professor emeritus of computer science at the University of Washington and founding chief executive of the Allen Institute for Artificial Intelligence. “They generated the key ideas — and they wrote the book on the subject.”

Their book, “Reinforcement Learning: An Introduction,” which was published in 1998, remains the definitive exploration of an idea that many experts say is only beginning to realize its potential.

Psychologists have long studied the ways that humans and animals learn from their experiences. In the 1940s, the pioneering British computer scientist Alan Turing suggested that machines could learn in much the same way.

But it was Dr. Barto and Dr. Sutton who began exploring the mathematics of how this might work, building on a theory that A. Harry Klopf, a computer scientist working for the government, had proposed. Dr. Barto went on to build a lab at UMass Amherst dedicated to the idea, while Dr. Sutton founded a similar kind of lab at the University of Alberta in Canada.

“It is kind of an obvious idea when you’re talking about humans and animals,” said Dr. Sutton, who is also a research scientist at Keen Technologies, an A.I. start-up, and a fellow at the Alberta Machine Intelligence Institute, one of Canada’s three national A.I. labs. “As we revived it, it was about machines.”

This remained an academic pursuit until the arrival of AlphaGo in 2016. Most experts believed that another 10 years would pass before anyone built an A.I. system that could beat the world’s best players at the game of Go.

But during a match in Seoul, South Korea, AlphaGo beat Lee Sedol, the best Go player of the past decade. The trick was that the system had played millions of games against itself, learning by trial and error. It learned which moves brought success (pleasure) and which brought failure (pain).

The Google team that built the system was led by David Silver, a researcher who had studied reinforcement learning under Dr. Sutton at the University of Alberta.

Many experts still question whether reinforcement learning could work outside of games. Game winnings are determined by points, which makes it easy for machines to distinguish between success and failure.

But reinforcement learning has also played an essential role in online chatbots.

Leading up to the release of ChatGPT in the fall of 2022, OpenAI hired hundreds of people to use an early version and provide precise suggestions that could hone its skills. They showed the chatbot how to respond to particular questions, rated its responses and corrected its mistakes. By analyzing those suggestions, ChatGPT learned to be a better chatbot.

Researchers call this “reinforcement learning from human feedback,” or R.L.H.F. And it is one of the key reasons that today’s chatbots respond in surprisingly lifelike ways.

(The New York Times has sued OpenAI and its partner, Microsoft, for copyright infringement of news content related to A.I. systems. OpenAI and Microsoft have denied those claims.)

More recently, companies like OpenAI and the Chinese start-up DeepSeek have developed a form of reinforcement learning that allows chatbots to learn from themselves — much as AlphaGo did. By working through various math problems, for instance, a chatbot can learn which methods lead to the right answer and which do not.

If it repeats this process with an enormously large set of problems, the bot can learn to mimic the way humans reason — at least in some ways. The result is so-called reasoning systems like OpenAI’s o1 or DeepSeek’s R1.

Dr. Barto and Dr. Sutton say these systems hint at the ways machines will learn in the future. Eventually, they say, robots imbued with A.I. will learn from trial and error in the real world, as humans and animals do.

“Learning to control a body through reinforcement learning — that is a very natural thing,” Dr. Barto said.

wiki


Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.

Reinforcement learning differs from supervised learning in not needing labelled input-output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) with the goal of maximizing the cumulative reward (the feedback of which might be incomplete or delayed).[1] The search for this balance is known as the exploration–exploitation dilemma.
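The exploration and exploitation trade-off described here is often illustrated with a multi-armed bandit. The sketch below uses an epsilon-greedy rule, one simple way (among several) to strike that balance: with probability `epsilon` the agent tries a random arm, otherwise it pulls the arm with the best estimate so far. The function name, the arm probabilities, and the parameter values are assumptions for illustration.

```python
import random

def epsilon_greedy_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Play a Bernoulli bandit; return (estimates, pull counts, total reward)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: try any arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # The agent only ever sees this sampled reward, never true_means.
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental sample average of the rewards seen for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return estimates, counts, total

if __name__ == "__main__":
    estimates, counts, total = epsilon_greedy_bandit([0.2, 0.5, 0.8])
    print("estimated payout per arm:", [round(e, 2) for e in estimates])
    print("pulls per arm:           ", counts)
    print("total reward:            ", total)
```

With `epsilon=0` the agent can lock onto whichever arm happened to pay first; with `epsilon=1` it never uses what it has learned. Values in between trade some reward now for better estimates later.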

The environment is typically stated in the form of a Markov decision process (MDP), as many reinforcement learning algorithms use dynamic programming techniques.[2] The main difference between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume knowledge of an exact mathematical model of the Markov decision process, and they target large MDPs where exact methods become infeasible.[3]
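For reference, the link between MDPs, dynamic programming, and model-free learning can be written down in standard textbook notation. None of the symbols below appear in the excerpt above; they are the usual conventions: $P$ for transition probabilities, $R$ for rewards, $\gamma$ for the discount factor, $\alpha$ for the step size. Dynamic programming solves the Bellman optimality equation, which requires knowing $P$ and $R$:

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[ R(s, a, s') + \gamma\, V^*(s') \bigr]
```

A model-free method such as Q-learning replaces the sum over $P$ with a single sampled transition $(s, a, r, s')$, so no explicit model is needed:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]
```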
