Deep Reinforcement Learning
What is deep reinforcement learning?
Deep reinforcement learning can best be explained as a method for learning to make a series of good decisions over time. It’s how humans negotiate the world from the very moment they’re born.
Babies who smile at their parents and are rewarded with approval learn that smiling prompts affection. Likewise, they learn that crying brings attention, summoning a parent to get rid of some annoyance or provide comfort.
Parental approval or attention is reinforcement in its most basic form: an action met with a reward or a penalty.
But, of course, a child’s learning does not stop there. The baby interprets their parents’ feedback, and a new set of actions becomes available based on mom or dad’s reaction.
The strategy driving the baby's decisions based upon this feedback is called a “policy” in reinforcement learning. It’s the approach the baby uses in pursuit of their goals. Each decision becomes an “action” the baby takes in response to their current situation, called the “state” in reinforcement learning.
This simple description exposes many complexities, including the environment in which the baby makes their choices, the number of available options (action set), and the ideas driving the decisions (policies). Each of these aspects can be represented or learned in different ways.
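To make that vocabulary concrete, here is a minimal sketch of the decision loop in Python. The toy environment, its reward rule, and the random policy are all invented for illustration; real agents and environments are far richer.

```python
# A minimal sketch of the reinforcement learning loop: an agent observes a
# state, its policy picks an action, and the environment returns a new state
# and a reward. The environment and reward rule here are invented toys.
import random

class ToyEnvironment:
    """The agent starts at position 0 and is rewarded for reaching position 5."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                  # the action changes the state
        reward = 1 if self.state == 5 else 0  # the reward signals success
        return self.state, reward

def policy(state):
    """The strategy mapping a state to an action; here, just a random walk."""
    return random.choice([-1, 1])

env = ToyEnvironment()
state = env.state
for t in range(100):                          # a series of decisions over time
    action = policy(state)
    state, reward = env.step(action)
    if reward:
        print(f"goal reached after {t + 1} actions")
        break
```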
One such method that has become incredibly popular in recent years uses deep learning to ascertain the model of the environment, policies, rewards, or penalties, and other components of reinforcement learning.
But before we can adequately explain this complex topic, we need to understand its predecessors.
Machine learning refers to using and developing computer systems capable of learning and adapting without following explicit instructions. Instead, they use algorithms and statistical models to analyze and draw inferences from patterns in data.
Deep learning is a subset of machine learning designed to function like the human brain, using algorithms called artificial neural networks. These algorithms help computers find patterns and possibilities no human could uncover alone.
Complex mathematical formulas are central to this process, and many of them may be needed, because machine learning works best with large datasets.
By combining reinforcement learning and deep learning, deep reinforcement learning replaces the need for multiple complex formulas.
If we knew the outcome of every decision we could make, we wouldn’t need deep reinforcement learning. In that case, we could create an algorithm to tell us which decision to make to achieve a specific outcome.
But we can’t accurately predict our complex world, so we need tools like deep reinforcement learning to help us account for problems with multiple variables.
Deep reinforcement learning background
The Bellman equation, named after American mathematician Richard E. Bellman, was developed in the 1950s and is central to finding optimal decision sequences. The equation, at the heart of many reinforcement learning algorithms, tells users what long-term reward they can expect, given their current state, if they take the optimal action now and at every step in the future.
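In one common form, with V*(s) as the long-term value of state s, R(s, a) as the immediate reward for taking action a, P(s' | s, a) as the probability of landing in state s', and γ as a discount factor that weights future rewards, the equation reads:

```latex
V^{*}(s) = \max_{a} \Bigl[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Bigr]
```

The recursion captures the idea in words above: the value of acting optimally now equals the reward collected immediately, plus the discounted value of acting optimally from wherever that action leads.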
To use the equation, one must first comprehend the environment and learn the rewards. Because of these requirements, scientists previously used the Bellman equation only to solve more straightforward problems with smaller decision spaces, like navigating a maze.
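As a sketch of what solving such a small problem looks like, the value iteration loop below applies the Bellman update to a toy six-cell corridor maze. The layout, the reward of 1 at the exit, and the discount factor of 0.9 are illustrative assumptions, not a prescribed setup.

```python
# Value iteration on a toy six-cell corridor "maze": repeatedly applying the
# Bellman update until the long-term value of every cell settles. The layout,
# exit reward of 1, and discount factor of 0.9 are illustrative assumptions.
N = 6            # cells 0..5; cell 5 is the exit (terminal, its value stays 0)
GAMMA = 0.9      # discount factor weighting future reward
V = [0.0] * N    # long-term value estimate for each cell

def step(s, a):
    """Deterministic move: a is -1 (left) or +1 (right), clipped to the maze."""
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s2 == N - 1 else 0.0)

for _ in range(100):          # sweep until the values converge
    for s in range(N - 1):    # skip the terminal exit cell
        # Bellman update: best immediate reward plus discounted future value
        V[s] = max(r + GAMMA * V[s2]
                   for s2, r in (step(s, a) for a in (-1, 1)))

def best_action(s):
    """Greedy action with respect to the learned values."""
    def q(a):
        s2, r = step(s, a)
        return r + GAMMA * V[s2]
    return max((-1, 1), key=q)

print([round(v, 2) for v in V])                # values rise toward the exit
print([best_action(s) for s in range(N - 1)])  # every cell says: go right (+1)
```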
It wasn’t until decades later that scientists began making real strides in developing reinforcement algorithms, trying to replicate the type of decision-making carried out by the human brain—but in computer form.
Early algorithms for reinforcement learning were not particularly powerful. At the time, scientists could not capture very complex knowledge.
Reinforcement learning, which allows a computer to learn from its mistakes through trial and error, didn’t become broadly practical until around 2011, after deep learning made its debut.
In the mid-1960s, Soviet mathematician Alexey Ivakhnenko and his associate, Valentin Grigor’evich Lapa, crafted small but viable artificial neural networks.
In the early 1980s, Caltech physicist John Hopfield’s recurrent neural networks made headlines, reigniting interest in the field. His work, closely based on neuroscience research about learning and memory, was fundamental to all that came after.
Deep learning has seen remarkable success, proving superior to traditional machine learning approaches in various application areas. These include computer vision, speech recognition, language translation, art, medical information processing, robotics and control, and cybersecurity.
The recent resurgence of interest in deep learning came as the result of three significant factors. First, scientists gained access to tremendous amounts of data: between 2020 and 2025, the total amount of information created, captured, transferred, and consumed is expected to be twice the amount humanity produced before 2020.
Second, computing power has grown remarkably in recent decades. And third, algorithms have vastly improved in their ability to perform complex tasks.
These factors come together to support bigger, better, more complex models—the type used to solve complex problems in energy, natural language processing, and many other areas.
One class of algorithms in particular, those associated with deep learning, has made fundamental contributions in areas such as computer vision, natural language processing, health informatics, and medical diagnostics.
Deep learning significance
Any problem that aims to find a sequence of optimal decisions, from routing traffic and maintaining the power grid to evacuating a city during a flood or servicing a power station, can be approached using deep reinforcement learning.
Scientific discovery, up until this point, has been chiefly driven by experimentation and simulation—researchers, in trying to understand a phenomenon, replicate it on computers. They then compare the results of the simulation to what we know about real life.
Of course, simulations can create conditions not yet realized in the real world. A scientist could, for example, modify the location of an atom in a molecule or add another atom. They might not experiment in the physical world but can use these tactics to explore the development of new materials and chemical properties they otherwise could not.
Scientists can enhance this exploration through the use of deep reinforcement learning. So, it’s not surprising to see it employed with great success in autonomous robotics. Self-driving cars are just one example. Earlier applications helped build robotic manipulators—almost like toys.
Robotics that make decisions based on these models are now managing warehouses and the movement of goods.
Concepts from reinforcement learning have long been used in games, starting with backgammon, a simple game that a computer can be trained to play without much difficulty. Chess is more complex. The game of Go, even more so.
According to an article in the journal Nature, Go is considered the most challenging of classic games for artificial intelligence, in part because of the difficulty of evaluating board positions and moves. The number of possible board positions is staggering: more, if one can imagine it, than the number of atoms in the universe.
The game originated in China thousands of years ago and was considered fundamental to any true Chinese scholar. Enjoyed by tens of millions around the globe, the game has simple rules. Players take turns placing black or white stones on a board, earning points by capturing their opponent’s stones or occupying empty territory on the board.
DeepMind, a company started in the United Kingdom in 2010, developed the first computer program to defeat a professional human Go player. The program, named AlphaGo, was also the first to beat a Go world champion.
Principles of deep reinforcement learning
Deep reinforcement learning helps us make better decisions faster.
Getting back to our human example, let’s age our baby from infancy to high school. Now our experimental child is tasked with picking the perfect college.
It’s an overwhelming decision. The options are far more numerous than those our child faced as an infant, and it’s much tougher to make a good decision when the choices are vast.
Each university will have its selling points—rigor, reputation, and location, to name just a few—and each will offer a different financial package. Our student and their parents must account for all of these variables.
Mom and dad might ask around about which program is best. Then, they might consider statistical data about student success after college, examining past examples from other young people who made a similar decision.
Without realizing it, the student and their parents create a model of the environment and what they need to do to be successful. Knowing this environment well helps them make an informed and optimized decision.
They might, for example, narrow the list from hundreds to about ten schools and start sending out applications. The decision is critical. They won’t get another opportunity, and they can’t choose a different school year after year in an attempt to get it right.
This practice describes a model-based approach to reinforcement learning, in which decisions are guided by a learned model of the environment.
Now, multiply such a dilemma by several orders of magnitude, and one would arrive at a problem a national laboratory might try to solve.
One key concept in decision-making systems is the Markov property, named for Russian mathematician Andrey Markov. For systems that have the Markov property, future states depend on their current state, not on their condition at previous steps. Stated another way, it would be the equivalent of saying, “the outcome of the decision I’m going to make now will not be affected by what happened in the past, but only by the current situation I’m in.”
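Written out, the Markov property says the distribution over the next state depends only on the present state and action, with the entire earlier history dropping out:

```latex
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t)
```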
Another, of course, is the Bellman equation.
Back to our college-bound high school senior. In selecting a college, they choose a major, examine each school’s offering in that area, and research starting salaries in their field upon graduation, leading to an optimal choice.
The Bellman equation works on the assumption that our student makes the best decisions possible at all times. Good choices lead to good outcomes. According to this theory, if our student makes a good decision now (where to go to college), this will move them to a state where they can make optimal decisions about the future.
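As a playful sketch of that assumption, the toy calculation below scores each option as its immediate reward plus a discounted estimate of the value of the state it leads to, exactly the Bellman decomposition. Every school name and number is invented for illustration.

```python
# Toy illustration of the Bellman decomposition applied to the college choice:
# value(option) = immediate reward + discounted value of the state it leads to.
# All school names and numbers below are invented for illustration.
GAMMA = 0.9  # how heavily the student weighs the future against the present

options = {
    # name: (fit right now, estimated value of the post-graduation state)
    "State U":      (7.0, 6.5),
    "Tech Inst":    (5.5, 9.0),
    "Tiny College": (8.0, 4.0),
}

def value(name):
    fit_now, future = options[name]
    return fit_now + GAMMA * future   # reward now + discounted optimal future

best = max(options, key=value)
print(best, round(value(best), 2))    # "Tech Inst 13.6": the future dominates
```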
Deep reinforcement learning limitations
Deep reinforcement learning can help scientists and researchers make good decisions based on exploring simulation scenarios.
But it’s limited by the same concept—those who seek to use the approach must have a solid grasp of their environment.
Not all systems can be made flawless by using deep reinforcement learning. For example, despite millions upon millions of data points, self-driving cars have not mastered all possible conditions and sometimes make mistakes.
For systems where consequences can be dire, deep reinforcement learning cannot work by itself. For example, one cannot afford to crash a plane to learn how to fly it. Scientists are working to make reinforcement learning safe by augmenting its capabilities with other algorithms and methods.
Deep reinforcement learning needs both time and experiential data to produce the best outcome.
New and future developments in deep reinforcement learning
Scientists are constantly developing new algorithms to improve deep reinforcement learning’s success rate.
Other forms of reinforcement learning are also gaining steam, including inverse reinforcement learning, in which a machine learns by observing an expert. Instead of learning from its own experience, the machine learns from watching others. The expert is not a teacher, just someone or something executing a task rather than explaining it.
Goal-conditioned reinforcement learning breaks down complex reinforcement learning problems by using subgoals.
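Structurally, the change is small: the goal, or the current subgoal, simply becomes an extra input to the policy. The minimal sketch below, with toy number-line dynamics, shows one policy being reused across a chain of subgoals.

```python
# Minimal sketch of goal-conditioned reinforcement learning: the policy takes
# the goal (or current subgoal) as an extra input, so a single policy can be
# reused for every subgoal. The number-line dynamics here are a toy stand-in.
def policy(state, goal):
    """Move one step toward whichever goal is currently set."""
    return 1 if goal > state else -1

def reach(state, final_goal, subgoals):
    """Break a long-horizon problem into a chain of easier subgoals."""
    for g in subgoals + [final_goal]:
        while state != g:
            state += policy(state, g)   # same policy, different goal each time
    return state

print(reach(0, 10, subgoals=[3, 6]))    # reaches 10 by way of 3 and 6
```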
Multi-agent reinforcement learning is instrumental in solving robotics, telecommunications, and economics problems.
The complexity of tasks associated with these fields of study makes them hard to complete with preprogrammed behaviors. The solution? Allow agents to discover the answers on their own through learning.
Deep reinforcement learning at Pacific Northwest National Laboratory
Pacific Northwest National Laboratory is a leader in machine learning and artificial intelligence. PNNL’s artificial intelligence research has been applied across various fields, bolstering national security and strengthening the electric grid.
Its DeepGrid open-source platform uses deep reinforcement learning to help power system operators create more robust emergency control protocols, augmenting and protecting this last safety net for grid security and resilience. PNNL has also been using deep reinforcement learning to make strides in cybersecurity, much of which must be automated because systems are constantly under attack.
To further support national missions, PNNL is working to improve the quality of models by boosting the integrity of the data that informs these systems, increasing their accuracy, interpretability, and defensibility.