What Are Advanced Computing Testbeds?
Advanced computing has transformed the way we live, touching nearly every aspect of our lives. It has helped us understand and protect our environment, develop new sources of renewable energy, and secure our nation. It is so fundamental to our relationship with the natural world that scientists consider it to be the third pillar of research, alongside theory and experimentation.
Advanced computing testbeds, the proving grounds for new machines, are central to the development of next-generation computers. They allow researchers to explore a complex and non-linear design space and facilitate the evaluation of new computing technologies in terms of performance and efficiency on critical scientific workloads. These “laboratories of machines,” in which multiple components are available for experimentation, are critical to the next greatest advancements in computation.
Advanced computers, which cost hundreds of millions of dollars and take years to plan and build, can be constructed and programmed in countless ways. The most effective among them require fundamentally new answers to questions about application algorithms, programming models, system architecture, component/device technology, resilience, power, and cost.
This process of “building the machine” requires both theory and experimentation, making advanced computing testbeds ever more essential. They give scientists, engineers, and designers an opportunity to test whether their vision of a machine’s programming and architecture will deliver reasonable performance and efficiency. This type of experimentation is akin to the push and pull between theoretical and experimental physics: a scientist imagines what type of system might work, but there is distance between theory and reality.
Advanced Computing Testbeds Background
The research and development of even the oldest “modern” computers—those built during World War II for complex mathematical calculations related to ballistics for artillery—required testbeds.
This continued through the next generation of computers in the 1950s and ’60s, as these machines moved from mechanical to electronic devices. Around this time, computers expanded their reach into the business world, where they were used for business forecasting. Early machines were programmed with stacks of punch cards. Later innovations in software, specifically the creation of programming languages, spurred rapid change, drastically expanding computers’ capabilities.
The modern era began with Seymour Cray and his vector machines, which made their debut in the late ’70s and early ’80s. These highly specialized computers were particularly valuable in scientific discovery because they could efficiently “vectorize” loops of instructions: rather than applying a set of instructions to each data point individually, they could apply the same instructions to whole sets of values at once. This allowed for rapid processing over vectors of data and proved useful in fluid dynamics and, later, climate modeling.
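To make the idea of vectorizing a loop concrete, here is a minimal sketch in Python. It is a toy model, not a description of how any Cray machine worked: it counts “instruction issues” instead of measuring time, it charges one issue per chunk even though a real vector unit would issue separate load, multiply, and add instructions, and the vector width of 64 is an assumption chosen only for illustration.

```python
# Toy model of vector processing: counts "instruction issues", not real time.
# VECTOR_WIDTH is an illustrative assumption, not a figure from the article.
VECTOR_WIDTH = 64


def scalar_scale_add(a, x, y):
    """Compute a*x[i] + y[i], charging one issue per element."""
    result = []
    issues = 0
    for xi, yi in zip(x, y):
        result.append(a * xi + yi)
        issues += 1
    return result, issues


def vector_scale_add(a, x, y, width=VECTOR_WIDTH):
    """Compute the same values, charging one issue per chunk of `width` elements."""
    result = []
    issues = 0
    for start in range(0, len(x), width):
        chunk_x = x[start:start + width]
        chunk_y = y[start:start + width]
        result.extend(a * xi + yi for xi, yi in zip(chunk_x, chunk_y))
        issues += 1
    return result, issues


x = [float(i) for i in range(1000)]
y = [1.0] * 1000
scalar_result, scalar_issues = scalar_scale_add(2.0, x, y)
vector_result, vector_issues = vector_scale_add(2.0, x, y)
print(scalar_issues, vector_issues)  # prints: 1000 16
```

Both versions produce identical results. Only the number of times an instruction has to be issued changes, which is the saving the vector approach is after.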
The ’90s brought a new idea in computing: linking multiple machines to one another to increase their combined power. The idea carried through to the development of petascale computers, those capable of at least 10^15 floating-point operations per second (FLOPS), or 1 petaFLOPS. The first petascale computer was Roadrunner, developed by IBM and deployed at Los Alamos National Laboratory in 2008.
Exascale systems will still be large collections of interconnected computer “nodes,” but the nodes themselves are becoming more complex. Scientists expect this trend to continue.
Much of the innovation in computer architecture now, and in advanced testbeds, is within the node rather than in the ways nodes are connected. The explosion in the size of the node-level design space is one reason testbeds are important. Heterogeneous architectures and processor designs make it difficult to determine the “best fit” for key workloads of interest. Testbeds allow for “small-scale” experimentation to weigh the trade-offs.
Gordon Moore, co-founder and former chairman of Intel Corporation and a semiconductor pioneer, observed in 1965 that the number of transistors on a chip was doubling roughly every year, a rate he revised in 1975 to about every two years. The rule proved true for the next several decades, greatly boosting computers’ power and efficiency from year to year. But the trend wouldn’t hold forever: transistor technology hit a wall in the early 2000s and could no longer be relied upon to bring about major advancements in computing.
Scientists had to dig deeper, searching for new, revolutionary technology to improve performance. This is why they developed “accelerated” systems, in which co-processors (such as graphics processing units, or GPUs) are used to accelerate particularly compute-intensive portions of software. Most top-tier large-scale supercomputers are now “hybrid” systems in this way. This constant need for innovation makes testbeds even more relevant.
The Importance of Advanced Computing Testbeds
Advanced computing has proven essential to the development of more accurate climate models, allowing for a far better understanding of local weather patterns. This new information makes us better able to predict severe weather and manage its impacts.
It has also been used to accelerate computational chemistry calculations, discovering new types of catalysis for energy storage. In some cases, this means refining already known processes, while in others it means looking for more revolutionary practices.
Advanced computing has also greatly enhanced our understanding of nuclear weapons. There is no physical testing now; all of it is done through simulation. It has also proven critical to protecting major infrastructure, including the national power grid, which consists of more than 7,300 power plants and 160,000 miles of high-voltage power lines. Models allow us to anticipate demand and prevent cascading blackouts.
Advanced computing’s success and expanded capabilities will only continue to shape our future, which is why it remains an essential focus of governments and institutions around the world.
One way to envision the effort needed to improve a computer’s power and efficiency from one generation to the next is to imagine a freeway. When we widen a two-lane road into an eight-lane highway, we have to do more than simply quadruple the amount of material we use. At some point, the road becomes unwieldy: a driver might have to cross six lanes to reach an exit.
Likewise, linking ever more computers together eventually causes problems. At some point, it becomes too difficult to connect them efficiently, effectively, and sustainably. At least one of the modules or network cables would likely fail, causing the application to crash.
Advanced computers have on the order of a million components. Each runs roughly a billion threads (a thread is a stream of instructions that tells the computer what calculations to perform), which means many opportunities for breakdown. One or two might fail every hour. Some of these failures are silent, corrupting data without being detected. At the other extreme is a fatal error that causes a program to crash or some part of the machine to shut down. Either kind can seriously disrupt scientific work.
New devices must account for all of these concerns. Not surprisingly, the leap to exascale moved away from simply stringing devices together, looking back to the earlier notion of creating highly specialized machines built to complete a specific set of tasks, each looking and operating differently, with its own special components and strengths.
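The failure figures above can be turned into rough numbers. The sketch below assumes failures arrive independently at a constant average rate (a Poisson process) and takes 1.5 failures per hour as the midpoint of “one or two.” Both are modeling assumptions, not measurements, and not every failure necessarily brings down a running application.

```python
import math

COMPONENTS = 1_000_000     # "a million components", from the text
FAILURES_PER_HOUR = 1.5    # assumed midpoint of "one or two might fail every hour"


def expected_failures(hours, rate=FAILURES_PER_HOUR):
    """Average number of failures over a run of the given length."""
    return rate * hours


def prob_no_failure(hours, rate=FAILURES_PER_HOUR):
    """Chance a run of the given length sees no failure at all (Poisson assumption)."""
    return math.exp(-rate * hours)


def implied_component_mtbf_hours(components=COMPONENTS, rate=FAILURES_PER_HOUR):
    """Mean time between failures per component, if all components are equally reliable."""
    return components / rate


print(f"Expected failures in a 24-hour run: {expected_failures(24):.0f}")
print(f"Chance of a failure-free hour: {prob_no_failure(1):.1%}")
print(f"Implied per-component MTBF: {implied_component_mtbf_hours() / 8760:.0f} years")
```

Under these assumptions, each component would fail only about once in several decades, yet the machine as a whole rarely gets through an hour untouched. That gap is why resilience has to be part of the design.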
Advanced computing testbeds give system architects and domain scientists a vehicle for codesign of the hardware and software components to be included in a next-generation system. Once one of these large-scale systems is deployed, the only way to make sure the delivered performance matches expectations is for software engineers to modify their code or algorithms to conform to the capabilities of the machine. At that point, there are not many degrees of freedom left.
With testbeds, architects and domain scientists can work together to make sure the machine delivers the capabilities the algorithms need, while, at the same time, algorithmic changes can be made to better utilize the machine. The design of the software and hardware is made “in concert”.
Testbeds can be small scale, so numerous options and configurations can be explored. This includes the utilization of specialized hardware, such as GPUs. (GPUs, which were originally designed to render graphics for video games, have since been adapted for scientific computing.) Testbeds also provide the ability to explore novel and disruptive technologies without major cost, so designers can consider multiple generations of computing technologies in their designs.
No matter how a new computer is constructed and programmed, energy consumption is a major issue. Some of the world’s most advanced computers use 30 megawatts of power, and a single megawatt would be enough to power 400 to 900 homes.
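The arithmetic behind those power figures is simple enough to sketch. The snippet below uses only the numbers quoted above and assumes the machine runs continuously at full load for a year, which overstates real usage. The electricity price is a hypothetical placeholder supplied by the caller, not a quoted rate.

```python
# Figures from the text: a 30 MW system, and 1 MW powering 400 to 900 homes.
SYSTEM_POWER_MW = 30
HOMES_PER_MW_LOW = 400
HOMES_PER_MW_HIGH = 900
HOURS_PER_YEAR = 24 * 365


def homes_equivalent(power_mw):
    """Return the (low, high) number of homes the same power could supply."""
    return power_mw * HOMES_PER_MW_LOW, power_mw * HOMES_PER_MW_HIGH


def annual_energy_mwh(power_mw):
    """Energy used in a year of continuous full-load operation, in megawatt-hours."""
    return power_mw * HOURS_PER_YEAR


def annual_energy_cost_usd(power_mw, price_per_mwh_usd):
    """Yearly electricity bill for a caller-supplied price; no price is assumed here."""
    return annual_energy_mwh(power_mw) * price_per_mwh_usd


low, high = homes_equivalent(SYSTEM_POWER_MW)
print(f"{SYSTEM_POWER_MW} MW is comparable to {low:,} to {high:,} homes")
print(f"Energy per year at full load: {annual_energy_mwh(SYSTEM_POWER_MW):,} MWh")
# The $100/MWh price is a hypothetical placeholder for illustration only.
print(f"Cost at a hypothetical $100/MWh: ${annual_energy_cost_usd(SYSTEM_POWER_MW, 100):,}")
```

A 30-megawatt machine therefore draws as much power as roughly 12,000 to 27,000 homes.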
Of course, energy costs are not the only consideration. They are in addition to the price tag for the device itself plus that of operational staff, application development, and other expenses. Testbeds help scientists account for these costs.
Limitations of Advanced Computing Testbeds
There are many different kinds of advanced computing testbeds. Some explore different processors and methods of processing, while others focus on the computer’s memory and its role in computation. Memory technologies involve trade-offs: some favor speed (low latency), while others favor high bandwidth, high capacity, or persistence (data remains when the power is off).
Networking is another critical issue. How will the data be moved? Data movement is both slow and power intensive and is widely considered one of the most important problems to address. It calls for an incredible amount of hardware—wires and transistors—all of which takes up a large amount of physical space. Setting aside quantum computers, there is only a finite amount of space for these materials.
The methods for moving data have improved incrementally, but the best techniques are approaching the limit set by the speed of light. As a result, questions of data movement immediately raise questions of machine architecture, programming models, and algorithms. Can we mix compute and memory? Can we write programs in a way that minimizes data movement? What is the best way to reason about data movement?
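A quick calculation shows why the speed of light matters at this scale. The sketch below computes the minimum one-way travel time for a signal over a given distance. The 30-meter distance, the 2 GHz clock, and the 0.7 velocity factor (signals in cables travel slower than light in a vacuum) are illustrative assumptions, not figures from any particular machine.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the meter


def min_one_way_latency_ns(distance_m, velocity_factor=1.0):
    """Lower bound on signal travel time in nanoseconds.

    velocity_factor is the fraction of the vacuum speed of light at which
    the signal actually propagates (1.0 is the physical limit).
    """
    return distance_m / (SPEED_OF_LIGHT_M_PER_S * velocity_factor) * 1e9


def clock_cycles_waited(distance_m, clock_ghz, velocity_factor=1.0):
    """Processor cycles that elapse while the signal is in flight."""
    return min_one_way_latency_ns(distance_m, velocity_factor) * clock_ghz


# Assumed example: two nodes 30 meters apart, processors clocked at 2 GHz.
print(f"Vacuum limit over 30 m: {min_one_way_latency_ns(30):.1f} ns")
print(f"At 0.7c in a cable:     {min_one_way_latency_ns(30, 0.7):.1f} ns")
print(f"Cycles in flight at 2 GHz: {clock_cycles_waited(30, 2.0, 0.7):.0f}")
```

Even at the physical limit, a processor sits through hundreds of cycles while a single message crosses the room. No improvement in cabling can remove that cost, so the remaining gains have to come from moving less data or moving it shorter distances.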
Data storage is another sticking point and comes with tradeoffs relating to the computer’s speed, capacity, persistence, resilience, and overall agility. And this is where testbeds come into play.
A processing testbed is essentially a machine room with perhaps 20 different processors on which scientists can search for strengths and weaknesses. The variables include hardware components (processors, memory, storage, and networking), system control mechanisms such as operating system modules and execution runtimes, and the programming models, algorithms, and workloads that run on them.
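A small sketch shows how quickly these variables multiply. The option names below are hypothetical placeholders, not real products or CENATE inventory. Only the count of 20 processors comes from the text, and a real study would cover far more dimensions than the four used here.

```python
from itertools import product

# Hypothetical design space; every option name is a placeholder.
design_space = {
    "processor": [f"processor_{i:02d}" for i in range(20)],  # "perhaps 20" from the text
    "memory": ["low_latency", "high_bandwidth", "persistent"],
    "network": ["network_a", "network_b"],
    "programming_model": ["model_a", "model_b", "model_c"],
}


def enumerate_configurations(space):
    """Yield every combination of one option per variable as a dict."""
    keys = list(space)
    for combo in product(*(space[key] for key in keys)):
        yield dict(zip(keys, combo))


configs = list(enumerate_configurations(design_space))
print(len(configs))   # 20 * 3 * 2 * 3 = 360
print(configs[0])
```

Four variables with a handful of options each already yield 360 configurations. Adding storage, runtimes, algorithms, and workloads multiplies that count again, which is why small-scale, repeatable experiments are the practical way to explore the space.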
Though they have allowed for great advancement, advanced computing testbeds are not without limitations and challenges. There is always a danger in extrapolating experimental data beyond the conditions under which it was collected.
This could mean, for instance, collecting data on a hardware/software testbed that differs significantly from the final product, or executing an application workload that is different from what will occur in the real world.
This is particularly relevant for high-performance computing (i.e., large-scale computing) because the testbeds are typically “small scale”—often too small (either in terms of available memory or processing power) to perform the full-scale computation.
In advanced architectural testbeds, it is often the case that both the software running on the system and the system itself are evolving and being developed concurrently. So, there remains a challenge in understanding where the boundaries or limitations lie.
For more exotic technologies, such as quantum computing, scientists must “simulate” the system because there are likely no existing machines with which to experiment. In these cases, scientists must take great care to ensure that the simulation tools they develop capture all salient characteristics of the final machine.
New and Future Developments in Advanced Computing Testbeds
Today, there is no consensus on post-exascale advanced computing. The increasing importance of very large datasets is changing the problems that advanced computers solve.
New applications frequently combine traditional scientific computing (simulating physical systems with numerical methods), large data analytics, and machine learning (ML), a branch of artificial intelligence (AI) and computer science focusing on the use of data and algorithms to imitate the way humans learn.
ML represents a new class of computation different from scientific computing. Scientists are not just accelerating the performance of scientific computing; they are converging it with ML.
One possibility for advanced computing is that rather than emphasizing zettascale (1,000 times exascale), the solution space becomes more fragmented and involves customization for different problem domains. Common categories include ML, data analytics, and quantum computing.
There is much interest in customizing compute units, making advanced computing testbeds ever more in demand. The Department of Energy (DOE) Office of Science has invested in testbeds for years at national laboratories around the country.
Pacific Northwest National Laboratory and Advanced Computing Testbeds
Critical to this effort is the Center for Advanced Technology Evaluation, CENATE, at Pacific Northwest National Laboratory (PNNL). Established in 2015 as a first-of-its-kind computing proving ground, it’s played a critical role in vetting next-generation, extreme-scale supercomputers.
CENATE scientists conduct research in a complex laboratory setting that allows for measuring performance, power, reliability, and thermal effects. It’s essential in evaluating emerging technologies, which PNNL is also developing.
CENATE, funded by DOE’s Office of Science, evaluates complete system solutions and individual subsystem component technologies, from pre-production boards to full nodes and systems that pave the way to larger-scale production. It has helped in the development of accelerators, like those that will be found in the three upcoming exascale computers, including Frontier at Oak Ridge National Laboratory.
CENATE takes advanced technology evaluations out of isolation, providing a central point for these once-fragmented investigations. It follows a user-facility model in which other national laboratories and technology providers can access CENATE resources and share in the integrated evaluation and prediction processes that benefit computing research.
CENATE works closely with technology developers to understand how computing trends will affect the future marketplace and helps domain scientists understand how novel computing technologies can be most effectively used by scientific workloads. It also allows them to understand the impacts computing technology will have on system software.
With cybersecurity taking center stage, CENATE’s focus extends beyond high-performance computing into other areas of DOE’s computing mission space, including distributed computing, the Internet of Things, future 5G, and other advanced wireless network technologies.
CENATE is not PNNL’s only effort in this area. The laboratory’s Data-Model Convergence (DMC) Initiative is a multidisciplinary effort to create the next generation of scientific computing capabilities through a software and hardware co-design methodology.
PNNL scientists believe the next big advancement in computing will be found in our ability to seamlessly integrate scientific modeling and simulation with data analytics methods that include AI/ML and graph analytics.
Such innovations will push computing beyond its current capabilities to obtain orders of magnitude improvement in efficiency—enabling fundamentally new and transformational science. DMC will be co-designed to tackle challenging problems associated with the analysis and control of the electric power grid and accelerate scientific discovery across a broad range of biology, chemistry, materials, and energy application domains.