Scientific Process Automation: Enhancing Scientific Analysis and Discovery
SPA develops automated workflows for a variety of computational science domains
Results: A scientific workflow infrastructure developed through the Scientific Process Automation (SPA) thrust area is significantly improving scientists' ability to use computational resources effectively and analyze their data. Over the past 8 years, the SPA team, a multi-institutional team led by Terence Critchlow of Pacific Northwest National Laboratory, has developed a useful and usable scientific workflow infrastructure that automatically performs many of the repetitive tasks scientists currently carry out by hand. This technology has been successfully applied in a variety of computational science domains including fusion, astrophysics, biology, climate modeling, groundwater, and combustion. Wherever it has been applied, the technology has allowed scientists to focus on their science instead of the complexities of the underlying data management.
SPA is a component of the SciDAC Scientific Data Management (SDM) Center. The center is focused on improving scientists' ability to interact with their data in three key areas: reading and writing data through Storage Efficient Access; analysis of large data sets to find features of interest using Data Mining and Analysis; and automation of the overall simulation and analysis process through SPA.
Why it matters: Effectively generating, managing, and analyzing scientific data requires a comprehensive, end-to-end approach, from the initial data acquisition to the final analysis of the data. Unfortunately, the various forms of data manipulation typically consume up to 80 percent of the computational time allotted for a simulation, which can leave as little as 20 percent for actual scientific analysis and discovery. The SPA project is developing solutions and products for effective and efficient modeling, design, configuration, execution, and reuse of scientific workflows. The team has already deployed workflows that allow near real-time monitoring of complex tasks such as the execution of large simulation codes and the analysis of the resulting data. The result is a significant improvement in scientists' ability to use computational resources effectively and analyze their data.
Methods: A scientific workflow is the formalization of a scientific process that is performed frequently and repetitively. The challenge for the SPA team is to convert these manual, time-intensive processes into workflows that can be executed with minimal or no supervision. This is accomplished through the use of a workflow engine such as Kepler, in which an executable version of the process is defined, configured, and run. Kepler acts as the orchestrator of the workflow: it coordinates the transfer of data to the individual components (called actors) that perform specific tasks, and it coordinates the execution of those components.
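The orchestration idea described above can be illustrated with a minimal Python sketch. This is not Kepler's API (Kepler workflows are defined in its own Java-based environment); the function `run_workflow`, the actor names, and the data are all hypothetical and chosen only for illustration. The sketch assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter


def run_workflow(actors, dependencies):
    """Run each actor once all of its upstream actors have produced output.

    actors:       maps an actor name to a callable that takes a dict of
                  upstream outputs and returns this actor's output.
    dependencies: maps an actor name to the set of actor names it reads from.
    """
    results = {}
    # static_order() yields every actor after all of its predecessors.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = actors[name](upstream)
    return results


# Hypothetical three-step process: simulate, move the data, analyze it.
actors = {
    "simulate": lambda upstream: [1.0, 2.0, 3.0],
    "transfer": lambda upstream: list(upstream["simulate"]),
    "analyze": lambda upstream: sum(upstream["transfer"]) / len(upstream["transfer"]),
}
dependencies = {
    "simulate": set(),
    "transfer": {"simulate"},
    "analyze": {"transfer"},
}

results = run_workflow(actors, dependencies)
print(results["analyze"])
```

The orchestrator, not the individual actors, decides when each step runs and which data it receives. That separation is what lets a manual sequence of steps run unattended.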
SPA uses the Kepler workflow engine because:
- Kepler is an open source workflow environment built on the Ptolemy II engine, which has been available for more than a decade. Having a generally distributable and extensible environment is a key requirement for SciDAC.
- There is a strong developer community focused on enhancing and extending the Kepler infrastructure. This community, consisting primarily of researchers working on projects funded by DOE and NSF, includes significant support outside of UC Berkeley (where Ptolemy was initially created). This activity dramatically increases the longevity of the technology.
- Kepler provides multiple mechanisms for control flow and dataflow within the workflow engine. It is easy to perform tasks in parallel, or to force sequential execution of specific steps. This flexibility is important in scientific environments.
- Kepler allows nested workflows, which enables workflow designers to provide appropriate levels of abstraction.
- Kepler is written in Java and is thus portable across most computational platforms.
- Kepler has a GUI that allows workflow developers to graphically define workflows. This increases the accessibility and potential user base of the tool.
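Two of the properties listed above, the choice between parallel and sequential execution and the nesting of workflows, can be sketched in a few lines of Python. This is an illustrative analogy under stated assumptions, not Kepler code: the helpers `sequential` and `parallel` and the steps `clean`, `mean`, and `peak` are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential(*steps):
    """Combine steps into one actor that runs them strictly in order."""
    def composite(value):
        for step in steps:
            value = step(value)
        return value
    return composite


def parallel(*branches):
    """Combine independent branches into one actor that runs them concurrently."""
    def composite(value):
        with ThreadPoolExecutor(max_workers=len(branches)) as pool:
            futures = [pool.submit(branch, value) for branch in branches]
            # Results are collected in the order the branches were given.
            return [future.result() for future in futures]
    return composite


# Hypothetical analysis steps.
def clean(data):
    return [x for x in data if x is not None]


def mean(data):
    return sum(data) / len(data)


def peak(data):
    return max(data)


# A nested workflow: the parallel block is itself one step of the sequence.
analysis = sequential(clean, parallel(mean, peak))
summary = analysis([4.0, None, 8.0, 6.0])
print(summary)
```

Because each combinator returns an ordinary actor, a whole sub-workflow can be dropped into a larger one as a single step, which is the kind of abstraction that nesting provides.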
In addition to developing and deploying workflows for specific scientific tasks, the SPA team has focused on extending the Kepler workflow environment in four areas: data provenance, generic actors, fault tolerance, and a web-based dashboard environment.
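To make two of these areas concrete, the following Python sketch shows one simple way fault tolerance and provenance capture could work together: a failed step is retried, and every attempt is recorded. It is a sketch under assumptions, not the SPA team's implementation or Kepler's provenance format; `run_with_retry`, `flaky_transfer`, the record fields, and the file path are all hypothetical.

```python
import time


def run_with_retry(actor, value, provenance, max_attempts=3, delay_seconds=0.0):
    """Run an actor, retrying on failure and logging every attempt."""
    for attempt in range(1, max_attempts + 1):
        record = {"actor": actor.__name__, "attempt": attempt, "input": value}
        try:
            output = actor(value)
        except Exception as exc:
            record.update(status="failed", error=repr(exc))
            provenance.append(record)
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            time.sleep(delay_seconds)
        else:
            record.update(status="ok", output=output)
            provenance.append(record)
            return output


# Hypothetical actor that fails once, imitating a transient interruption.
calls = {"count": 0}


def flaky_transfer(path):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("simulated network interruption")
    return f"copied:{path}"


provenance = []
result = run_with_retry(flaky_transfer, "run42/output.dat", provenance)
print(result)
print([entry["status"] for entry in provenance])
```

The provenance list keeps the failed attempt as well as the successful one, so a scientist reviewing the run can see both what was produced and what went wrong along the way.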
What's next: The research efforts in these four areas are continuing. The team is working on further extending the fault-tolerance capabilities within Kepler to allow the workflow to effectively recover from a broader category of interruptions. The team is also working on extending the dashboard to allow modification of the underlying analysis workflow. Eventually, we expect scientists will use this interface to define and initiate complex workflows using templates and wizards.
Acknowledgments: Sponsor: U.S. Department of Energy's SciDAC program.
Key contributors: The SciDAC Scientific Data Management Center is a multi-institutional organization that has brought together leading researchers from a number of universities and national laboratories. The key contributors to the SPA thrust area are:
Arie Shoshani (LBNL): SDM Center Lead PI; Terence Critchlow (PNNL): SPA Thrust Area Lead; Ilkay Altintas (SDSC); Scott Klasky (ORNL); Bertram Ludaescher (UC Davis); Norbert Podhorszki (ORNL); Claudio Silva (Univ. of Utah); Mladen Vouk (NCSU).