Analytical Modeling Projects

Program Objective

Predicting how, and explaining why, scientific workflows achieve the observed end-to-end performance continues to be an outstanding problem for today’s scientists and network operators. Simple estimates that assume a file transfer proceeds at the local link speed fail to take multiple factors (e.g., network paths, protocol behavior, storage system behavior) into account. These simple models will become increasingly inaccurate as link speeds grow beyond 10 Gbps and scientific discovery environments become more complex. In today’s distributed computing environment, even knowledgeable network engineers and host system administrators have difficulty adequately explaining changes in a workflow’s behavior (e.g., a file transfer took three times longer today than it did yesterday). Understanding how workflows perform in tomorrow’s world of 100 Gbps networks, many-core/hybrid-core processors, data-intensive instruments, and hierarchical storage systems requires a renewed effort to develop realistic models of this distributed computing environment.
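
The gap between such back-of-the-envelope estimates and wide-area reality can be made concrete with a small sketch (an illustration, not part of the FOA). It contrasts the naive size-over-link-speed estimate with a TCP-aware bound, the Mathis et al. throughput formula (throughput ≤ (MSS/RTT)·1/√loss); all file sizes, round-trip times, and loss rates below are hypothetical.

```python
import math

def naive_seconds(file_bytes, link_gbps):
    """Naive model: transfer time = file size / local link speed."""
    return file_bytes * 8 / (link_gbps * 1e9)

def mathis_seconds(file_bytes, mss_bytes, rtt_s, loss_rate):
    """TCP-aware model using the Mathis et al. steady-state bound:
    throughput <= (MSS / RTT) * 1 / sqrt(loss_rate)."""
    throughput_bps = (mss_bytes * 8 / rtt_s) / math.sqrt(loss_rate)
    return file_bytes * 8 / throughput_bps

size = 100 * 1e9  # a hypothetical 100 GB file

# On a clean 10 Gbps link the naive estimate is ~80 s ...
t_naive = naive_seconds(size, 10)

# ... but over a WAN with 50 ms RTT and 0.01% loss, a single TCP
# stream is bounded far below link speed, so the transfer takes
# hundreds of times longer than the naive model predicts.
t_wan = mathis_seconds(size, 1460, 0.05, 1e-4)
```

The point of the sketch is only that a single unmodeled factor (here, loss on a long path) changes the prediction by orders of magnitude; realistic models must account for many such factors at once.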

The purpose of this Funding Opportunity Announcement (FOA) is to solicit applications that will significantly enhance our ability to predict how scientific workflows perform and to explain workflow behavior on distributed extreme-scale infrastructures built from computers, storage, instruments and multi-gigabit networks.

Funded Projects:

Title:  Panorama: Predictive Modeling and Diagnostic Monitoring of Extreme Science Workflows
PI: E. Deelman – USC/ISI
Website: https://sites.google.com/site/panoramaofworkflows/
Abstract:
Scientific workflows are now being used in a number of scientific domains including astronomy, bioinformatics, climate modeling, earth science, civil engineering, physics, and many others. Unlike monolithic applications, workflows often run across heterogeneous resources distributed over wide area networks. Some workflow tasks may require high-performance computing resources, while others can run efficiently on high-throughput computing systems. Workflows also access data from potentially different data repositories and use data, often represented as files, to communicate between workflow components. As a result of these data access patterns, workflow performance can be greatly influenced by the performance of networks and storage devices.

Up to now, workflow performance studies have focused on modeling and measuring the performance of individual tasks, primarily taking into account the behavior of computational tasks while often ignoring data management jobs. In turn, much work in workflow scheduling and resource provisioning for workflow applications has focused on managing computations, to some degree ignoring data movement and storage. Today’s workflow monitoring tools again focus primarily on task monitoring, treating data management tasks as black boxes. At the same time, network monitoring tools have focused on low-level network performance that cannot easily be correlated with the workflows utilizing the network.
The main questions this work aims to address are: 1) how can analytical models be developed that predict the behavior of complex, data-aware scientific workflows executing on extreme-scale infrastructures? 2) what monitoring information and analysis are needed for performance prediction and anomaly detection in scientific workflow execution? and 3) how can the workflow execution and the infrastructure be adapted to achieve the performance predicted by the models? These questions will be addressed within the context of two DOE applications: Climate and Earth System Modeling (CESM), which processes large amounts of community data, and the Spallation Neutron Source (SNS), which produces rich experimental data used in a variety of complex analyses.

The program of research focuses on data-aware workflow performance modeling, monitoring, and analysis and integrates this diverse information into knowledge about workflow behavior that can inform the scientist and the infrastructure providers about the observed performance issues and their causes. This work will develop end-to-end workflow-level analytical models that capture the behavior of the workflow tasks performance on a variety of systems as well as workflow data movement and storage across different networks and devices. The analytical models will be coupled with simulation-based models to increase fidelity of the predictions in dynamic environments. The models will be validated through experiments on DOE infrastructures (such as the ESnet testbed and production infrastructure, the ORNL facilities), on distributed testbeds like ExoGENI, and through simulations.

The work will result in analytical models, workflow-level monitoring tools and monitoring recommendations for existing tools, which capture not only computational task behavior but also that of the data transfer and storage activities in the workflow. An analysis capability will correlate workflow monitoring information with resource performance measurements to provide a better understanding of which resources contributed to the observed behavior. The analytical models will be used to guide anomaly detection and diagnosis, resource management and adaptation, and infrastructure design and planning.

Title:  RAMSES: Robust Analytical Models for Science at Extreme Scale
PI:  I. Foster - ANL
Website: https://sites.google.com/site/ramsesdoeproject/home
Abstract:
The RAMSES project aims to develop a new science of end-to-end analytical performance modeling that will transform understanding of the behavior of science workflows in extreme-scale science environments. Much as modern design tools permit digital pre-assembly and simulated flight of new aircraft before they are built, predictive control during flight, and predictive trending for failure detection, our new methods will provide a basis for tools that allow developers and users of science workflows and operators of science facilities to predict, explain, and optimize.

Modern science is increasingly data-intensive, compute-intensive, distributed, and collaborative. The science workflows that underpin a typical project often couple multiple instruments, data stores, computational tools, and people. The performance of these science workflows has a profound impact on both the pace of scientific discovery and the return on investment from DOE’s multi-billion-dollar facilities budget. Yet our ability to predict the behavior of a science workflow before it is implemented, to explain why performance does not meet design goals, and to architect science environments to meet workflow needs are all grossly inadequate. These are the challenges that RAMSES will address.

The foundation for this new science of workflow modeling is the fusion of multiple distinct threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. What makes this fusion possible is the unique multidisciplinary expertise of the RAMSES team and an unprecedented experimental program that will collect data on a broad range of model components and end-to-end application behaviors. The resulting data will be stored in a new performance database, along with estimated parameters and other information on model and workflow performance. Together, these program elements will allow us to evaluate existing and new models, synthesize new models, and assess model composition methods with a breadth and rigor not previously possible.
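
The "data-driven methods for evaluating and calibrating models" mentioned above can be illustrated with a minimal sketch (an illustration of the general idea, not RAMSES code): fitting the free parameters of a simple first-principles transfer model, T(s) = startup + s/bandwidth, to observed (size, time) measurements by ordinary least squares. The model form and the observations are invented for the example.

```python
def calibrate(samples):
    """Fit T(s) = startup + s / bandwidth to (bytes, seconds) pairs
    by closed-form ordinary least squares.
    Returns (startup_seconds, bandwidth_bytes_per_second)."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # 1 / bandwidth
    intercept = (sy - slope * sx) / n                  # startup cost
    return intercept, 1.0 / slope

# Synthetic observations generated from a 2 s startup cost and a
# 1 GB/s effective bandwidth; a real calibration would use the
# project's performance database instead.
obs = [(s, 2.0 + s / 1e9) for s in (1e9, 5e9, 10e9, 50e9)]
startup, bw = calibrate(obs)
```

Once calibrated, such a component model can be re-evaluated against fresh measurements, which is one simple form of the evaluate/calibrate/synthesize loop the abstract describes.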

We intend that RAMSES research will not only advance knowledge but also provide a path to transforming how science workflows and science infrastructures are designed, implemented, and operated across the DOE science complex. To this end, we will prototype a suite of end-to-end analytical performance models, experimental testing tools, and numerical methods. We will also develop a performance advisor that will provide actionable feedback regarding achieved performance. We will apply these tools to a range of performance analysis problems, including prediction, explanation, and optimization of performance metrics for the Globus file transfer service. This work will both showcase what RAMSES technologies can do and allow us to obtain feedback from our stakeholder community.

The potential benefits of RAMSES research for science and technology are considerable. New modeling capabilities that enable better understanding of workflow and infrastructure performance have the potential to accelerate discovery across dozens of DOE science facilities that are used by tens of thousands of researchers from across the nation. To give just two examples: acceleration of a data analysis workflow used at a DOE experimental facility can permit real-time rather than post hoc user feedback, increasing the efficiency of the research performed at that facility by orders of magnitude. Workflow optimizations that reduce the number of CPUs required for analysis allow us to increase the number of people who can profit from leadership computing facilities. Increasingly, workflows comparable to those encountered in extreme-scale science are appearing in industry as well, providing opportunities for yet broader impact.

Title: IPPD: Integrated end-to-end Performance Prediction and Diagnosis for Extreme Scientific Workflows
PI:  D. Kerbyson - PNNL
Website: http://hpc.pnl.gov/projects/IPPD
Abstract:
It is increasingly difficult to design, analyze, and implement large-scale workflows for scientific computing, especially in situations where time-critical decisions must be made. Workflows are designed to execute on a loosely connected set of distributed and heterogeneous computational resources. Each computational resource may have vastly different capabilities, ranging from sensors to high-performance clusters. Frequently, workflows are composite applications built from loosely connected parts. Each task of a workflow may be designed for a different programming model and implemented in a different language. Most workflow tasks communicate via files sent over general-purpose networks. As a result of this complex software and execution space, large-scale scientific workflows exhibit extreme performance variability. A clear understanding of the factors that influence their performance is critically important, both for explaining observed behavior and for optimizing their execution.

The performance of a workflow is determined by a wide range of factors. Some are specific to a particular workflow component and include both software factors (application, data sizes, etc.) and hardware factors (compute nodes, I/O, network). Others stem from the combination and orchestration of the different tasks in the workflow, including the workflow engine, the mapping of the workflow onto the distributed resources, coordination of tasks and data organization across programming models, and workflow component interaction.

IPPD addresses three core issues in order to provide insights into workflow execution that can be used both to explain and to optimize it: 1) providing an expectation of a workflow's performance in advance of execution, establishing a best-case baseline; 2) identifying areas of consistently low performance and diagnosing their causes; and 3) studying the important issue of performance variability. The design and analysis of large-scale scientific workflows is difficult precisely because each task can exhibit extreme performance variability. New prediction and diagnostic methods are required to enable efficient use of present and emerging workflow resources.

An integrated approach to the prediction and diagnosis of extreme-scale scientific workflows is being taken that focuses on the problems of complexity and of contention. Our approach will enable both exploration-in-advance and optimization-at-runtime of workflows and their components. Underlying this is a multi-scale view: simulation-based tools enable fine-grained, in-depth analysis of individual workflow components, while, through suitable abstractions, analytical prediction enables end-to-end workflow analysis with extended coverage and rapid evaluation. Fine-level analyses allow individual workflow components to be explored under a multitude of resource-use scenarios and provide the insights necessary to guide their run-time optimization. Coarse-grained analyses allow an entire workflow to be explored and initially optimized. Machine learning, based on adaptive sampling, will focus analysis on the regions of the parameter space that have the greatest influence on performance. Workflow-specific benchmarking will provide input to our modeling and simulation techniques. Provenance information, collected through instrumentation and monitoring of actual workflow executions, will provide empirical bounds on the expected performance.
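
The adaptive-sampling idea can be sketched in miniature (an illustration of the general technique, not IPPD code): evaluate an expensive performance function at a coarse set of points, then repeatedly add samples where a simple surrogate disagrees most with the true function, so evaluations concentrate where the parameter has the greatest influence. The performance function below is a made-up stand-in for a real workflow run.

```python
import bisect

def perf(x):
    """Stand-in for an expensive workflow measurement; varies
    sharply near x = 0 and is nearly flat elsewhere."""
    return 1.0 / (0.05 + x * x)

def adaptive_sample(f, lo, hi, rounds):
    """Greedy adaptive refinement: in each round, bisect the interval
    whose midpoint is worst predicted by linear interpolation."""
    xs = [lo, (lo + hi) / 2, hi]
    for _ in range(rounds):
        best_err, best_mid = -1.0, None
        for a, b in zip(xs, xs[1:]):
            mid = (a + b) / 2
            err = abs(f(mid) - (f(a) + f(b)) / 2)  # surrogate error
            if err > best_err:
                best_err, best_mid = err, mid
        bisect.insort(xs, best_mid)
    return xs

points = adaptive_sample(perf, -1.0, 1.0, 20)
# Samples cluster near x = 0, where performance varies fastest.
```

A production approach would use a statistical surrogate in many dimensions, but the budget-allocation logic is the same: spend measurements where the model is least certain.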

Title:  X-SWAP: Extreme-Scale Scientific Workflow Analysis and Prediction
PI:  E. Strohmaier - LBNL
Website: https://sites.google.com/a/lbl.gov/x-swap/home
Abstract:
The Extreme-Scale Scientific Workflow Analysis and Prediction (X-SWAP) project is developing an integrated analytic performance modeling framework for distributed extreme-scale science workflows.  These performance models are designed to reflect observed end-to-end performance of current large scientific workflows and predict performance for future extreme scale workflows.

To develop our framework we focus on experiments in the areas of light sources (ALS, LCLS), astronomical surveys (PTF, LSST), and genomic sequence production (JGI).
These workflows provide test cases for validating the models on well-understood use cases and enable predictions for future, much larger use cases in a broad array of science areas.

The modeling framework is built upon performance models developed specifically for individual workflow components such as computing, data transfer, and data access. Component models are constructed with well-defined interfaces between each other to support aggregation into end-to-end workflow performance models. Verification of component and aggregate models on current workflows is an integral part of our development process, supported by our two testbeds at NERSC and ESnet. The workflow models developed, together with their data-collecting instrumentation, will enable workflow monitoring and analysis tools and lead us towards better understanding and prediction of the level of science that a given workflow can deliver.
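
The aggregation idea can be sketched with a toy example (illustrative only; the component models, interface, and numbers are invented, not taken from X-SWAP): per-component models for computing, data transfer, and data access each report seconds contributed to the critical path, and a sequential end-to-end model composes them by summation.

```python
def compute_time(flop, flops_rate):
    """Computing component model: work / sustained rate."""
    return flop / flops_rate

def transfer_time(nbytes, bandwidth_bps, startup_s=0.5):
    """Data transfer component model: fixed startup plus wire time."""
    return startup_s + nbytes * 8 / bandwidth_bps

def access_time(nbytes, io_bytes_per_s):
    """Data access component model: volume / storage throughput."""
    return nbytes / io_bytes_per_s

def end_to_end(stage_times):
    """Aggregate model: for a sequential pipeline the well-defined
    interface is 'seconds on the critical path', so stages sum."""
    return sum(stage_times)

# A toy instrument-to-analysis pipeline: read 1 TB, ship it, analyze it.
t = end_to_end([
    access_time(1e12, 5e9),     # read from instrument storage at 5 GB/s
    transfer_time(1e12, 10e9),  # move over a 10 Gbps network
    compute_time(1e15, 1e13),   # 1 PFLOP of analysis at 10 TFLOP/s
])
```

Validation then amounts to comparing each component's prediction, and the aggregate, against measurements from the testbeds, replacing or recalibrating whichever component is at fault.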