Analyzing Instances

Workflow execution instances have been widely used to profile and characterize workflow executions, and to build distributions of workflow execution behaviors, which are used to evaluate methods and techniques in simulation or in real conditions.

The WfCommons project targets the analysis of actual workflow execution instances (i.e., the workflow execution profile data and characterizations) in order to build Workflow Recipes of workflow applications. These recipes contain the necessary information for generating synthetic, yet realistic, workflow instances that resemble the structure and distribution of the original workflow executions.

A continuously updated list of workflow execution instances compatible with WfFormat is available on our project website.

WfInstances

A workflow execution instance represents an actual execution of a scientific workflow on a distributed platform (e.g., clouds, grids, or HPC systems). In the WfCommons project, an instance is represented in a JSON file following the schema described in WfFormat. This Python package provides an instance loader tool for importing workflow execution instances for analysis. For example, the code snippet below shows how an instance can be loaded using the Instance class:

import pathlib
from wfcommons import Instance
input_instance = pathlib.Path('/path/to/instance/file.json')
instance = Instance(input_instance=input_instance)

The Instance class provides a number of methods for interacting with the workflow instance, including:

draw(): produces an image or a PDF file representing the instance.
leaves(): gets the leaves of the workflow (i.e., the tasks without any successors).
roots(): gets the roots of the workflow (i.e., the tasks without any predecessors).
write_dot(): writes a DOT file of the instance.
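
To make the definitions of roots and leaves concrete, the sketch below computes them for a toy task graph using only the Python standard library. It does not call WfCommons: the parents mapping and the helper names find_roots and find_leaves are hypothetical and serve only to illustrate the definitions given in the list above.

```python
# Toy workflow: each task maps to the list of its predecessors (parents).
# This structure is an illustrative assumption, not the WfFormat schema.
parents = {
    "split": [],
    "process_1": ["split"],
    "process_2": ["split"],
    "merge": ["process_1", "process_2"],
}


def find_roots(parents):
    """Return the tasks without any predecessors."""
    return sorted(task for task, preds in parents.items() if not preds)


def find_leaves(parents):
    """Return the tasks without any successors."""
    # A task has a successor if it appears in some other task's parent list.
    has_successor = {p for preds in parents.values() for p in preds}
    return sorted(task for task in parents if task not in has_successor)


print(find_roots(parents))   # ['split']
print(find_leaves(parents))  # ['merge']
```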

Note

Although the analysis methods are inherently used by WfCommons (specifically WfChef) for Generating Workflows Recipes, they can also be used in a standalone manner.

The Instance Analyzer

The InstanceAnalyzer class provides a number of tools for analyzing collections of workflow execution instances. Its goal is to perform analyses of one or multiple workflow execution instances and to build summaries of these analyses per workflow task type prefix.

Warning

Although any workflow execution instance represented as an Instance object (i.e., compatible with WfFormat) can be appended to the InstanceAnalyzer, we strongly recommend appending only instances of a single workflow application type to an analyzer object. You may, however, create several analyzer objects, one per workflow application.

The append_instance() method adds an instance to the analysis. The build_summary() method processes all appended instances: it fits probability distributions to each series of data and selects the best fit (i.e., the one that minimizes the mean square error). The method returns a summary of the analysis as a Python dictionary in which keys are task prefixes (provided when invoking the method) and values describe the best probability distribution fit for the tasks’ runtimes and for their input and output file sizes. The code excerpt below shows an example of an analysis summary with the best-fit probability distribution for the runtime of the individuals tasks (1000Genome workflow):

"individuals": {
    "runtime": {
        "min": 48.846,
        "max": 192.232,
        "distribution": {
            "name": "skewnorm",
            "params": [
                11115267.652937062,
                -2.9628504044929433e-05,
                56.03957070238482
            ]
        }
    },
    ...
}
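
The idea of selecting the distribution that minimizes the mean square error can be sketched with the standard library alone. The sketch below is not the WfCommons implementation: the three candidate distributions, the helper names empirical_cdf, candidate_cdfs and best_fit, and the choice to measure the error between the empirical and candidate cumulative distribution functions are all assumptions made for illustration. The actual method may consider other candidates and measure the error differently.

```python
import math
import statistics


def empirical_cdf(sorted_data):
    """Empirical CDF value at each point of an already sorted sample."""
    n = len(sorted_data)
    return [(i + 1) / n for i in range(n)]


def candidate_cdfs(data):
    """Build CDFs of a few candidates, parameterized from the sample.

    Assumes the sample contains at least two distinct values.
    """
    mean = statistics.fmean(data)
    stdev = statistics.stdev(data)
    low, high = min(data), max(data)
    normal = statistics.NormalDist(mean, stdev)
    return {
        "norm": normal.cdf,
        "uniform": lambda x: (x - low) / (high - low),
        "expon": lambda x: 1.0 - math.exp(-(x - low) / (mean - low)),
    }


def best_fit(data):
    """Return the candidate name with the lowest mean square error."""
    xs = sorted(data)
    target = empirical_cdf(xs)
    errors = {}
    for name, cdf in candidate_cdfs(data).items():
        squared = [(cdf(x) - t) ** 2 for x, t in zip(xs, target)]
        errors[name] = statistics.fmean(squared)
    return min(errors, key=errors.get), errors


# Evenly spaced values are best described by the uniform candidate.
best, errors = best_fit([50 + i for i in range(100)])
print(best)
```

Comparing cumulative distribution functions avoids choosing a histogram bin width, which keeps the sketch short; it should not be read as a statement about how the analyzer computes its error.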

Workflow analysis summaries are used by WfChef to develop Workflow Recipes, which in turn are used to generate realistic synthetic workflow instances.

Probability distribution fits can also be plotted by using the generate_fit_plots() or generate_all_fit_plots() methods; plots are saved as PNG files.

Examples

The following example analyzes a set of instances of a Seismology workflow stored in a local folder. It finds the best probability distribution fit for two task prefixes of the Seismology workflow (sG1IterDecon and wrapper_siftSTFByMisfit) and generates all fit plots (runtime, and input and output files) in the fits folder, using seismology as the prefix for each generated plot:

import pathlib
from wfcommons import Instance, InstanceAnalyzer

# obtaining list of instance files in the folder
INSTANCES_PATH = pathlib.Path('/path/to/some/instance/folder/')
instance_files = [f for f in INSTANCES_PATH.glob('*') if f.is_file()]

# creating the instance analyzer object
analyzer = InstanceAnalyzer()

# appending instance files to the instance analyzer
for instance_file in instance_files:
    instance = Instance(input_instance=instance_file)
    analyzer.append_instance(instance)

# list of workflow task name prefixes to be analyzed in each instance
workflow_tasks = ['sG1IterDecon', 'wrapper_siftSTFByMisfit']

# building the instance summary
instances_summary = analyzer.build_summary(workflow_tasks, include_raw_data=True)

# generating all fit plots (runtime, and input and output files)
analyzer.generate_all_fit_plots(outfile_prefix='fits/seismology')
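
Once a summary has been built, it can be inspected like any other Python dictionary. The sketch below assumes a summary shaped like the 1000Genome excerpt shown earlier; the summary literal is a stand-in copied from that excerpt rather than real analyzer output, and the helper describe_runtime_fits is hypothetical.

```python
import json

# Stand-in for the dictionary returned by build_summary(); the values are
# copied from the 1000Genome excerpt, not produced by this code.
summary = {
    "individuals": {
        "runtime": {
            "min": 48.846,
            "max": 192.232,
            "distribution": {
                "name": "skewnorm",
                "params": [
                    11115267.652937062,
                    -2.9628504044929433e-05,
                    56.03957070238482,
                ],
            },
        },
    },
}


def describe_runtime_fits(summary):
    """Return one line per task prefix naming its best-fit runtime distribution."""
    lines = []
    for prefix, metrics in summary.items():
        runtime = metrics["runtime"]
        name = runtime["distribution"]["name"]
        lines.append(f"{prefix}: {name} on [{runtime['min']}, {runtime['max']}]")
    return lines


for line in describe_runtime_fits(summary):
    print(line)

# A dictionary holding only numbers, strings and lists can be serialized
# for later reuse.
serialized = json.dumps(summary, indent=4)
```

Serialization as shown works only if every value in the summary is JSON-compatible; whether that holds for a summary built with include_raw_data=True is not stated here and should be checked before relying on it.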