Customers today find themselves designing large and complex business flows such as image processing, media transcoding, data pipelines, and so on. A trending addition to this list is MLOps.
The traditional approach to such scenarios is either a monolithic design, in which a single component performs all the functionality, or a complex architecture of queues and services with polling.
One downside of a monolithic design is the complexity involved in adding or modifying functionality. Scaling the application is also challenging, and there is little opportunity for reuse.
In a cloud-based world, the application must also be capable of integrating with multiple services. One way to offset these challenges is to design a distributed workflow.
Distributed Workflow
In a distributed workflow, such large and complex business functionality is broken down into smaller components, addressing some of the key downsides of a monolithic design.
Since multiple components work together to deliver a common functionality, there is no single point of failure, and each component can be scaled independently without impacting the others. But how will those components interact with each other?
Orchestrator
An orchestrator is the component of a distributed workflow architecture that maintains the relationships and handles the dependencies between the other components. On top of the added complexity of designing and debugging a distributed architecture, this orchestrator component needs to be built, and it is expected to have the following features:
- Easy to build/develop and modify
- An operational management layer to have a bird’s eye view of the orchestration and the current execution
- Ability to catch and handle errors, along with seamless retry logic
- Tracking and Logging
- Quick integration with many different components and interfaces, with reusable modules
Developing such an orchestrator from scratch is a daunting task, considering the complexity of such a component, the need to provision separate processing and storage capacity, the network latency between components, and so on.
While serverless orchestration can address some of the drawbacks of a custom orchestrator built from scratch, offering increased agility and a lower total cost of ownership, the other complexities persist.
In this article, we will look at AWS Step Functions and how its integration with other AWS services can help overcome the drawbacks of developing a custom-built orchestrator.
AWS Step Functions
AWS Step Functions is a serverless orchestration service that helps design and implement complex business functionality while providing a graphical representation of the workflow.
- A state machine, a core concept of Step Functions, is a collection of states and the relationships between them, defined using a JSON-based language called Amazon States Language (ASL). Yes, the orchestrator is built using a JSON-based language (a minimal ASL sketch follows this list).
- A neat out-of-the-box feature of Step Functions is the visual depiction of the orchestration, as well as of every execution, making it easier to troubleshoot and debug. Below is an example of the execution of a sample state machine from the official AWS documentation.
- Faulty executions are always a hassle in a production environment. State machines can be developed in such a way that errors are handled at individual steps and the workflow is adjusted automatically based on specific errors.
Retries can also be included in the ASL so that the orchestrator can handle certain errors, such as timeouts, on its own. As with error handling, the maximum retry attempts, backoff rate, and interval between retries can all be customized (see the Retry sketch after this list).
- CloudWatch, another key AWS service, integrates with Step Functions for storing as well as tracking logs. CloudWatch Events rules can be easily set up and configured to track execution events and take automated actions accordingly (an example event pattern follows this list).
- While Step Functions is generally seen as a service for orchestrating Lambda functions, ECS tasks, and DynamoDB operations, the addition of support for EMR, SageMaker, and Glue workloads has greatly increased the spectrum of services that can be orchestrated using Step Functions.
Activities, one of the key features of Step Functions, are a way to associate code run by a worker hosted almost anywhere with a specific task in a state machine (a sketch of an activity task is included below).
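To make the ASL concept concrete, here is a minimal sketch of a state machine with a single Lambda-backed Task state followed by a terminal Pass state. The function name and ARN are placeholders, not resources from the pipelines described later in this article.
{
  "Comment": "Minimal sketch: one Lambda task followed by a terminal Pass state",
  "StartAt": "ProcessRecord",
  "States": {
    "ProcessRecord": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:process-record",
      "Next": "Done"
    },
    "Done": {
      "Type": "Pass",
      "End": true
    }
  }
}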
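Here is a sketch of a Retry block attached to a hypothetical Glue task state. The error names are standard ASL error codes, while the job name, attempt count, interval, and backoff rate are illustrative values rather than tuned recommendations.
"ValidateData": {
  "Type": "Task",
  "Resource": "arn:aws:states:::glue:startJobRun.sync",
  "Parameters": { "JobName": "sample_validation_job" },
  "Retry": [
    {
      "ErrorEquals": ["States.Timeout", "States.TaskFailed"],
      "IntervalSeconds": 30,
      "MaxAttempts": 3,
      "BackoffRate": 2.0
    }
  ],
  "Next": "ProcessData"
}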
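As a sketch of the kind of rule mentioned above, the following CloudWatch Events (EventBridge) event pattern matches Step Functions executions that fail or time out. The rule's target, for example an SNS topic or a remediation Lambda function, is configured separately and is not part of the pattern itself.
{
  "source": ["aws.states"],
  "detail-type": ["Step Functions Execution Status Change"],
  "detail": {
    "status": ["FAILED", "TIMED_OUT"]
  }
}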
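Finally, here is a sketch of a Task state backed by an activity. The activity ARN and state name are placeholders; the worker, which polls the activity for work with the GetActivityTask API and reports back with SendTaskSuccess or SendTaskFailure, runs wherever you choose to host it.
"ManualQualityCheck": {
  "Type": "Task",
  "Resource": "arn:aws:states:us-east-1:123456789012:activity:quality-check",
  "TimeoutSeconds": 3600,
  "HeartbeatSeconds": 120,
  "End": true
}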
Supporting tools for Step Functions:
- Steppygraph (Python package):
A concise, easy-to-read, method-chaining style of declaration. It can be used to construct and run workflows dynamically through a programmatic approach.
- AWS Step Functions Data Science SDK (Python package):
Create workflows that process and publish machine learning models using Amazon SageMaker and AWS Step Functions. The workflow can be created and executed directly from an Amazon SageMaker notebook, which can also be used to check the progress of the workflow.
Let us now look at some characteristics of the service that are considered drawbacks:
- There is no out-of-the-box feature to restart a state machine execution from the failed state.
- The maximum request size is 1 MB per request.
Below is our experience in leveraging Step Functions for two popular use cases.
- Data Pipeline:
One of our customer requirements was to visualize the billing data from the various departments in their organization. We went with Step Functions as our workflow orchestrator to build the data pipeline efficiently, as the error-handling features of Step Functions would be vital.
As soon as the data is available in the S3 bucket, a state machine execution is started by a Lambda function triggered by S3 event notifications. The data is validated and pre-processed by a couple of Glue Python shell jobs. A Glue Spark job then aggregates and processes the data and stores it as partitioned Parquet files in S3. The Spark job also triggers a Glue crawler to crawl the data in S3 and update the Glue Data Catalog. Upon completion of the workflow, a notification is sent to the stakeholders using SNS. All these steps are orchestrated by Step Functions, catching the possible errors in every step and notifying the final output.
We used QuickSight as our visualization tool, with Athena configured as the data source, which connects to the processed data in S3 through the Glue Data Catalog.
ASL code of the State Machine
{
  "Comment": "Automation workflow using Lambda, Glue and SNS",
  "StartAt": "GlueShell-Validator",
  "States": {
    "GlueShell-Validator": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "cur_pre_validator",
        "Timeout": 15,
        "MaxCapacity": 1,
        "Arguments": {
          "--source_bucket.$": "$.source_bucket",
          "--object_key.$": "$.object_key"
        }
      },
      "ResultPath": "$.cur_pre_validator_output",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "Failure"
        }
      ],
      "Next": "GlueShell-PreProcessor"
    },
    "GlueShell-PreProcessor": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "cur_pre_processor",
        "Timeout": 15,
        "MaxCapacity": 1,
        "Arguments": {
          "--source_bucket.$": "$.source_bucket",
          "--object_key.$": "$.object_key"
        }
      },
      "ResultPath": "$.cur_pre_processor_output",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "Failure"
        }
      ],
      "Next": "GlueSpark-DataProcessor"
    },
    "GlueSpark-DataProcessor": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": {
        "JobName": "cur_and_utilization_data_processor",
        "Timeout": 60,
        "MaxCapacity": 2,
        "Arguments": {
          "--source_bucket.$": "$.source_bucket",
          "--object_key.$": "$.object_key"
        }
      },
      "ResultPath": "$.cur_and_utilization_data_processor_output",
      "Catch": [
        {
          "ErrorEquals": ["States.ALL"],
          "Next": "Failure"
        }
      ],
      "Next": "SendNotification"
    },
    "Failure": {
      "Type": "Pass",
      "Result": "The state machine has errored out!!!",
      "Next": "SendNotification"
    },
    "SendNotification": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "<<SNS Topic ARN>>",
        "Message.$": "$"
      },
      "End": true
    }
  }
}
- MLOps Pipeline:
Another customer requirement was to generate forecast reports every three months: import the latest dataset into Amazon Forecast, run the predictor model and, after a significant wait for the predictor to complete, trigger a forecast export job to deliver the data to S3. To avoid handling this wait time manually, we built a Step Functions workflow, triggered when the datasets are loaded onto S3, for generating, testing, and comparing Amazon Forecast predictors and forecasts.
An Amazon S3 event notification triggers a Lambda function to start the execution of the state machine. The state machine creates the dataset group, imports the dataset from S3 into Amazon Forecast, runs the predictor model, and then runs the forecast to export the report to S3. Once the forecast export is done, Amazon Simple Notification Service (SNS) notifies the administrators with the results of the Step Functions execution.
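The heart of that workflow is a standard Wait/Choice polling loop around the long-running Forecast steps. Below is a simplified sketch of that pattern, assuming hypothetical Lambda functions (create-predictor, describe-predictor, export-forecast) that wrap the corresponding Amazon Forecast API calls; the ARNs, wait interval, and status value are illustrative, not the exact resources from our implementation.
{
  "Comment": "Sketch: poll a long-running Amazon Forecast predictor before exporting",
  "StartAt": "CreatePredictor",
  "States": {
    "CreatePredictor": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:create-predictor",
      "Next": "WaitForPredictor"
    },
    "WaitForPredictor": {
      "Type": "Wait",
      "Seconds": 300,
      "Next": "CheckPredictorStatus"
    },
    "CheckPredictorStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:describe-predictor",
      "ResultPath": "$.predictor_status",
      "Next": "PredictorReady?"
    },
    "PredictorReady?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.predictor_status",
          "StringEquals": "ACTIVE",
          "Next": "ExportForecast"
        }
      ],
      "Default": "WaitForPredictor"
    },
    "ExportForecast": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:export-forecast",
      "End": true
    }
  }
}
The Default branch of the Choice state loops back to the Wait state, so the state machine keeps polling until the status check reports the predictor as ACTIVE, at which point the export step runs.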