Using AWS’s Simple Workflow Service (SWF) with C#

Amazon's Simple Workflow Service (SWF) offers a model of workflow that is simple to understand, but is it simple to put a robust and durable workflow in place? Tom Fischer guides you through the bewildering early stages of your first SWF application, and concludes that workflows inherently take time and effort to get right, but that SWF provides a formidable cloud-based solution.

The Simple Workflow Service (SWF) provided by Amazon Web Services (AWS) delivers the reliability of queues with the flexibility of workflows. It allows developers to build and consume a formidable workflow engine from custom code.

This article walks through a simple C# application built on SWF and, along the way, discusses several of the service's key concepts.

SWF for Beginners

Developers who are new to AWS may feel intimidated at the prospect of implementing an SWF solution until they have gained some familiarity with Amazon Web Services and prototyped the Simple Workflow Service. This article will help with that, but it breezes past general AWS concerns because plenty of resources, such as Pluralsight and Safari Books Online, cover them. We can’t entirely avoid the preliminaries, so in this section we’ll explain a bit about SWF.

SWF is (Conceptually) Simple

You need only jump over two conceptual hurdles before you can successfully code a workflow solution. First, SWF only processes messages; it does not execute any custom code. Second, custom code creates and responds to SWF messages. By combining these two ideas, you will see that a workflow solution is all about managing messages. This is easily illustrated as follows.

In SWF parlance, decision tasks (which this article calls decision activity types) handle the “What to Do” chores, while activity tasks account for the requested work. Custom code creates and responds to both. The SWF service, for its part, juggles and delivers these two kinds of task.
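To make these two hurdles concrete, here is a minimal in-memory sketch. It is not SWF and calls no AWS API; `MessageServiceSketch`, `ConceptDemo` and their methods are hypothetical. The stand-in "service" only stores and hands out messages, while custom code plays both roles: a decider that chooses what to schedule, and a worker that performs the activity and reports back, which in turn produces the next decision task.

```csharp
using System.Collections.Generic;

// Hypothetical stand-in for SWF: it queues and delivers messages but runs no custom code.
public sealed class MessageServiceSketch
{
    private readonly Queue<string> decisionTasks = new Queue<string>();
    private readonly Queue<string> activityTasks = new Queue<string>();

    public void ScheduleDecision(string history) => decisionTasks.Enqueue(history);
    public void ScheduleActivity(string activityName) => activityTasks.Enqueue(activityName);

    public bool TryPollDecision(out string history) => TryTake(decisionTasks, out history);
    public bool TryPollActivity(out string activityName) => TryTake(activityTasks, out activityName);

    private static bool TryTake(Queue<string> queue, out string item)
    {
        if (queue.Count == 0) { item = null; return false; }
        item = queue.Dequeue();
        return true;
    }
}

public static class ConceptDemo
{
    // Custom code drives everything: the decider answers "what to do", the worker does it.
    public static List<string> Run()
    {
        var service = new MessageServiceSketch();
        var log = new List<string>();
        service.ScheduleDecision("started");

        while (true)
        {
            if (service.TryPollDecision(out var history))
            {
                // Decider: after "started", request one activity; after it completes, finish.
                if (history == "started") service.ScheduleActivity("DemoActivity1");
                else { log.Add("workflow complete"); break; }
            }
            else if (service.TryPollActivity(out var activity))
            {
                // Worker: do the work, then report back, which yields a new decision task.
                log.Add("ran " + activity);
                service.ScheduleDecision("completed " + activity);
            }
            else break; // nothing left to deliver
        }
        return log;
    }
}
```

Calling `ConceptDemo.Run()` returns two log entries, "ran DemoActivity1" followed by "workflow complete": the service never decided anything, it only moved messages between the two pieces of custom code.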

Note: Readers wishing to learn more about the workflow’s lifecycle may find the AWS’ discussion an excellent place to begin their studies.

If you’ve worked with Microsoft MSMQ, you will find this familiar, barring a few abstractions. The most obvious abstraction is that SWF fully manages and hides the underlying queue instance. Another abstraction is the way that Activity types wrap text-based messages. There will be unfamiliar concepts too, such as the fact that there is a Decision Activity type that decides when and what to enqueue and dequeue. MSMQ has no equivalent feature.

There is one significant difference between SWF and MSMQ: SWF neither directly nor indirectly executes custom code. It also does not reference, or house, any custom components or DLLs.

Getting Started

SWF solutions are made easier to code and debug by using the access credential feature of an AWS account. Access credentials are not the same as the username and password credentials for logging into AWS. They are exposed via the Identity and Access Management (IAM) page.

Access credentials expose AWS resources to developers via command line and integrated development environment (IDE) tools.

The next step towards writing SWF code with Microsoft Visual Studio 2015 is to download and install the AWS SDK for .NET.

The SDK sports several helpful features. The first of these is the ability to create a profile that holds a developer’s access keys. This dramatically reduces the security hassles of debugging and deploying code to the cloud with Visual Studio. You can load credentials with the AWS Explorer as shown below.

You don’t actually need to name a profile ‘default’; some might prefer to use their account name. The advantage of a common name such as ‘default’ arises when sharing code. The SDK allows for designating a profile name within a project as shown below.

With developer-specific profile names, each developer would need either to update the ‘AWSProfileName’ property or to create shared access keys. Few developers would find either option palatable, I suspect.

You get a lot of flexibility by being able to specify an AWS region, but doing so is not trivial. All this article’s efforts occur within ‘us-west-2’. If you are working purely in code, it is almost impossible to specify the wrong region, because you have to make it explicit. But when you are working within the AWS user-interface, it’s a detail that is easily missed, as you will see from the toolbar snippet for our ‘stdemodevuser’ user.

Developers who are unfamiliar with AWS’ region definitions might understandably not know that ‘us-west-2’ equates to the ‘US West (Oregon)’ selection.

Before tackling our demonstration solution, readers may find it helpful and informative to create and execute the Aws Image Process Workflow Sample project provided with the SDK. Create it by selecting the template as shown below. As we’ve already discussed, you will need some valid access keys before you start.

If you run the ‘Aws Image Process Workflow Sample’ project with adequate permissions, you will see the WPF form below. You don’t need to click the ‘Start Workflow execution’ button, though it is a fun exercise. (Don’t forget to select an image and name an S3 bucket before you do!)

Before leaving the Aws Image Process Workflow Sample, inspect the HistoryIterator class in the project. Our solution borrows this wrapper, which exposes the all-important event history that workflow instances generate and that the ‘My Workflow Executions’ page of the Amazon Simple Workflow Service Dashboard displays. The Events tab shown below resulted from putting the ‘Aws Image Process Workflow Sample’ through its paces.

Note: Running the Aws Image Process Workflow Sample project may incur costs, although as of December 2016 they are nominal.

Building an SWF Solution

Ironically, the effort that is required to implement an SWF solution makes you wonder why the ‘S’ stands for ‘Simple’. Code and workflow properties must dovetail tightly to properly exploit SWF, and that can make things complicated. Before diving into the code we’ll begin by creating several artefacts that the SWF service expects.

SWF Setup

Before you can run a workflow, you need three SWF artefacts: a domain, a workflow and an activity. The domain acts as a container for workflows and activities, along with their related event history. Workflows and activities cannot jump domains. Therefore, our demonstration begins by creating the ‘StDemoDomain’ domain. We do this from the Amazon Simple Workflow Service Dashboard by clicking the ‘Manage Domain’ button, which brings up another page that includes a ‘Register New’ button.

Clicking the ‘Register’ button creates the domain. A domain may hold multiple workflows, but for our purposes the solitary ‘StDemoWorkflow’ workflow, registered next, is sufficient.

Once the domain exists, don’t forget to select it in ‘Amazon Simple Workflow Service Dashboard’, as shown below, before registering any other items.

Now we can add a workflow to the domain!

Before clicking ‘Register Workflow’, it pays to scan the properties. Most of them deal with runtime behavior and can be overridden in code. When experimenting with SWF, most of the above property values will work well enough for starting out.

The last chore before writing code is to register activities for our demonstration. For simplicity’s sake, we created a family of them with equally expressive names: ‘DemoActivity1’, ‘DemoActivity2’, ‘DemoActivity3’ and ‘DemoActivity4’. Except for the ‘Activity Type Name’ and ‘Description’, all of them are similarly defined.

Scan the properties before clicking ‘Register Activity’ in the ‘Create Activity’ wizard review page. Most of these deal with runtime behavior and can be overridden in code when necessary.

One final note about registration: developers who wish to avoid these manual steps can register any and all SWF items via code. The AWS Image Process Workflow Sample contains sample code that bypasses the chore of manual item registration.

The Big Picture

Our demonstration implements two workflow scenarios, both built from patterns regularly encountered in production applications. Both run on the recently registered ‘StDemoWorkflow’, ‘DemoActivity1’, ‘DemoActivity2’, ‘DemoActivity3’ and ‘DemoActivity4’ SWF items.

Before we delve into the details of our workflows, note that none of our registered items contain any information about the flow. We have not configured anything anywhere that might inform SWF that ‘DemoActivity2’ follows ‘DemoActivity1’. Developers who are familiar with other workflow products, such as Microsoft BizTalk Server, may have already noticed this detail. As the demonstration implementation shows, SWF expects the developer to construct workflows via decision activity types.

Scenario 1

The first scenario places two parallel activities within a serial workflow. Our discussion of it shows how we can make a workflow via decision activity types.

Scenario 2

The second scenario retries an activity after an elapsed time period. This common pattern requires us to exploit the SWF event history more aggressively, as mentioned in our discussion of the Aws Image Process Workflow Sample project.

With preliminaries out of the way, let’s explore the implementation.

Demonstration Solution

The SwfDemo solution implements our two scenarios. Both execute within the same console application, allowing a developer to step through and debug each of them. Remember that SwfDemo just demonstrates a concept. In production, for example, the code generating decision activities might run on different servers or in different processes from the code processing the “worker” activities, to improve throughput.

Solution Overview

SwfDemo contains one project with the same name and several classes.

  • ActivityManager – Once started, it continuously asks SWF for the next task to process via its Poll function, then executes business logic via ProcessTask, and reports back to SWF via CompleteTask.

  • DecisionActivityManager – Once started, it also continuously asks SWF for any decision tasks requiring the creation of a decision list. With one in hand, the CreateDecisionList function finds the latest WorkDemoActivityState, from which it creates a decision activity with one of three functions: CreateDecisionActivity, CreateTimerDecision, or CreateCompleteWorkflowDecision.

  • HistoryIterator – This reads the workflow instance’s event log (see earlier discussion). Readers who are interested in learning more about it should review comments found in the Aws Image Process Workflow Sample project’s implementation.
  • Program – Console application driver that starts both the ActivityManager and the DecisionActivityManager and responds to user workflow requests.
  • WorkDemo – Simulates ‘real work’ that the ActivityManager executes.
  • WorkDemoActivityState – A collection of workflow properties that is passed between SWF artefacts. These define the instance as well as assist decision-making.

These displayed references constitute SwfDemo’s minimum needs. Production applications will likely include business-specific references or other system components facilitating communications with business data stores, lambda functions, services, queues, etc.

Note: SWF does not require Newtonsoft.Json; it just eases working with JSON for our purposes.

The rest of this section explores SwfDemo key classes. We begin with what is apparently the simplest, WorkDemoActivityState, and end with the most complex, DecisionActivityManager.

WorkDemoActivityState

This unassuming class serves as the home of the information which workflow instances inspect, respond to and update. WorkDemoActivityState enables our workflow to remain stateless since it is passed between the different SWF items. Managing state via an information-rich object such as WorkDemoActivityState is optional, a subject we will revisit near the end of the article.

ScenarioNumber exemplifies a typical workflow instance state property. It informs both ActivityManager and DecisionActivityManager which of the different scenarios the user requested. In a production application, this information might define business-specific information, such as customer account or inventory item number. On the other hand, EventType, an SWF runtime property, does not denote any business interest. Rather, it is helpful data for the DecisionActivityManager as we will shortly see.

WorkRequested and WorkCompleted indirectly record the different activities via the numeric suffixes of their names. For example, a 2 in WorkRequested implies that ‘DemoActivity2’ has been requested. There is a risk in including such information in a workflow state class: managing such details demands close attention, as you’ll soon learn.
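The class's code is not reproduced here, so the following is a minimal sketch of what WorkDemoActivityState might look like. The property names come from the article; their types, the Copy method and the HasRequested, HasCompleted and ActivityNumber helpers are assumptions. The article serializes the object with Newtonsoft.Json; that step is left out so the sketch needs only the base class library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of the state object passed between SWF items.
// Property names follow the article; types and helper methods are assumptions.
public sealed class WorkDemoActivityState
{
    // Business property: which scenario the user requested.
    public int ScenarioNumber { get; set; }

    // SWF runtime properties copied from the history event that produced this state.
    public string EventType { get; set; }
    public DateTime EventTimestamp { get; set; }

    // Activity numbers; for example, 2 stands for "DemoActivity2".
    public List<int> WorkRequested { get; set; } = new List<int>();
    public List<int> WorkCompleted { get; set; } = new List<int>();

    public bool HasRequested(int activityNumber) => WorkRequested.Contains(activityNumber);
    public bool HasCompleted(int activityNumber) => WorkCompleted.Contains(activityNumber);

    // Maps an activity name such as "DemoActivity2" to its numeric suffix.
    public static int ActivityNumber(string activityName) =>
        int.Parse(activityName.Substring("DemoActivity".Length));

    // Deep copy, so a state rebuilt from one event never shares lists with another.
    public WorkDemoActivityState Copy() => new WorkDemoActivityState
    {
        ScenarioNumber = ScenarioNumber,
        EventType = EventType,
        EventTimestamp = EventTimestamp,
        WorkRequested = WorkRequested.ToList(),
        WorkCompleted = WorkCompleted.ToList()
    };
}
```

The deep copy matters because the decider rebuilds one state per history event; sharing list instances between those states would let a later event silently rewrite an earlier one.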

Despite an unassuming nature, workflow state management classes constitute a critical element of most SWF solution designs. For example, external business information properties will likely lead to the execution of different workflows, such as one updating customer accounts or another checking inventory stock levels. Likewise, custom code may read workflow internal processing properties to determine next steps.

Program

The Main method serves two purposes. The first is to kick off tasks for processing the two activity types. Although these tasks could reside on other servers and in different processes to improve scale, they run together in our demonstration for simplicity. The loop that follows serves the second purpose: it traps user input to initiate new workflow instances via ExecuteScenario.

In order to get meaningful output, you need to inspect the SWF Dashboard when executing Main. We’ll show this later in the article.

ExecuteScenario kicks off the user-requested workflow. It begins by constructing a WorkDemoActivityState object, which is then passed to an AmazonSimpleWorkflowClient instance configured to communicate with the ‘USWest2’ AWS endpoint. Any region could be chosen, but because our SWF items are registered in ‘USWest2’ (Oregon), the client must target that region.

Without wishing to bore readers by repeating information found in the documentation, there are a few mistakes that are easily made when configuring a workflow request. Domain, WorkflowType Name and WorkflowType Version must exactly match the registered values; capitalization matters. WorkflowId does not have to be a GUID, but it must be unique among workflow instances. TagList is not required, but it provides useful context when looking at workflow information in the Amazon Simple Workflow Service Dashboard.

Setting ExecutionStartToCloseTimeout to 30 seconds is not an entirely arbitrary choice. It is just enough time for our scenarios to complete. Deciding how long a workflow instance “lives” becomes critical in production environments. While SWF cannot terminate any executing activity processing when a workflow expires, it will cease generating, receiving and reacting to any activity messages.
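The pitfalls above can be caught before calling SWF with a small pre-flight check. This is a minimal sketch, not part of the AWS SDK: `StartRequestSketch` and `StartRequestChecks` are hypothetical and only mirror a few properties of the SDK's `StartWorkflowExecutionRequest`. Two rules it encodes are worth knowing: names are compared case-sensitively, and SWF expresses timeouts as strings holding whole seconds, so 30 seconds is sent as "30". SWF also rejects a WorkflowId already used by an open execution in the same domain, which the check models with a set of open ids supplied by the caller.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical mirror of a few StartWorkflowExecutionRequest properties.
public sealed class StartRequestSketch
{
    public string Domain { get; set; }
    public string WorkflowTypeName { get; set; }
    public string WorkflowTypeVersion { get; set; }
    public string WorkflowId { get; set; }
    public string ExecutionStartToCloseTimeout { get; set; } // whole seconds, as a string
}

public static class StartRequestChecks
{
    // SWF timeouts are whole seconds written as strings.
    public static string ToSwfSeconds(TimeSpan span) =>
        ((long)span.TotalSeconds).ToString(CultureInfo.InvariantCulture);

    // Returns the problems found; an empty list means the request looks consistent.
    public static List<string> Validate(StartRequestSketch request, string registeredDomain,
        string registeredName, string registeredVersion, ISet<string> openWorkflowIds)
    {
        var problems = new List<string>();
        // Registered names must match exactly, so compare ordinally (case-sensitive).
        if (!string.Equals(request.Domain, registeredDomain, StringComparison.Ordinal))
            problems.Add("Domain does not match the registered domain.");
        if (!string.Equals(request.WorkflowTypeName, registeredName, StringComparison.Ordinal))
            problems.Add("WorkflowType Name does not match.");
        if (!string.Equals(request.WorkflowTypeVersion, registeredVersion, StringComparison.Ordinal))
            problems.Add("WorkflowType Version does not match.");
        // The id may be any string, but it must not collide with an open execution.
        if (string.IsNullOrEmpty(request.WorkflowId) || openWorkflowIds.Contains(request.WorkflowId))
            problems.Add("WorkflowId is missing or already in use.");
        return problems;
    }
}
```

A request whose domain is spelled ‘stdemodomain’ instead of ‘StDemoDomain’ produces exactly one problem here, which is the kind of mistake that is otherwise only reported by the service at run time.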

ActivityManager and WorkDemo

Once triggered via the Start method, ActivityManager continuously asks SWF if there is an activity for it to ponder. If one exists, it first performs the “business” function via WorkDemo’s WasteTime method and then reports back to the service.

The unending while loop leverages the ubiquitous AmazonSimpleWorkflowClient to talk with SWF. Its first chore is to ask SWF for an ActivityTask via the Poll method. Within Poll, the request must include both the Domain and TaskList properties. The loop ends with an ugly Thread.Sleep(100) to avoid hogging the CPU; ugly as it is, the need to deal with such concerns may not disappear in a production solution.

While discussing the full impact of the TaskList goes beyond the scope of this article, a few points are worth noting. First, juggling the TaskList is an advanced option. Second, the TaskList determines which tasks SWF returns when it is polled for decisions or activities, so our demonstration uses the ‘defaultTaskList’ Name everywhere to keep both us and SWF from getting confused.

After patiently waiting, Poll eventually gets a response with an activity for processing. From the response we extract and deserialize the aforementioned WorkDemoActivityState instance, which contains our business and workflow information and enables us to call WasteTime.

When ProcessTask finishes the business-critical processing, it reports back to SWF how things fared. Within the RespondActivityTaskCompleted call, it is important to include the updated WorkDemoActivityState object; otherwise, subsequent activities will never learn what this one did.
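The poll, process and report cycle can be sketched without the SDK by passing the three steps in as delegates. This is a hypothetical shape, not the article's ActivityManager: `ActivityTaskSketch`, `RunActivityLoop` and its parameters are assumptions, and a CancellationToken replaces the unending while loop so the loop can be stopped. In real code the `poll` delegate would wrap the client's PollForActivityTask call and `reportCompleted` would wrap RespondActivityTaskCompleted; the string returned by `process` plays the role of the updated WorkDemoActivityState that must be sent back. Unlike the article's loop, this sketch only pauses when a poll comes back empty.

```csharp
using System;
using System.Threading;

// Hypothetical, SDK-free view of an activity task: the token SWF uses to correlate
// the reply, plus the serialized state (Input).
public sealed class ActivityTaskSketch
{
    public string TaskToken { get; set; }
    public string Input { get; set; }
}

public static class ActivityLoopSketch
{
    // Poll -> process -> report, until cancelled. Returns how many tasks were completed.
    public static int RunActivityLoop(
        Func<ActivityTaskSketch> poll,           // e.g. wraps PollForActivityTask
        Func<string, string> process,            // business work: input state -> updated state
        Action<string, string> reportCompleted,  // e.g. wraps RespondActivityTaskCompleted(token, result)
        TimeSpan idleDelay,
        CancellationToken cancellation)
    {
        int completed = 0;
        while (!cancellation.IsCancellationRequested)
        {
            var task = poll();
            // An empty poll result carries no task token: nothing to do yet.
            if (task != null && !string.IsNullOrEmpty(task.TaskToken))
            {
                string updatedState = process(task.Input);
                // Send the updated state back, or later steps never see our changes.
                reportCompleted(task.TaskToken, updatedState);
                completed++;
            }
            else
            {
                // Plays the part of the article's Thread.Sleep(100), but wakes early on cancel.
                cancellation.WaitHandle.WaitOne(idleDelay);
            }
        }
        return completed;
    }
}
```

Injecting the three steps keeps the loop testable with fakes, and running a second instance of it on another Task is all it takes to get real parallelism.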

WorkDemo does not perform any real work other than burn computer cycles and add the activity number to the WorkCompleted list. Including some delay is important when fleshing out SWF prototypes, because it surfaces timing behavior early; managing timeouts becomes an important consideration when consuming production resources.

DecisionActivityManager

The success of most SWF solutions resides in a class like DecisionActivityManager. It starts off much like ActivityManager, except that it looks for the next decision activity, which SWF creates once ActivityManager has reported its result via CompleteTask.

The endless loop becomes more interesting when SWF provides a meaningful decision activity, as indicated by a non-null TaskToken. When that happens, the loop creates a list of decisions to return to SWF. This list embodies the otherwise undeclared workflow scenario flow.

Building the decision list involves two tasks:

  1. Interpreting what has happened thus far in the workflow instance as recorded by SWF
  2. Deciding what activity or activities come next based on that history.

There are many ways of building decision lists. The method that follows suits this particular job. It goes through the Event history, reconstructing the most recent WorkDemoActivityState instance, from which we choose the activities to include in the decision list. Whatever the tactics, responding to a workflow instance’s Event history is required in any non-trivial SWF application.

After acquiring the raw Event history via the HistoryIterator helper, we build a history of WorkDemoActivityState instances from which we cull the most recent. The job of constructing each WorkDemoActivityState begins with inspecting what SWF Workflow event it is associated with. With that insight we can appropriately read and update WorkDemoActivityState objects.

The first event that we respond to occurs after initializing a workflow instance. Rehydrating the WorkDemoActivityState provides the business property ScenarioNumber; the non-business properties EventTimestamp and EventType are read directly from the history event instance. The WorkRequested and WorkCompleted lists are initialized for subsequent usage.

The next event handler reflects how SWF works. The ActivityTaskCompleted history event is SWF’s record that ActivityManager finished the work requested by the prior decision list. SWF then schedules a new decision task, which prompts us for the next decision list.

Discriminating between the different homes for WorkDemoActivityState values in event history is easily overlooked. EventType.WorkflowExecutionStarted history stashes it in the appropriately named Input property. EventType.ActivityTaskCompleted history prefers the equally suitably named Result property.

The last two conditions of the loop are not as obvious as the first two. They attempt to summarize timer events in a meaningful fashion for the application, enlisting a temporary Dictionary with string keys and WorkDemoActivityState values. This method-local dictionary, timerWorkDemoActivityStates, allows us to recall the WorkDemoActivityState associated with the request for a specific timer via its TimerId.

It’s understandable if these gyrations strike the reader as a kludge. But they reflect how SWF records timers: it includes minimal information with an EventType.TimerFired event. If we want the WorkDemoActivityState associated with the timer, we need to inspect the linked EventType.TimerStarted history event.
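The event walk described above can be sketched over a simplified event model. This is a sketch under stated assumptions, not the article's GetLastWorkDemoActivityState: `HistoryEventSketch`, `StateSketch` and `HistorySketch` are hypothetical, each event's payload is flattened into one string, and deserialization is passed in as a delegate instead of using Newtonsoft.Json. In the SDK the payload lives in WorkflowExecutionStartedEventAttributes.Input, ActivityTaskCompletedEventAttributes.Result or TimerStartedEventAttributes.Control, and TimerFiredEventAttributes carries only the TimerId and StartedEventId, which is why the dictionary is needed.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, flattened stand-in for the SDK's HistoryEvent.
public sealed class HistoryEventSketch
{
    public string EventType { get; set; }        // e.g. "TimerFired"
    public DateTime EventTimestamp { get; set; }
    public string Payload { get; set; }          // Input, Result or Control, depending on EventType
    public string TimerId { get; set; }          // only for TimerStarted / TimerFired
}

public sealed class StateSketch
{
    public int ScenarioNumber { get; set; }
    public string EventType { get; set; }
    public DateTime EventTimestamp { get; set; }
}

public static class HistorySketch
{
    // Rebuilds one state per relevant event and returns the most recent one, or null.
    public static StateSketch GetLastState(IEnumerable<HistoryEventSketch> events,
                                           Func<string, StateSketch> deserialize)
    {
        var states = new List<StateSketch>();
        // Remembers the state attached to each timer when it started, keyed by TimerId.
        var timerStates = new Dictionary<string, StateSketch>();

        foreach (var e in events)
        {
            switch (e.EventType)
            {
                case "WorkflowExecutionStarted":  // state arrives in Input
                case "ActivityTaskCompleted":     // state arrives in Result
                    states.Add(Stamp(deserialize(e.Payload), e));
                    break;
                case "TimerStarted":              // state was stashed in Control
                    timerStates[e.TimerId] = deserialize(e.Payload);
                    break;
                case "TimerFired":                // only the TimerId links back to TimerStarted
                    if (timerStates.TryGetValue(e.TimerId, out var started))
                        states.Add(Stamp(started, e));
                    break;
                // Other event types (DecisionTaskScheduled and so on) carry no state here.
            }
        }
        return states.Count == 0 ? null : states[states.Count - 1];
    }

    // Copies the SWF runtime properties from the event onto the rebuilt state.
    private static StateSketch Stamp(StateSketch state, HistoryEventSketch e)
    {
        state.EventType = e.EventType;
        state.EventTimestamp = e.EventTimestamp;
        return state;
    }
}
```

Stamping EventType onto the rebuilt state is what later lets the decision logic tell a completed activity apart from a fired timer.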

After working through all historyIterator items and constructing the related WorkDemoActivityState list, we are ready to return the prime object of our attention, the most current one.

DecisionActivityManager’s CreateDecisionList constructs workflow instance decision lists based on the latest WorkDemoActivityState. It relies on a switch statement acting on ScenarioNumber to fabricate it.

Scenario 1 begins by checking whether or not it has processed the last activity, ‘DemoActivity4’. It does so by looking for the activity name’s last character in the WorkCompleted list. If so, it sends a message to SWF that the workflow has ended via CreateCompleteWorkflowDecision.

Our next condition checks whether ‘DemoActivity1’ has been requested by looking for the activity name’s last character in the WorkRequested list. If it has not, it sends a message to SWF that it needs to spin up an activity request with CreateDecisionActivity.

The check for ‘DemoActivity2’ behaves differently from the last one. When WorkRequested does not contain the expected 2 integer value, we call CreateDecisionActivity twice for ‘DemoActivity2’ and ‘DemoActivity3’ thus creating parallel activities.

Attentive readers will no doubt have noticed that our application will not truly process ‘DemoActivity2’ and ‘DemoActivity3’ in parallel, because only one ActivityManager polls for work. Nonetheless, parallel processing is possible by adding another .NET Task executing ActivityManager.Start in Main. In a production environment, code akin to ActivityManager.Start will likely reside on other servers, tasks and processes to enhance scale.

The first scenario concludes by checking for ‘DemoActivity2’. If found, it adds the last activity, ‘DemoActivity4’ in a fashion similar to that of the first activity.
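The Scenario 1 rules can be summarized as a pure function. This is a sketch under stated assumptions, not the article's CreateDecisionList: decisions are plain strings instead of SDK Decision objects, and `Scenario1Sketch` is a hypothetical name. The join condition (waiting until both ‘DemoActivity2’ and ‘DemoActivity3’ have completed before scheduling ‘DemoActivity4’) is this sketch's choice; the article's code checks for ‘DemoActivity2’ only. The `requested` and `completed` collections are assumed to be accumulated across the whole event history rather than read from a single state object, which sidesteps the lost update discussed in the In Action section.

```csharp
using System.Collections.Generic;

public static class Scenario1Sketch
{
    // Returns the decisions for one decision task, as readable strings.
    public static List<string> Decide(ICollection<int> requested, ICollection<int> completed)
    {
        var decisions = new List<string>();
        if (completed.Contains(4))
        {
            // The last activity has finished: close the workflow.
            decisions.Add("CompleteWorkflow");
        }
        else if (!requested.Contains(1))
        {
            // Nothing requested yet: start the serial part.
            decisions.Add("Schedule DemoActivity1");
        }
        else if (!requested.Contains(2))
        {
            // Two schedule decisions in one list let SWF hand out both tasks at once.
            decisions.Add("Schedule DemoActivity2");
            decisions.Add("Schedule DemoActivity3");
        }
        else if (!requested.Contains(4) && completed.Contains(2) && completed.Contains(3))
        {
            // Join (an assumption of this sketch): wait for both parallel branches.
            decisions.Add("Schedule DemoActivity4");
        }
        // An empty list means there is nothing new to schedule yet.
        return decisions;
    }
}
```

With requested = {1, 2, 3} and completed = {1, 2}, the function returns an empty list: one branch is still running, so the decider simply waits for the next decision task.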

Decision-making mechanics for the second scenario differ from the first. In this simpler workflow the code just wants to know how many times ‘DemoActivity1’ has been requested. Based on that count, it can add a ‘DemoActivity1’ as already seen, create a timer-driven decision activity type via CreateTimerDecision, or end the workflow.

Unlike much of the other decision making logic within CreateDecisionList, we now check the EventType. This information helps us decide whether we need to add a first or second ‘DemoActivity1’ activity request or create a decision timer.
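A matching sketch for Scenario 2 follows, again with string decisions and a hypothetical `Scenario2Sketch` class. The count-based approach and the 15-second timer come from the article; exactly how the count and the last EventType map to each decision, and the assumption of a single retry, are this sketch's reading of the text.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Scenario2Sketch
{
    // Chooses the next step from how often DemoActivity1 was requested and which event came last.
    public static string Decide(IEnumerable<int> requested, string lastEventType)
    {
        int attempts = requested.Count(n => n == 1);

        if (attempts == 0)
            return "Schedule DemoActivity1";      // first attempt
        if (attempts == 1 && lastEventType == "ActivityTaskCompleted")
            return "StartTimer 15";               // wait 15 seconds before retrying
        if (attempts == 1 && lastEventType == "TimerFired")
            return "Schedule DemoActivity1";      // second attempt, after the timer
        if (attempts >= 2 && lastEventType == "ActivityTaskCompleted")
            return "CompleteWorkflow";            // retried once: done
        return "None";                            // an event this sketch does not act on
    }
}
```

The same count (one request so far) leads to two different decisions, which is exactly why the EventType has to be consulted here.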

The DecisionActivityManager class review concludes with a few helpers. While much of it may now seem familiar, there are a few details worth noting. First, ActivityId and TimerId must be unique for their activity types among all active workflow instances. When SWF encounters duplicate ids, it assumes something went awry and reissues a request for a new decision task list. Second, note the different attribute classes and their property names referencing the serialized WorkDemoActivityState.

Note: DateTime.Now.Ticks does not ensure uniqueness. While it is fine for demonstration purposes, generating unique ids for production usage will require additional deliberation.

Appending activitySuffix to Ticks for the ActivityId prevents SWF from getting confused with duplicates when adding our parallel activities so quickly.
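The two id strategies can be compared side by side. `IdSketch` and both method names are hypothetical; `TicksId` follows the article's description (ticks plus a suffix), while `GuidId` is one alternative, not the article's method, that avoids relying on the clock at all.

```csharp
using System;
using System.Globalization;

public static class IdSketch
{
    // The article's approach: current ticks plus a suffix. Distinct suffixes keep two
    // activities scheduled within the same tick apart, but ticks alone are not unique
    // across threads, processes or machines.
    public static string TicksId(string activitySuffix) =>
        DateTime.Now.Ticks.ToString(CultureInfo.InvariantCulture) + activitySuffix;

    // One alternative: a readable prefix plus a random GUID (32 hex digits, no dashes).
    public static string GuidId(string prefix) => prefix + "-" + Guid.NewGuid().ToString("N");
}
```

The suffix trick only protects activities scheduled by the same decision; two deciders reading the clock at the same instant could still collide, which is the production concern the note above raises.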

Timer decision activities need to know how many seconds to wait before firing, that is, before SWF records the matching TimerFired event. Our implementation sets StartToFireTimeout to 15 seconds. While our demonstration loads WorkDemoActivityState into Control, SWF does not require this; we include it so that GetLastWorkDemoActivityState can recover the object.

Result is optional, but, like the aforementioned TagList, it can be helpful when plumbing the Dashboard’s My Workflow Executions page for insights.

In Action

Visiting the Dashboard My Workflow Executions page after running each scenario offers an excellent glimpse into SWF processing.

My Workflow Executions displays either ‘Open’ or ‘Closed’ workflows for the filtered time period. The default is ‘Open’, and forgetting this detail when searching for a recently closed workflow may frustrate those new to the SWF interface.

Reviewing the Activities tab for the first scenario suggests that all the expected activities, ‘DemoActivity1’ through ‘DemoActivity4’, executed. Closer inspection of the selected ‘DemoActivity4’ properties, though, suggests that something may not have gone entirely as expected.

The Result property, which holds the workhorse WorkDemoActivityState, informs us that WorkCompleted differs from WorkRequested. According to the WorkCompleted list, ‘DemoActivity2’ was never executed, despite what the Activities tab shows. What gives?

SWF happily runs activities concurrently, but it knows nothing about WorkDemoActivityState and does nothing to reconcile the separate copies handed to parallel activities. Unfortunately, our implementation writes to WorkCompleted in one task and reads it in another one, so the copy returned last silently overwrites the other. Fortunately, our flawed implementation never suffers serious consequences: the workflow still runs all four activities. Avoiding shared state errors starts with the design of the SWF solution.
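The lost update can be reproduced in a few lines of plain C#. This hypothetical sketch does not touch SWF: each parallel branch gets its own copy of WorkCompleted, as happens when the serialized state is handed to ‘DemoActivity2’ and ‘DemoActivity3’ separately. Keeping only the most recently completed copy drops one branch's work; merging the copies (a set union, one possible fix among others) keeps both.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class LostUpdateSketch
{
    // Each parallel branch starts from its own copy of the state it was scheduled with.
    public static List<int> RunBranch(IEnumerable<int> scheduledWith, int activityNumber)
    {
        var copy = scheduledWith.ToList();
        copy.Add(activityNumber); // the branch records only its own work
        return copy;
    }

    // What trusting the most recent state amounts to: the last copy to complete wins.
    public static List<int> LastWriterWins(params List<int>[] results) => results.Last();

    // One possible fix: merge every branch's result instead of picking one.
    public static List<int> Merge(params List<int>[] results) =>
        results.SelectMany(r => r).Distinct().OrderBy(n => n).ToList();
}
```

Starting from WorkCompleted = [1], the branches return [1, 2] and [1, 3]; last-writer-wins keeps [1, 3], the same symptom the Result property revealed, while the merge yields [1, 2, 3].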

Our second workflow executes without any obvious issues upon inspection of its Activities.

Before we can congratulate ourselves though, try quickly running several workflows as shown below.

Returning to the My Workflow Executions page after a few minutes may, depending on your environment, reveal a new issue. As shown below, some workflow instances timed out.

We get the explanation when we add up the time that each workflow consumes and compare it to the time allowed when it was requested. Our StartWorkflowExecutionRequest’s ExecutionStartToCloseTimeout value of 30 seconds isn’t long enough for our implementation, with its single ActivityManager, to work through many simultaneous workflows.

Implementation Considerations

Architecting SWF-dependent applications involves an abundance of design and configuration choices, and, as our two scenarios’ issues suggest, it is not difficult to get some of them wrong. In this section, we note a few thoughts to consider when building solutions.

Cost

Although not a technical worry, the cost of any technology eventually influences any design decision. As of December 2016, though, it should not be a significant factor when considering SWF. With usage charges ranging from free to embarrassingly inexpensive, exploring, prototyping and testing incurs negligible infrastructure related costs. Most production applications find the cost of the resources employed by SWF more noteworthy than those billed for SWF.

Resources

Although SWF minimizes its own infrastructure costs, it does not do the same for the resources it consumes. Because SWF lets you segregate almost every facet of an implementation, it is almost impossible to design a solution without being aware of all the resources ultimately involved.

State

Two broad approaches exist for maintaining state in SWF. The first follows variants of the tack employed in this article’s demonstration code. The alternative involves using an external data store. The first approach usually scales best but demands careful coding. The second also requires care as it introduces an independent resource into the mechanics of SWF.

Configuration

There are many configuration options to help deal with more advanced scenarios, but these are beyond the scope of this article. API support is moderately complete. It ranges from allowing the registration of SWF artefacts to altering runtime properties. This section only touches upon a few of them. I have found the SDK to be adequate when asking tricky questions.

Priority

Our demonstration did not ask any priority questions. However, you can alter the order in which SWF hands tasks to polling requests by managing properties such as defaultTaskPriority and taskPriority, where higher numbers indicate higher priority. This is useful, for example, if you wish workflow requests generated from a user interface to be handled before batch-oriented requests.
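The effect of priorities can be pictured with a small in-memory model. It only illustrates the ordering rule (higher number first); SWF applies priorities on the service side, `PendingTask` and `PrioritySketch` are hypothetical, and breaking ties by arrival order is an assumption of the sketch rather than a documented SWF guarantee.

```csharp
using System.Collections.Generic;
using System.Linq;

// In-memory illustration only; SWF performs this ordering itself.
public sealed class PendingTask
{
    public string Name { get; set; }
    public int Priority { get; set; }   // higher value is served first
    public long Sequence { get; set; }  // arrival order, used to break ties
}

public static class PrioritySketch
{
    public static List<string> DeliveryOrder(IEnumerable<PendingTask> pending) =>
        pending.OrderByDescending(t => t.Priority)
               .ThenBy(t => t.Sequence)
               .Select(t => t.Name)
               .ToList();
}
```

A user-interface request given priority 10 is delivered ahead of batch requests left at 0, even if it arrived after them.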

TaskList

If there were a “most confusing” title award in the SWF SDK, the taskList property gets my vote. It’s overloaded between activity types and enables segregation by resource for processing. This allows developers to ensure, for example, that only certain servers handle specific activities.
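In the SWF API, a task list is named when an activity is scheduled (through the schedule decision's TaskList) and again when a worker polls (the TaskList of the poll request), and a worker only receives tasks from the list it polls. The hypothetical in-memory router below mimics that rule so it is easy to see; it is not how SWF is implemented, and names such as ‘imageServers’ are made up for illustration.

```csharp
using System.Collections.Generic;

// Hypothetical in-memory router: each task list is a separate queue, and a worker
// only ever polls the list it was configured with.
public sealed class TaskListRouterSketch
{
    private readonly Dictionary<string, Queue<string>> lists = new Dictionary<string, Queue<string>>();

    public void Schedule(string taskList, string activityName)
    {
        if (!lists.TryGetValue(taskList, out var queue))
        {
            queue = new Queue<string>();
            lists[taskList] = queue;
        }
        queue.Enqueue(activityName);
    }

    // A worker polling "imageServers" never sees tasks scheduled on "defaultTaskList".
    public string Poll(string taskList) =>
        lists.TryGetValue(taskList, out var queue) && queue.Count > 0 ? queue.Dequeue() : null;
}
```

This is the mechanism behind pinning specific activities to specific servers: schedule them on a dedicated list and run pollers for that list only on those machines.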

Timeout

Aside from discovering that workflow instances can time out, we barely scratched the surface of this subject. Exactly how time impacts workflows and activities depends on implementation details, so anticipate monitoring and tweaking the timeout settings.

Workflows

Our demonstration employed one registered workflow. Although it worked well enough, it was also quite naïve. Production implementations regularly employ multiple workflows to simplify implementations, in terms of code and resources. They also can help minimize risks associated with workflow instances generating exceedingly large event histories.

A Simpler Option?

Many developers who encounter SWF for the first time are likely to ask the reasonable question, “Why so complex?” The answer contains two parts. First, SWF’s full name, Simple Workflow Service, means just that: simple. This complete service does not lend itself to rapid application development. Second, building any durable and reliable workflow-driven application, such as SwfDemo, is a considerable task, whatever you use to build it.

Late in 2016, AWS addressed such pressing needs with the introduction of the Step Functions service. It facilitates the creation of state machines with a JSON-driven, graphical designer. The picture below is an adaptation of one of our scenarios as a Step Functions state machine.

As with most things that sound too good to be true, there’s a catch. Step Functions execute AWS Lambda functions, which come with restrictions. One of these is that the current C# Lambda release only supports the .NET Core runtime. Despite potential roadblocks, the Step Functions service promises a more straightforward path for developers wishing to exploit durable workflows with as little hassle as possible.

Conclusion

Workflows are simple until you need them to be durable and robust. The article’s demonstration solution will hopefully guide readers through the bewildering early stages of implementing their first workflow with SWF. Beyond implementation details, two generalizations merit note. First, leveraging the AWS Simple Workflow Service from C# is neither easy nor difficult. Second, carefully consider whether your solution requires a durable workflow capability. Adding workflow to an application isn’t like sprinkling sugar on a cake: it will take time and effort. If an application really requires a grown-up workflow, though, SWF stands out as a promising option.