Introducing the fastest, easiest way to perform automated canary analysis on service revisions.

Available as an open-source project under an OSI-approved license.

1. Introduction & Use Case

Canary deployment is a technique that allows a revision of an application or service, referred to as the canary, to be analyzed using a subset of real network traffic. This deployment technique minimizes potential errors and risks for the majority of users by exposing only a minority of the user base to the new revision (Çeliku). However, canary deployments are sophisticated, require considerable configuration, and can be time consuming to both implement and analyze.

Aria is a tool which allows users to simplify the creation, configuration, and analysis of canary deployments. Aria automates much of the onerous infrastructure management while providing the user with the safety and data analysis benefits of canary deploys. Thus, Aria allows the user to focus on configuration which meaningfully impacts analysis of changes to production software.

1.1 Hypothetical

To better understand the use case for Aria, consider ChartHealth, a hypothetical company that provides electronic medical record (EMR) software. ChartHealth's product manages sensitive patient information with a user base spanning hundreds of health care facilities nationally. The ChartHealth product includes features such as reading and writing EMRs, aggregation of patient statistics, and data science models used to identify trends across EMRs.

Screenshot showing ChartHealth's homepage.

Recently, ChartHealth has expanded its business to include facilities in most US states, spanning from Hawaii to Maine. Additionally, ChartHealth's software product has grown in complexity due to feature additions demanded by customers and evolving regulations. These additions have resulted in rapid growth in both the number of developers employed by the company and the size of the product codebase. As a result of the expansion to facilities across five time zones, ChartHealth's product must now have practically zero downtime in order to not lose business to competitors offering similar products.

Diagram showing ChartHealth's migration from monolith to microservices architecture.

In order to meet these new challenges, ChartHealth has decided to migrate their current on-site monolith architecture to a cloud-based microservice architecture. As a result of the planned migration, the company’s monolithic product codebase has been split into multiple services owned and operated by different teams of developers.

1.2 The Problem with Traditional Deployments

When deploying a new revision of software to the monolithic architecture, ChartHealth utilized a traditional 'big-bang' software deployment technique. Big-bang software deployments update whole or large parts of an application in one operation. The inherent downsides of big-bang deployments include exposure of all users to issues introduced by software changes, longer delays between deployments, and extensive testing of the entire codebase per deployment (Skowronski et al.). ChartHealth was able to manage these downsides prior to the recent expansion, when its smaller developer team could more easily manage and comprehend the entire codebase.

In the context of ChartHealth's planned service-based architecture, there are multiple reasons that the 'big-bang' approach is no longer appropriate. First, the EMR management feature must maintain the highest possible data integrity to satisfy government regulations; simply replacing the EMR service for all users in one 'big-bang' presents an unreasonable risk of inaccurate record capture or deletion. Second, even on a service-by-service basis, if issues with a new revision of a service are encountered, rolling production back to the previous revision will likely result in undesired downtime (Skowronski et al.).

With the understanding that its deployment technique must change, ChartHealth's technical leadership undertakes a review of modern deployment techniques and finds a plethora of options.

2. Modern Deployments

In the context of microservice architectures, the goal of modern deployment practices is to achieve small, frequent service revision releases with zero downtime. While the many deployment techniques share this goal, the approaches can be distinguished by their tradeoffs.

With the understanding that a modern deployment strategy must be employed to facilitate deployment of revisions on a service-by-service basis, ChartHealth begins exploring the landscape of modern deployments. The company specifically seeks to understand which of the deployment types are most commonly employed. ChartHealth contacts multiple companies which provide deployment solution products and requests usage data from those suppliers.

Bar graph depicting Vamp and Harness modern deployment usage surveys.

The above chart was compiled from customer surveys conducted by two market-leading providers of Continuous Integration and Deployment (CI/CD) products: Harness and Vamp. Both surveys indicated that the four most common deployment techniques across the two customer bases are big bang, rolling, blue/green, and canary (Vamp; Harness).

2.1 Rolling Deployment

Rolling deployments introduce service revisions by incrementally replacing the current production service code on each of multiple nodes within the architecture. This type of deployment is only applicable if there are multiple instances of the same service operating in production. An interesting, and perhaps desirable, aspect of rolling deployments is that the current production and new revision of a service can coexist simultaneously for extended time periods (Çeliku 12). Rolling deployments can also target specific user groups according to region, IP address, and a variety of other traffic differentiators.

PROS:
  • Rollout status transparency and control
  • Risk limited to a subset of users

CONS:
  • Extended deployment timeframe
  • Not suitable for all-or-nothing changes

2.2 Blue/Green Deployment

Blue/green deployments require two environments: blue and green. The green environment is the current production environment. The blue environment is a near-exact replica of the green (production) environment; the only difference between the two is the revision of the target service. After the newly introduced blue environment has completed initialization, a network switch or load balancer routes 100% of the incoming network traffic to either the blue or green environment exclusively.

As the deployment progresses, the traffic router redirects traffic between the blue and green environments as needed; at any point in time, only one environment is processing incoming network traffic. If the service revision's performance is acceptable, the traffic router can maintain routing to the blue environment. At this point, the blue environment effectively achieves production status and the 'older' green environment can be destroyed (Fowler).

PROS:
  • Instantaneous rollout and rollback
  • Simple disaster recovery

CONS:
  • Duplicate production resources required
  • Entire user-base exposed to service revision issues

2.3 Canary Deployment

This technique involves introducing a revised instance of a service, referred to as the canary, into the same production environment as the current production version of the service. A load balancer routes a minority of the incoming network traffic to the canary instance, while the majority continues to be served by the production service. The canary instance is continually analyzed so that any newly introduced issues can be detected without impacting the majority of the user base (Çeliku 13). A small amount of incoming traffic is directed to the canary initially; data on its performance is collected and analyzed, and the amount of traffic diverted may be increased or decreased based on that analysis (Sato).

The load balancer can divide incoming traffic based on many different criteria, including sticky sessions, geographical region, packet addresses, and packet headers. This allows the canary service to be targeted at specific users or user groups despite executing within the production environment.

PROS:
  • Minimal risk exposure to user-base
  • High-fidelity canary analysis

CONS:
  • Onerous configuration
  • Extended deployment timeframe
  • Disaster recovery can be complex

3. Choosing the Right Deployment Strategy

First and foremost, a deployment strategy should both align with and serve business goals. For example, a service owner may prioritize zero downtime when rolling out service updates because customers require access to medical information regardless of the time of day. Other considerations include cost, engineering skill set, available testing tools, and potential schema changes (Google SRE).

Upon reviewing the deployment methods described above, disagreement arose among the ChartHealth developer teams. The team responsible for the Stats service favored the big-bang approach. The team responsible for the Trends service felt that blue/green deployments were more suitable for the development and release cadence of their service. The medical records team, chiefly concerned with data integrity, rejected all deployment types except canary deployment.

A major benefit of adopting a microservices architecture is the decoupling that can be achieved between services: each service can be developed, implemented, and deployed independently of the others (Indrasiri). Thus, although the ChartHealth development teams disagree on which deployment technique to utilize, each can be satisfied by adopting the technique of its choosing.

Upon reviewing common modern deployments, the ChartHealth EMR service development team elected to adopt canary deployments. The deciding factor was that medical record data integrity is paramount: the team is first and foremost concerned with minimizing the risk that new service revisions introduce issues resulting in incorrect or missing patient medical data.

3.1 Canary Analysis

The most encompassing definition of canary deployment may be:

"deployment which consists of gradually shifting the network traffic [within the production environment] from the old application version to the new application version" (Çeliku 13)

This language is very broad but may be the only definition which satisfies the myriad implementations across companies. Some organizations implement simplistic deployments which are little more than automated traffic splitters (Mitreski). Others divide HTTP traffic such that individual user sessions are handled by the canary or production service revision exclusively (Tucker). Some implementations combine canary release with other deployment techniques such as Blue/Green (Ehmke).

While canary deployments can be divided along many axes, two groupings emerge from the historical progression of the technique (Çeliku 2; CD Foundation). No formal terminology exists for this variation, but the two groups can be referred to as manual canary analysis and automated canary analysis.

3.1.1 Manual Analysis

From its inception, canary deployment required human intervention at one or more stages of the process. This manual approach requires an engineer to review data representing the health of the deployment, which in turn requires tools that summarize canary performance metrics and/or event logs for human consumption. After review, the engineer determines whether to continue or halt the canary deployment. This 'manual canary analysis' approach was, until recently, the only method available.

3.1.2 Automated Analysis

Eventually, statistical techniques improved to the point that continuous delivery of canary deployments could be achieved without human intervention (Graff). While automating the mechanics of canary deployment had always been a matter of implementation, statistical techniques had previously lacked the fidelity to properly determine whether a canary deployment was operating at a tolerable level of risk.

Automated canary analysis utilizes an automated judge. The judge performs statistical analysis of canary performance metrics stored as time-series data. The judge computes an overall score based on a comparison of the metric data against specified priorities, weights, and thresholds. The overall score is then compared against a predefined threshold to determine if the canary instance is fit to replace the current service revision. The upside of automated canary analysis is a fully automated delivery pipeline.
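To make the judge's role concrete, the following is a minimal TypeScript sketch of the scoring idea described above. It is illustrative only, not Kayenta's actual implementation: each metric comparison is reduced to a weighted pass/fail, and the weighted pass fraction becomes the overall score.

    // Illustrative judge: not Kayenta's actual algorithm.
    interface MetricResult {
      name: string;   // metric identifier, e.g. request latency
      weight: number; // relative importance assigned by the user
      pass: boolean;  // did the canary compare acceptably to the baseline?
    }

    // The score is the weighted fraction of passing metrics, scaled to 0-100.
    function judgeCanary(
      results: MetricResult[],
      threshold: number,
    ): { score: number; healthy: boolean } {
      const totalWeight = results.reduce((sum, m) => sum + m.weight, 0);
      const passedWeight = results
        .filter((m) => m.pass)
        .reduce((sum, m) => sum + m.weight, 0);
      const score = (passedWeight / totalWeight) * 100;
      return { score, healthy: score >= threshold };
    }

    // Example: latency weighted higher than memory, with an 80-point threshold.
    judgeCanary(
      [
        { name: 'request_latency', weight: 3, pass: true },
        { name: 'memory_usage', weight: 1, pass: false },
      ],
      80,
    ); // => { score: 75, healthy: false }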

However, the ChartHealth EMR service developer team recognizes that automated canary analysis requires a high degree of trust in the automated system. Given that its priority is complete fidelity of medical records management, the team elects not to utilize automated canary analysis. Nevertheless, the team wants to leverage the advanced statistical techniques employed by automated canary analysis to more precisely understand the performance impacts of new revisions (Vamp).

3.2 Survey of Existing Solutions

Armed with a more detailed understanding of the type of canary deployment it is seeking, ChartHealth then takes inventory of the products and open source solutions which facilitate canary deployments.

ChartHealth discovers that the existing solutions can be divided into three categories: platform services, orchestrator plugins, and standalone solutions.

Platform services are solutions constructed from managed services offered by infrastructure providers. While the infrastructure provider abstracts away the low-level details of infrastructure creation and management, it is the responsibility of the user (in this instance, a ChartHealth engineer) to define and manage the infrastructure at a high level. The user composes and deploys the instances which house the canary service(s) and performance observability products. Of the three categories, platform services are the most demanding, requiring ChartHealth's EMR service developer team to have a DevOps skill set in addition to software engineering. Examples of canary deployments built on platform services include implementations by companies such as HolidayCheck (McAllister) and Klarna (Mitreski).

Orchestrator plugins are another category of solutions. These are third-party plugins, such as Serverless, Argo Rollouts, and Kong Canary Release, designed to integrate into control-plane orchestration products such as Kubernetes or Istio. The options for tailoring a canary deployment using plugin products are constrained by the limitations of the control plane itself, usually resulting in more primitive canary deployments. ChartHealth immediately recognizes that a major limitation of these solutions is the inherent requirement that the control plane product be utilized as well. Control plane products are high in overhead, especially considering that ChartHealth's microservice architecture is not as complex as the environments that control planes are meant to manage.

The standalone category refers to software-as-a-service (SaaS) or open-source platforms. These products can integrate with control plane orchestrators but can also perform canary deployments independently. Canary releases are not the focus of these products but rather one of many features they offer. Some, such as those offered by Octopus Deploy and Vamp, are paid SaaS solutions. The open-source solution in this category is Spinnaker. Like its SaaS equivalents, Spinnaker is a comprehensive product which offers many more features than canary deployments but requires a significant time investment to set up and maintain. Within this category, Spinnaker and Vamp are unique in that both offer automated canary analysis.

3.2.1 Evaluating Existing Solutions

After surveying available canary deployment solutions, the ChartHealth EMR service developer team has decided that none are quite the right fit for its needs. The team is not interested in platform services products because ChartHealth's software engineering organization lacks DevOps experience. ChartHealth does not stand alone in this regard: a DevOps survey conducted by Atlassian and CITE Research found that 46% of companies have three years or less of DevOps experience, and 37% of companies reported that the barrier to DevOps implementation is a lack of employee skills (Atlassian). For these reasons, ChartHealth does not believe its engineering staff is currently capable of managing a platform service solution.

The development team also decided against orchestrator plugin solutions due to the overhead of implementing a control plane product. ChartHealth's planned three-service architecture does not justify a control plane product more complex than the microservice architecture itself.

Ideally, the development team would choose to implement Spinnaker in order to further reduce deployment risk by utilizing the Kayenta automated canary analysis package. Spinnaker allows for configuration of canary deployments via a streamlined UI. However, similar to the orchestrator plugin solutions, the standalone solutions are too complex relative to the task of deploying the lone EMR service within a wider architecture.

While none of the existing solutions satisfy ChartHealth's EMR service development team, this is precisely the use case that Aria is meant to address.

4. Introducing Aria

Because the existing third-party solutions do not fit the needs of the ChartHealth engineering team, the team has chosen to use Aria, a canary deployment and infrastructure management tool.

4.1 What is Aria?

Aria is an open-source, self-hosted canary deployment tool that helps developers quickly manage the lifecycle of canary deployments. Aria focuses the user on application-specific configuration to achieve a high-fidelity automated analysis of deployment performance.

Aria provides a user-interface (UI) based application that allows engineers to easily configure, deploy, review the status of, and destroy canary deployment resources (described below) within their production or staging environment of choice. In its first iteration, Aria inserts the canary analysis infrastructure into Virtual Private Clouds (VPCs) provided by Amazon Web Services (AWS). More specifically, Aria inserts infrastructure such that network traffic is routed by an existing application load balancer (ALB) to the Aria-defined canary instances.

Each Aria-managed deployment is composed of three instances: canary, baseline, and monitor. A specified subset of incoming production traffic is redirected from the production infrastructure to the canary and baseline instances. The monitor instance contains tooling for both performance metrics and canary-specific analysis, enabling side-by-side comparison of the current production application and its new revision.

4.2 Where Aria Fits

When comparing Aria with alternatives such as existing third-party solutions and platform services, Aria stands out in multiple ways.

  • First, Aria shines by self-provisioning all the necessary resources needed to achieve advanced statistical analysis utilizing Kayenta. Furthermore, Aria deploys monitoring services Prometheus and Grafana, allowing users a clear view into the current state of the deployment.

  • Second, Aria was built with accessibility in mind. As opposed to other canary deployment tools that have an overwhelming amount of configuration and unclear documentation, Aria presents a simple single-page UI that facilitates deployment creation and configuration. The UI focuses on configuration that meaningfully impacts analysis and on targeting a select subset of the user base per deployment.

  • Third, Aria is open-source. Companies that want to experiment with canary deployments may not want to invest in third-party solutions which can be quite costly or cannot be customized.

Taken together, Aria's ease of use, zero cost, and abstraction of much of the complexity that comes with a canary deployment make it a perfect fit for a small company like ChartHealth that wants to get its feet wet with canary deployments and advanced canary analysis. Being able to create and destroy all of the resources necessary to deploy a canary with just a few clicks allows developers to explore canary deployment and analysis without investing excessive time and effort.

4.3 Why Choose Aria?

Deploying infrastructure and configuration in support of advanced statistical analysis of canary deployments is both tedious and complex. To implement this approach, developers must be concerned with traffic splitting rules: how to configure traffic weights and how to set up traffic monitoring and analysis. In addition to configuration, engineering teams must also provision the infrastructure associated with the deployment and set up observability tools in concert with that infrastructure. With Aria, infrastructure creation and tedious configuration are automated and abstracted behind an intuitive UI. Furthermore, if any issues arise with the canary instance, traffic can easily be removed from the canary, rolling back to the previous stable, pre-canary environment. With these tasks addressed by Aria, developers are more willing to try canary deployments and can benefit from Aria's data analysis features.

Aria was built specifically to lower the barrier for developers who want to try canary deployments and analysis tools. Aria's straightforward UI makes it simple to configure canary settings and meaningfully manage canary deployment infrastructure. If the new service revision performs poorly, rolling back takes just the click of a button.

4.4 Using Aria

Aria use can be broken down into three main UI interactions which correspond to the lifecycle of an Aria canary deployment. First, a user configures the infrastructure to be deployed to the accessible environment (VPC) of their choice. Once the application has been deployed, the user has the ability to view the status and health of individual Aria deployments. Lastly, with one click, the user can tear down the Aria canary deployment such that the initial state of the environment is restored.

4.4.1 Local Environment Setup

In order to interface with AWS services, Aria requires that the user's local development environment include the AWS CDK Toolkit, the AWS CLI, and the user's AWS account credentials. Aria is launched by invoking the npm start terminal command, which subsequently launches the UI and the backend API processes. Upon launching the Aria UI, the user is prompted to select one of the locally defined AWS profiles. Aria automatically fetches all relevant AWS resource data, Aria deployments, and status information accessible by the selected AWS profile.

4.4.2 Configuration & Creation

Animation showing Aria deployment creation.

The user is presented with an Aria deployment creation form, beginning with a title and description for the canary. Within the form, traffic splitting can be configured by selecting a traffic weight and rule priority. Optional settings, such as sticky sessions (which re-route users to the same target group on subsequent requests), can also be toggled on or off. The user can also specify under what conditions traffic will be forwarded to the new canary and baseline target groups. Finally, the user provides the Docker image and Docker-Compose files that specify how the Docker containers should be run on the new instances. The canary can then be deployed with one click.

4.4.3 Management

Once deployed, the user can view the statuses of the provisioned instances along with their event logs in a simple, streamlined dashboard. This information can be accessed via the 'Status' menu item in the sidebar. The displayed table conveys the state, status, and target health of all three instances provisioned for each Aria deployment. Each entry within the status table also contains a link to view the event logs of that instance.

Within the summary of each deployment, accessible via the 'Deployment' sidebar menu item, are links to the Prometheus, Grafana, Kayenta, and Referee monitoring tools. These tools are automatically configured and created by Aria when the canary is provisioned, and are automatically set up to monitor the canary and baseline instances.

4.4.4 Teardown

When the user is satisfied with the canary deployment progress, the canary, baseline, and monitoring instances can be torn down with one easy click of the ‘Destroy’ button. This button can be accessed within the deployment specific summary. After destroying an Aria deployment, the production environment is restored to its previous state wherein all incoming network traffic is routed to the current production service.

5. Technical Deep Dive

5.1 AWS CloudFormation, CDK, and SDK

Aria interacts with an existing AWS environment through the use of AWS CloudFormation (CFN), the AWS Cloud Development Kit (CDK), and the AWS Software Development Kit (SDK).

AWS CloudFormation is an infrastructure-as-code (IaC) tool allowing for the provisioning and configuration of AWS resource stacks based on developer-defined templates. The AWS CDK builds on CloudFormation by allowing users to express their desired stack programmatically in code, which can then be translated into a CloudFormation template. The template is then used to provision, deploy, and destroy the resources comprising an Aria deployment's canary infrastructure.
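The following is a minimal sketch of what a programmatic CDK stack looks like in TypeScript; the resource names and instance sizes are illustrative, not Aria's actual definitions. Because the stack is ordinary code, a loop can declare the canary, baseline, and monitor instances in a few lines:

    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Construct } from 'constructs';

    class AriaCanaryStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
        // Look up an existing VPC rather than creating a new one.
        const vpc = ec2.Vpc.fromLookup(this, 'ExistingVpc', { isDefault: true });
        // Programmatic definition: declare all three instances in a loop.
        for (const role of ['canary', 'baseline', 'monitor']) {
          new ec2.Instance(this, `${role}-instance`, {
            vpc,
            instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.MICRO),
            machineImage: ec2.MachineImage.latestAmazonLinux2(),
          });
        }
      }
    }

    const app = new cdk.App();
    new AriaCanaryStack(app, 'AriaCanaryStack', {
      // fromLookup needs a concrete account/region; the CDK CLI sets these vars.
      env: { account: process.env.CDK_DEFAULT_ACCOUNT, region: process.env.CDK_DEFAULT_REGION },
    });
    app.synth(); // produces the CloudFormation template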

While the CDK is specialized for deploying and destroying resources, there are some tasks for which it is not well-suited. The AWS SDK is a lower-level, more flexible tool for building applications that interact with a variety of AWS services. Aria uses the SDK for all AWS tasks which cannot be done easily (or at all) with the CDK. Examples of these tasks include fetching AWS profiles from a user's local data, fetching resource and configuration information associated with those profiles, and configuring an existing AWS Application Load Balancer to route traffic to canary infrastructure.
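As a hedged sketch of this kind of SDK usage, the snippet below uses the AWS SDK for JavaScript (v3) to list the application load balancers and listeners visible to the active credentials. This mirrors the discovery Aria performs, though the exact calls in Aria's codebase may differ:

    import {
      ElasticLoadBalancingV2Client,
      DescribeLoadBalancersCommand,
      DescribeListenersCommand,
    } from '@aws-sdk/client-elastic-load-balancing-v2';

    const elbv2 = new ElasticLoadBalancingV2Client({ region: 'us-east-1' });

    async function listAlbListeners(): Promise<void> {
      // Fetch every load balancer visible to the active credentials.
      const { LoadBalancers } = await elbv2.send(new DescribeLoadBalancersCommand({}));
      for (const lb of LoadBalancers ?? []) {
        if (lb.Type !== 'application') continue; // only ALBs matter to Aria
        const { Listeners } = await elbv2.send(
          new DescribeListenersCommand({ LoadBalancerArn: lb.LoadBalancerArn }),
        );
        console.log(lb.LoadBalancerName, Listeners?.map((l) => `${l.Protocol}:${l.Port}`));
      }
    }

    listAlbListeners().catch(console.error);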

5.2 Traffic Routing with AWS Application Load Balancers

A canary deployment requires that some amount of traffic be split between the production version of an application and the canary version. Traffic routing is typically accomplished via a load balancer, an application that receives network traffic and redistributes it equitably between a number of servers. Aria implements traffic routing by making use of the capabilities of an AWS Application Load Balancer (or ALB).

Aria is designed to work with AWS production environments where an existing ALB routes traffic to instances of a production application. The ALB routes traffic at the application layer and provides a rich feature set for routing HTTP/HTTPS traffic.

ALBs have one or more listeners, which are defined by a port and protocol, such as HTTP traffic on port 80. Each listener also has one or more rules. When a listener receives a piece of traffic, it applies each rule in order of priority. If a piece of traffic meets the condition to trigger a rule, a specific action (such as forwarding or redirecting the request) occurs.

Aria canary deployments create a new rule on an existing ALB listener. Because Aria assigns this rule a high priority, it supersedes the normal behavior of the listener. Traffic that triggers the Aria rule is forwarded to three target groups: one containing the production version of the application, one the baseline version, and one the canary version (the baseline and canary target groups receive equal amounts of traffic). Aria also configures a health check on a path specified by the user; this means that the canary infrastructure won't receive traffic until it is ready and responsive. Finally, when Aria destroys a canary deployment, it destroys the rule it placed on the ALB listener, which then resumes the routing behavior it exhibited prior to the Aria deployment.
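The sketch below shows how such a weighted forwarding rule can be created with the AWS SDK; the ARNs and weights are hypothetical, and Aria derives its own values from the user's environment and form input:

    import {
      ElasticLoadBalancingV2Client,
      CreateRuleCommand,
    } from '@aws-sdk/client-elastic-load-balancing-v2';

    const elbv2 = new ElasticLoadBalancingV2Client({});

    async function addCanaryRule(
      listenerArn: string,
      prodTg: string,     // ARN of the production target group
      baselineTg: string, // ARN of the baseline target group
      canaryTg: string,   // ARN of the canary target group
    ): Promise<void> {
      await elbv2.send(new CreateRuleCommand({
        ListenerArn: listenerArn,
        Priority: 1, // high priority, so this rule supersedes normal listener behavior
        Conditions: [{ Field: 'path-pattern', PathPatternConfig: { Values: ['/*'] } }],
        Actions: [{
          Type: 'forward',
          ForwardConfig: {
            // 90% stays on production; canary and baseline split the rest equally.
            TargetGroups: [
              { TargetGroupArn: prodTg, Weight: 90 },
              { TargetGroupArn: baselineTg, Weight: 5 },
              { TargetGroupArn: canaryTg, Weight: 5 },
            ],
            // Optional sticky sessions: a client stays with its target group.
            TargetGroupStickinessConfig: { Enabled: true, DurationSeconds: 3600 },
          },
        }],
      }));
    }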

The rules that Aria creates can be conditionally applied, so canary traffic can be limited by filters such as a particular HTTP request method, HTTP header value, or URL path pattern. Sticky sessions can also be implemented, ensuring that users from a given IP address will always return to the version of the application they were originally routed to; refreshing the browser will not suddenly take a user from the canary application to the production version.

5.3 Baseline and Canary Instances

Canary deployments require a new canary version of an application, but it is also considered good practice to deploy a baseline version of the application. The baseline application is identical to the production application, but the instances running it are created alongside the canary instances. This minimizes time-sensitive variance in performance between the old and new versions of the software: comparing a baseline and a canary results in a better analysis than comparing the existing production instances and the canary.

Aria deploys the baseline and canary applications to AWS EC2 instances. These instances are automatically configured with both Docker and Docker-Compose. For both the baseline and canary versions, the user prepares Docker image files (as tarball archives) and Docker-Compose files which are transferred to the instances and used to initialize the applications.

Aria also installs the application Node Exporter on each of these instances. Node Exporter is further described below.

5.4 Monitor Instance

The final resource deployed by Aria is the Monitor Instance. This EC2 instance holds a suite of monitoring and analysis tools to help the user investigate their canary application.

Like the baseline and canary instances, the monitor instance uses Docker and Docker-Compose: Aria installs the monitoring and analysis tooling as containers orchestrated via a Docker network, and each tool is immediately ready to use. The tools fall into two categories: monitoring and analysis. Monitoring tools collect, store, and display metrics data generated by the applications in the canary infrastructure. In contrast, analysis tools process those metrics to provide insight and recommendations regarding the canary.

5.5 Monitoring with Prometheus and Grafana

Prometheus is a popular open-source time-series database used to log metrics from targets. The Prometheus server operates on an HTTP pull model, scraping metrics from targets at configured intervals by making HTTP requests. These metrics are stored on disk as time-series data which may then be queried and visualized in the Prometheus front-end using the PromQL query language. Prometheus can identify targets through service discovery, and Aria automatically configures Prometheus so that it begins collecting metrics from the deployed baseline and canary EC2 instances immediately.
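As a brief illustration of what querying those metrics looks like, the sketch below calls Prometheus's HTTP query API with a PromQL expression. The monitor address is hypothetical; node_cpu_seconds_total is a standard Node Exporter metric:

    // Query Prometheus's HTTP API (Node 18+ provides the global fetch).
    const PROMETHEUS_URL = 'http://monitor-instance:9090'; // hypothetical address

    async function averageIdleCpu(instance: string): Promise<number> {
      // PromQL: per-second rate of idle CPU time over the last 5 minutes.
      const query = `rate(node_cpu_seconds_total{mode="idle",instance="${instance}"}[5m])`;
      const res = await fetch(
        `${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`,
      );
      const body = await res.json();
      // Response shape: { data: { result: [{ metric, value: [timestamp, value] }] } }
      const samples = body.data.result as Array<{ value: [number, string] }>;
      return samples.reduce((sum, s) => sum + parseFloat(s.value[1]), 0) / samples.length;
    }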

In addition to Prometheus, Aria also installs Grafana, a graphical interface and dashboard builder. An Aria user may configure Grafana to display rich visualizations of multiple queries, comparing the baseline and canary instances at a glance.

Prometheus collects metrics by pulling from HTTP endpoints. These endpoints can exist on two kinds of targets: an exporter, or an instrumented application. Aria supports both.

An exporter is a stand-alone application that collects and exposes metrics to be captured by a metrics data store. Aria automatically installs an exporter called Node Exporter on the baseline and canary instances; Node Exporter exposes the hardware and operating system metrics of Unix-based systems. The metrics for each of those instances are then captured and stored by Prometheus.

The other kind of Prometheus target is an instrumented application. Developers may incorporate a Prometheus client library into their application, calculating metrics and exposing them on an HTTP endpoint. For example, an instrumented web application may expose metrics about the time to service a request, the number of requests over a given time, or the rate of request failure. Aria can configure Prometheus to scrape the endpoints of an already instrumented application.
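The sketch below shows what such instrumentation might look like in a TypeScript service using the prom-client library with Express; both libraries are assumptions for illustration, as any Prometheus client library follows the same pattern:

    import express from 'express';
    import client from 'prom-client';

    const app = express();
    client.collectDefaultMetrics(); // standard process-level metrics

    // Custom counter: requests labeled by route and status code.
    const httpRequests = new client.Counter({
      name: 'http_requests_total',
      help: 'Total HTTP requests served',
      labelNames: ['route', 'status'],
    });

    app.get('/', (_req, res) => {
      httpRequests.inc({ route: '/', status: 200 });
      res.send('ok');
    });

    // The endpoint Prometheus scrapes at its configured interval.
    app.get('/metrics', async (_req, res) => {
      res.set('Content-Type', client.register.contentType);
      res.end(await client.register.metrics());
    });

    app.listen(3000);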

5.6 Analysis with Kayenta and Referee

For analysis, Aria offers Kayenta and its graphical front-end, Referee. Kayenta is a statistical judge which can be used to accurately determine the health of a canary application by comparing the metrics of the canary and baseline instances. While a human engineer may be able to judge a canary by examining the metrics directly, a statistical judge helps separate signal from noise, providing a more objective assessment.

Kayenta was originally created as a component of Spinnaker, a continuous delivery platform created by Netflix. While Spinnaker allows Kayenta to automatically judge and promote canaries to production without human intervention, Aria instead offers Kayenta as a supplemental tool for users who wish to explore advanced canary analysis.

Kayenta interacts with metrics provided by Prometheus. A user manually configures an analysis to examine one or more metrics over a given time period. They may also weigh the relative importance of those metrics (for example, a user may consider application memory usage more important than request response time). Kayenta performs the analysis and returns a numerical score and a verdict of healthy or unhealthy for the canary. Users may send requests to the Kayenta API, or make use of the Referee front-end.
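To make that configuration concrete, the following TypeScript shape illustrates the kind of input such an analysis takes. The field names are expository assumptions, not Kayenta's exact API schema:

    // Illustrative analysis request; field names are assumptions,
    // not Kayenta's exact schema.
    interface CanaryAnalysisRequest {
      startTimeIso: string; // beginning of the analysis window
      endTimeIso: string;   // end of the analysis window
      metrics: Array<{
        name: string;       // e.g. a Prometheus metric name
        weight: number;     // relative importance of this metric
      }>;
      scoreThresholds: { pass: number; marginal: number };
    }

    const request: CanaryAnalysisRequest = {
      startTimeIso: '2021-12-01T00:00:00Z',
      endTimeIso: '2021-12-01T01:00:00Z',
      metrics: [
        { name: 'node_memory_Active_bytes', weight: 2 }, // memory weighted higher
        { name: 'http_request_duration_seconds', weight: 1 },
      ],
      scoreThresholds: { pass: 95, marginal: 75 },
    };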

6. Engineering Decisions and Tradeoffs

The development team encountered several engineering challenges while building Aria. These challenges led to several impactful decisions which shaped Aria's capabilities and limitations.

6.1 Infrastructure-as-Code Tooling

6.1.1 IaC Landscape

As mentioned above, Aria utilizes infrastructure-as-code (IaC) tooling to manage its canary analysis infrastructure; specifically, Aria uses the AWS CDK to define its infrastructure as TypeScript classes. Most IaC tooling solutions define infrastructure in a templating language such as YAML or JSON. These tools are either first-party solutions offered by major cloud infrastructure providers (AWS, GCP, etc.) or third-party tools which utilize first-party APIs to manage infrastructure (Terraform, Pulumi, etc.). When researching tools, the Aria development team focused on three popular products: AWS CDK, Terraform, and Pulumi.

The choice of IaC tooling was significant for Aria because a majority of the Aria server code-base is dedicated to interfacing with, addressing the limitations of, or leveraging the benefits of the selected IaC tool.

6.1.2 Aria IaC Tooling Selection

One of Aria's initial design goals was to focus on AWS support because AWS is the leading infrastructure service provider at roughly 30% market share (Canalys). In light of this goal, the Aria team initially gravitated toward the CDK because it is natively supported by AWS. After further investigation, the team found that the biggest benefit of this native support is that AWS uses the CFN service to manage deployed resources: CFN implements its own persistent database for managing groups of deployed resources (i.e., stacks). This major benefit freed Aria from needing its own database to store deployment meta-information. While Pulumi does support a variety of programming languages, neither Pulumi nor Terraform provides an AWS CFN interface. Another major benefit unique to the CDK is that it allows for programmatic definition of resource stacks. This allows CDK resource definitions to utilize software mechanisms (loops, if/else statements, etc.), to be testable, and to leverage software versioning techniques.

However, selection of the AWS CDK as Aria's IaC tooling wasn't automatic. Compared to Pulumi and Terraform, a major downside of the AWS CDK is its lack of multi-platform support. Both Terraform and Pulumi support deploying to multiple cloud providers using a single infrastructure definition template. This would theoretically have allowed future iterations of Aria to add support for additional cloud providers such as GCP and Azure without modifying the Aria resource definition template. But these providers offer the Azure Resource Manager and Google Cloud Deployment Manager products, which allow groups of resources to be managed in a manner similar to AWS CFN stacks. Preferring to leverage such resource grouping services, the Aria team accepted the additional complexity of implementing native tooling for each provider individually over utilizing a single provider-agnostic tool.

Another major downside is that the AWS CDK is relatively immature. The CDK is the most recently introduced of all the tooling solutions considered, while Terraform is the oldest (introduced in 2014). A brief review of the AWS forums revealed that the CDK has many outstanding open issues, and the expectation was that Terraform would be a much more stable tool due to its longevity and popularity. However, based on the high level of community adoption, the Aria team expected that AWS would continue to support the CDK. The team also acknowledged that development would likely take longer than if Aria leveraged Terraform, due to open issues and sparse documentation. While the team was able to successfully develop Aria using the AWS CDK, development did indeed last longer than initially anticipated due to CDK limitations.

6.1.3 Addressing CDK Limitations

The AWS CDK provides little support for interfacing with and configuring existing infrastructure. While the CDK does implement classes which can represent existing resources, those resources are meant to be defined within a local stack template file. This aspect of the CDK was troublesome for the team because Aria requires interfacing with resources defined outside the CDK, such as private networks, load balancers, instances, and routing rules.

In order to address this CDK shortcoming, the Aria team developed a library of functions incorporating the AWS SDK. These functions focus on fetching and formatting existing resource information so that the CDK can deploy into existing private networks and route traffic to newly added Aria infrastructure.

Another limitation of the CDK is that individual CFN stacks are meant to correspond one-to-one with a single class declaration (i.e., a local template file). While this one-to-one relationship is supported by the CDK CLI commands, programmatic reuse of the class declaration isn't supported out of the box. Bypassing the CDK CLI, Aria implements a parent "wrapper" class which exposes CDK core library methods and defines additional supplemental methods. This wrapper class allows the class defining the Aria stack to be reused without invoking the CLI commands, either as a child process or via the terminal. The supplemental methods allow the CDK to escape the one-to-one template-to-stack relationship by re-synthesizing the local template file for each stack deployment and destruction as needed. In other words, Aria overcomes the CDK constraints by defining a group of resources within a class which can be reused for each Aria deployment.
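A simplified sketch of the wrapper idea follows: synthesize a fresh CloudFormation template from a reusable stack class, then hand it to the CloudFormation SDK rather than the CDK CLI. The names are illustrative (AriaCanaryStack is the reusable class sketched in section 5.1), and real stacks with file assets require an extra asset-publishing step this sketch omits:

    import * as cdk from 'aws-cdk-lib';
    import {
      CloudFormationClient,
      CreateStackCommand,
      DeleteStackCommand,
    } from '@aws-sdk/client-cloudformation';

    const cfn = new CloudFormationClient({});

    async function deployStack(stackName: string): Promise<void> {
      // Re-synthesize a template for this deployment, escaping the
      // one-to-one template-to-stack relationship of the CDK CLI.
      const app = new cdk.App();
      new AriaCanaryStack(app, stackName); // reusable stack class from section 5.1
      const template = app.synth().getStackByName(stackName).template;
      await cfn.send(new CreateStackCommand({
        StackName: stackName,
        TemplateBody: JSON.stringify(template),
      }));
    }

    async function destroyStack(stackName: string): Promise<void> {
      await cfn.send(new DeleteStackCommand({ StackName: stackName }));
    }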

6.2 Aria Service Deployment Pipeline

As mentioned above, Aria utilizes the CDK to implement a custom service (or application) deployment pipeline. From the outside looking in, this may seem a confusing design choice: AWS offers multiple services which can provision infrastructure, configure monitoring, and automatically install applications on that infrastructure. However, due to the following factors and upstream design choices, the Aria development team chose a DIY approach.

6.2.1 Metrics Exporter Constraint

In order to collect instance-level performance metrics (CPU utilization, memory usage, etc.) representing the saturation category of the four golden signals (Google SRE), there are essentially two collection approaches that support Aria's use case: utilize AWS CloudWatch, or instrument the instances with a metrics exporter. In many respects, AWS CloudWatch would be superior to a third-party exporter: CloudWatch is a flexible service which allows on-the-fly enablement of metrics collection, supports access (IAM) roles directly, and provides metric storage by default. However, Kayenta, the canary analysis tool, doesn't provide an interface to the CloudWatch data store. Given that Aria chose Prometheus for metrics data storage, both EC2 instances would need a CloudWatch agent installed anyway.

Furthermore, while CloudWatch is a more fully featured service than third-party exporters, many third-party exporters expose an order of magnitude more instance-level metrics than CloudWatch does out of the box. The biggest downside to third-party exporters is that each EC2 instance must be instrumented with an exporter; but in the context of Aria's use case, the CloudWatch agent would also have to be installed on a per-instance basis, which negates CloudWatch's advantage. Thus, because third-party exporters collect and export a larger number of metrics, providing the end user more analysis options, Aria opted not to utilize AWS CloudWatch. The instance deployment method then needed to accommodate a third-party exporter in addition to installation of the service process itself.

6.2.2 Process(es) Isolation Constraint

Utilization of a third-party metrics exporter exposed another constraint applicable to service deployment: isolation. In addition to installing the third-party exporter on the service instance, the development team initially planned to utilize a reverse-proxy server to export HTTP metrics. Given that both of these tools require installation on the EC2 instance alongside the service itself, the team's primary concern was that conflicts could arise between the installed processes (e.g., directory naming, ports). To avoid this potential issue, the team decided it best to utilize containers on the service instance(s).

The Aria development team recognized that utilizing containers brings its own limitations and complications. For example, most container services build containers from images. What format of image would Aria support? AWS AMIs? Docker images? On what platform would the image be stored (e.g. AWS ECR, Docker Hub, GitLab, etc) and would that platform be privately or publicly accessible? In order to make Aria as accessible as possible for the user, the team decided that the user should package and provide their service code as an image file.

6.2.3 Applicable Deployment Services

AWS provides multiple services which can handle the deployment of user applications, including Elastic Beanstalk, CodeDeploy, Elastic Container Service (ECS), and Amazon Machine Images (AMIs). When applying the constraints listed above, none of these services quite meet all of Aria's needs. While Elastic Beanstalk is very flexible and fully featured, it expects container images to be hosted by a storage service and doesn't support local images. The major downside of CodeDeploy, from Aria's perspective, is its limited ability to deploy new infrastructure (like autoscaling groups) on which the canary services and analysis tooling execute. The major downside of ECS is that, while multiple containers can be created on the same instance, the service doesn't allow for programs (such as the metrics exporter) to be installed at the EC2 instance level. In light of these constraints, the Aria development team utilized CDK features to construct a custom deployment pipeline. While this approach was the most difficult to implement, it offered the most flexible solution.

6.2.4 Custom Deployment Pipeline

The Aria development team chose Docker as the container technology for the custom deployment pipeline because, to date, it is the industry standard for containerized applications. The natural choice would be to implement bash scripts such that the EC2 instances could pull container images stored in a public registry. However, there are many different registries which support Docker images, and a user's images may be private or not stored in a registry at all.

To address this issue, the Aria deployment pipeline is architected such that users provide a tar archive created from a Docker image. Any Docker image can be converted into a local tar archive by invoking the docker save terminal command. Once a tar archive is transferred onto the EC2 instance, it is converted back into a Docker image and a container is created from that image. However, while the CDK provides the functionality to execute a user script immediately following EC2 instance deployment, there isn't native functionality for transferring files onto the instance.

This shortcoming of the CDK was addressed by uploading the user's tar files as assets to a temporary AWS S3 storage bucket. The EC2 instances are then configured to download those tar file assets from the S3 bucket immediately after deployment, convert them back into Docker images, and run containers from them.
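Real CDK constructs exist for exactly this pattern. The sketch below, with illustrative paths and names, uploads the tarball as an S3 asset and writes the download and docker load steps into the instance's user data:

    import * as cdk from 'aws-cdk-lib';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import { Asset } from 'aws-cdk-lib/aws-s3-assets';
    import { Construct } from 'constructs';

    class CanaryInstanceStack extends cdk.Stack {
      constructor(scope: Construct, id: string, props?: cdk.StackProps) {
        super(scope, id, props);
        const vpc = ec2.Vpc.fromLookup(this, 'Vpc', { isDefault: true });
        const instance = new ec2.Instance(this, 'CanaryInstance', {
          vpc,
          instanceType: ec2.InstanceType.of(ec2.InstanceClass.T3, ec2.InstanceSize.SMALL),
          machineImage: ec2.MachineImage.latestAmazonLinux2(),
        });

        // The tarball produced by `docker save`, uploaded as a CDK asset.
        const imageAsset = new Asset(this, 'CanaryImageTar', {
          path: './canary-image.tar', // illustrative local path
        });
        imageAsset.grantRead(instance.role);

        // After boot: download the archive, restore the image, start the app.
        const localPath = instance.userData.addS3DownloadCommand({
          bucket: imageAsset.bucket,
          bucketKey: imageAsset.s3ObjectKey,
          localFile: '/tmp/canary-image.tar',
        });
        instance.userData.addCommands(
          `docker load --input ${localPath}`,
          'docker compose -f /opt/aria/docker-compose.yml up -d', // illustrative
        );
      }
    }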

7. Future Work

Going forward, there are features and functionality that we, Aria's development team, would like Aria to support. Below is a sampling of the pending features.

7.1 Modify Existing Routing Rules

Aria's current implementation allows users to configure the initial traffic weights of the canary instances prior to Aria infrastructure deployment. However, in order to provide even more flexibility, we would like to provide users the ability to modify the traffic routing rule(s) for existing Aria deployments.

7.2 AWS API Gateway Support

Currently, Aria deployments can only be initiated if the user's existing infrastructure implements an application load balancer. However, many service-based architectures utilize the AWS API Gateway service to route traffic. We would like to extend Aria to support canary deployments where the user's architecture utilizes an API Gateway.

7.3 GCP & Azure Support

While AWS is currently the largest of the big three cloud computing providers, Microsoft Azure and Google Cloud still command a sizable share of the market. The next major step would be for Aria to support deployments to Microsoft Azure and Google Cloud.

8. References

  1. Atlassian. "Atlassian Survey 2020 - DevOps Trends." Atlassian, 2020. Accessed 8 December 2021.
  2. Canalys. "Global cloud services spend hits record US$49.4 billion in Q3 2021." Canalys, 28 October 2021. Accessed 26 December 2021.
  3. CD Foundation. "Case Study: How Avast Scaled its Continuous Delivery with Spinnaker." CD Foundation, October 2019. Accessed 8 December 2021.
  4. Çeliku, Lea. Towards Continuity-as-Code. MS thesis. University of Oslo, 2021.
  5. Ehmke, Martin. "How We Automated Canary Analysis for Deployments." Adobe Tech Blog, Medium, 19 November 2019. Accessed 8 December 2021.
  6. Fowler, Martin. "BlueGreenDeployment." martinfowler.com, 1 March 2010. Accessed 2 December 2021.
  7. Google SRE. "Application deployment and testing strategies." Cloud Architecture Center, Google, 5 February 2020. Accessed 8 December 2021.
  8. Graff, Michael. "Automated Canary Analysis at Netflix with Kayenta." The Netflix Tech Blog, 10 April 2018. Accessed 8 December 2021.
  9. Indrasiri, Kasun. "Microservices in Practice." Kasun's Blog, 28 November 2015. Accessed 8 December 2021.
  10. McAllister, Neil. "Canary Releases with Traefik on GKE at HolidayCheck." Traefik Labs, 21 May 2019. Accessed 8 December 2021.
  11. Mitreski, Mite. "Simple canary releases in AWS: how and why?" Klarna Engineering, 6 February 2017. Accessed 8 December 2021.
  12. Sato, Danilo. "CanaryRelease." martinfowler.com, 25 June 2014. Accessed 2 December 2021.
  13. Vamp. "Vamp.io Introduces Research Report: The 2021 State of Cloud-Native Release Orchestration." The Vamp Blog, 26 January 2021. Accessed 2 December 2021.
  14. Lamm, Dan. "Continuous Delivery Insights 2020." Harness, harness.io. Accessed 9 December 2021.

9. Presentation

10. Meet the Team

Aria was built by a small team of dedicated individuals.

Sam Graham

Yue Yan

Caleb Heath

Tzvi Hamerman