State of AI applied to Quality Engineering 2021-22
Section 8: Operate

Chapter 1 by Dynatrace

The Rise of Intelligent Observability in Software Development

Business ●○○○○
Technical ●●●●○

When AI is applied to operations, quality is continuously monitored and precise root causes are identified, allowing resolution to occur in minutes, prior to adverse effects on outcomes. Accurate responses are instantaneous, automatic, and continuous.

The business motivation for modernizing software architectures, upgrading delivery models and re-platforming towards multi-cloud container orchestration platforms is often driven by the following four business goals:

  • Increasing Deployment Frequency: to enable continuous value stream delivery
  • Reducing Lead Time for Change: to enable faster time-to-market
  • Reducing Change Failure Rate: to heighten user satisfaction
  • Reducing Time to Restore Service: to increase trust

In short: Better Value, Sooner, Safer, Happier!
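These four metrics come from the DORA research program and can be computed directly from deployment and incident records. A minimal sketch in Python, using made-up records for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: (timestamp, caused_failure, commit-to-deploy lead time)
deployments = [
    (datetime(2021, 11, 1), False, timedelta(hours=4)),
    (datetime(2021, 11, 2), True,  timedelta(hours=6)),
    (datetime(2021, 11, 3), False, timedelta(hours=3)),
    (datetime(2021, 11, 4), False, timedelta(hours=5)),
]
# Hypothetical incidents: (detected, service restored)
incidents = [(datetime(2021, 11, 2, 10), datetime(2021, 11, 2, 10, 45))]

days_observed = 4
deployment_frequency = len(deployments) / days_observed            # deploys per day
lead_time = sum((d[2] for d in deployments), timedelta()) / len(deployments)
change_failure_rate = sum(d[1] for d in deployments) / len(deployments)
time_to_restore = sum(((r - d) for d, r in incidents), timedelta()) / len(incidents)

print(deployment_frequency, lead_time, change_failure_rate, time_to_restore)
```

Tracking these four numbers over time is what makes the business goals above measurable.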

According to the 2021 DevOps Report, 71% of global senior-level DevOps leaders say that a unified platform that seamlessly integrates their toolchains will be critical to scaling DevOps beyond a single lighthouse project. The report thereby confirms the emergence of “Internal Platform teams” combining DevOps practices, which speed up the value stream through delivery automation, with SRE (Site Reliability Engineering) practices, which ensure resiliency through operations automation. These platform teams then make their platforms available to engineers through a self-service model. The report also highlights that the successful future of DevOps and SRE practices lies in increased automation and intelligent end-to-end cloud observability: 79% of respondents say extending AIOps beyond traditional use cases is among their top priorities!

AIOps has been the hope of many organizations for reducing the manual work needed to detect issues, find their root cause and roll out a fix. While AIOps solutions have clearly improved since the term was coined back in 2016, the survey numbers indicate that AIOps solutions are either not as widely adopted as hoped, or not delivering on that promise.

The problem with Gen 1 AIOps

Early AIOps solutions were heavily focused on collecting logs, metrics, events, and even distributed traces. They did this by ingesting data from different disconnected data sources, then using machine learning algorithms to learn about behavior and the potential correlation between these data points. This early phase clearly improved machine learning algorithms and boosted big data analytics. It also worked reasonably well for the production workloads of the time, since they were more static in nature: deployment and configuration changes were counted in a handful of releases per year. AIOps solutions could point out whether the deployment of a new major release of your ERP or back-office collaboration system resulted in a higher failure rate than before the update. Automatically analyzing logs made it possible to highlight root causes, which clearly helped ITOps fix problems faster.

Today we no longer count deployments per year but per day. We are moving towards progressive delivery models where we do not replace a whole system with a new version, but gradually deploy new versions of individual services. Businesses are also increasingly testing new features or changes by releasing to a subset of users first, collecting feedback, and then deciding whether to release to the whole user population. The Gen 1 AIOps approach was simply not made for this behavior. There is not enough time for the ML algorithms to learn “what's normal”. It is getting harder to correlate logs, metrics, events and traces, as more services are involved that all expose their data in different ways. AIOps has also always focused on solving operations problems by looking at production only, whereas changes could already have been detected as part of the delivery process, before entering production.

As Delivery and Operations are adapting to the new norm, AIOps must adapt as well in order to help DevOps and SRE alike in delivering Better Value, Sooner, Safer, Happier!

AIOps Done Right: Integrated into your processes and platforms

The solution to this dilemma, however, is not just to update to a better version of your AIOps tooling. The solution is to make AIOps part of your development, testing, continuous delivery, DevOps & SRE practices.

Figure: Embedded AIOps

 

Embedding AIOps into your internal platforms ensures that AIOps automatically learns about intentional and unintentional change in behavior as part of continuous integration & delivery. AIOps can also be leveraged in performance and chaos engineering to strengthen and validate its anomaly detection. This will give you the confidence to let AIOps drive your auto-remediation in production as you have already tested and validated its automated detection of root causes in chaotic situations.

Let us cover three use cases in more detail.

Use Case #1 AIOps to scale and improve Delivery Automation

Most organizations have started automating their delivery with tools such as Jenkins, GitLab Pipelines, Azure DevOps or others. Integrating AIOps as part of delivery will result in better quality of code making it into production, as well as an increase in the throughput of delivery pipelines. There are two key integrations of your delivery automation with AIOps solutions to make this work:

  1. Push: Inform AIOps about any deployment or configuration changes
  2. Pull: Leverage AIOps data for data-driven delivery decisions

1: Push Deployment Information to AIOps

Most AIOps solutions provide either an Event API or they can extract deployment events from logs. One requirement is that the event should be “attached” to the monitored entity, e.g.: the process, container, host or service. Otherwise, it’s a disconnected piece of information that is hard to analyze and correlate by the AIOps solution.

Some examples of those events are:

  1. Deployment of version X.Y of service website in environment testing
  2. Load Test 123 against application online-banking in environment staging
  3. Load Balancer: switching traffic from deployment Blue to Green in environment production
  4. Restart services of application online-banking in environment staging

This context information allows the AIOps solution to connect a potential change in behavior to that executed action. If there is a negative impact on end users or your SLAs (Service Level Agreements), the AIOps solution can immediately alert about it and provide the deployment change as a potential root cause, as seen in the following screenshot:

Figure: AIOps solution alerts
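Pushing such an event typically amounts to a single authenticated POST from the pipeline. The endpoint, token scheme and payload schema below are hypothetical, since they vary by vendor; what matters is the attachment of the event to a monitored entity:

```python
import json
import urllib.request

def build_deployment_event(entity_tag, version, environment, ci_link):
    """Build a deployment event payload attached to a monitored entity.

    The schema is vendor-specific; this assumes a hypothetical Event API
    that accepts JSON with entity-attachment rules.
    """
    return {
        "eventType": "DEPLOYMENT",
        "attachRules": {"tags": [entity_tag]},   # attach to the monitored service
        "deploymentVersion": version,
        "environment": environment,
        "ciBackLink": ci_link,                   # link back to the pipeline run
    }

def push_event(api_url, token, payload):
    """POST the event to the AIOps Event API (hypothetical endpoint)."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Api-Token {token}",
                 "Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

event = build_deployment_event("service:website", "1.4", "testing",
                               "https://ci.example.com/builds/123")
print(json.dumps(event))
```

A delivery pipeline would call `push_event` right after each deployment step, so the AIOps solution sees the change the moment it happens.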

Another common use case can be seen in the next screenshot, where KBI is pushing information about their automated load tests from their delivery pipeline to their AIOps solution:

Figure: KBI Load Tests

In the example above, AIOps was made aware of a specific load test that was executed. This allows AIOps to provide a “Performance Hotspot Analysis” for the time range of the test, and to send potential alerts directly to the test engineers or even to the test tool while the test is still running. Furthermore, it enables AIOps solutions to compare test runs and provide automated regression analysis between them, as seen in the image below.

Figure: Automated regression analysis
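At its core, such a run-to-run regression analysis compares the current run's metrics against a baseline run and flags significant degradations. A minimal sketch with made-up response times:

```python
# Hypothetical response-time medians (ms) per endpoint, baseline vs. current run
baseline = {"/login": 120, "/search": 310, "/checkout": 450}
current  = {"/login": 125, "/search": 298, "/checkout": 540}

THRESHOLD = 0.10  # flag endpoints that regressed by more than 10%

regressions = {
    endpoint: (baseline[endpoint], current[endpoint])
    for endpoint in baseline
    if current[endpoint] > baseline[endpoint] * (1 + THRESHOLD)
}
print(regressions)  # → {'/checkout': (450, 540)}
```

Real AIOps solutions compare far richer signals (percentiles, error rates, resource metrics), but the decision logic follows the same pattern.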

In summary: Pushing events from your delivery automation gives your AIOps solution more context about the data it analyses, which allows it to provide better automated answers to quality-related questions.

 

2: Pull AIOps for better Delivery Decisions

Once AIOps is aware of delivery activities, we can leverage that data to make better delivery decisions. AIOps solutions not only surface the data in their own dashboards, but typically provide an API to extract quality data about an individual release or test run. They typically also provide an option to compare individual test results, or baseline results across multiple test iterations and deployments, to detect regressions. The latter use case has seen several implementations in both the commercial and the open-source space. Keptn, a CNCF project, is one of those open-source projects that integrates with AIOps as well as performance and chaos testing solutions. Keptn uses an open standard protocol for tool integrations and leverages the widely adopted concept of SLOs (Service Level Objectives) to evaluate data from the various tools, calculating an overall SLO quality score to support delivery decisions. In a nutshell: it automates the otherwise manual task of looking at test reports or AIOps dashboards and comes up with an easy-to-understand result.

The following visualization shows how Keptn works:

Figure: Keptn
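Conceptually, the SLO evaluation grades each SLI, weights the grades, and maps the total score to a pass/warn/fail decision. The sketch below illustrates that idea with invented metrics and thresholds; it is not Keptn's actual file format or API:

```python
def grade(value, pass_limit, warn_limit):
    """Grade one SLI; for these example metrics, lower is better."""
    if value <= pass_limit:
        return 1.0    # pass: full credit
    if value <= warn_limit:
        return 0.5    # warning: half credit
    return 0.0        # fail: no credit

slos = [
    # (sli_name, measured_value, pass_limit, warn_limit, weight)
    ("response_time_p95_ms", 480.0, 500.0, 600.0, 2),
    ("error_rate_pct",         1.2,   1.0,   2.0, 1),
    ("db_time_ms",            45.0,  50.0,  80.0, 1),
]

earned = sum(grade(v, p, w) * wt for _, v, p, w, wt in slos)
total = sum(wt for *_, wt in slos)
score = 100 * earned / total
decision = "pass" if score >= 90 else "warn" if score >= 75 else "fail"
print(score, decision)  # → 87.5 warn
```

A pipeline can then promote, hold, or roll back a release based purely on `decision`, instead of a human reading dashboards.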

 

Christian Heckelmann, DevOps Engineer at ERT, integrated Keptn into their GitLab delivery pipeline to automate the analysis of test and AIOps data and, based on that single SLO score, to automate rollout decisions. This not only speeds up delivery, but also ensures that problems AIOps has identified during the delivery process will stop the rollout of the current pipeline changes:

Figure: Keptn's SLIs and SLOs

 

In summary: Including AIOps data as part of your delivery automation allows you to make better and faster rollout decisions, as AIOps not only provides individual metrics but also automatically analyses problems and root causes. Integrating AIOps into your delivery will ensure that better code makes it into production and that developers get better and faster root cause feedback on potential issues.

Use Case #2 AIOps to Ensure Operational Resiliency

One indicator of production quality is the resiliency of your IT system to change. How does the system react to changes in load or user behavior? How does it react if one component of a system is faulty after an upgrade? How does it react if a dependent system is not available?

The emerging role of the Site Reliability Engineer (SRE) is focused on ensuring the reliability and resiliency of IT systems. SREs try to automate formerly manual operational tasks to ensure the continuous resiliency, availability, and health of your systems. AIOps solutions – when integrated into delivery and operations process automation – can support SREs in a big way with this task.

In our first Use Case we showed how to integrate AIOps with your delivery automation. This information allows AIOps solutions to better detect root causes when a system behaves abnormally. If there is an ongoing load test in production, the AIOps solution can alert those teams in case they are affecting overall system health. If the latest rollout of a new version of a critical backend service is causing a high failure rate, the AIOps solution can alert the responsible application teams about the issue, giving them the root cause and detailed impact information.

The overall goal of SREs must be to automate as many remediation steps as possible. A great way to get started is by leveraging the detailed root cause and problem evaluation information of AIOps solutions, as seen in the following screenshot:

Figure: Problem evaluation information of AIOps solutions

 

These AIOps details are great for postmortems: analyzing where the “critical path” of resiliency in the system lies, what actions happened before the system became unstable, and what actions solved the problem. This information can be communicated back to architects and engineers to build more resiliency into the system itself.

On the other hand, this information can be used to start automating runbooks, so that the next time a problem like this happens, the remediation is done automatically. This allows for shorter system downtimes. A complete workflow could look something like this:

Figure: Workflow
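The dispatch step of such a workflow can be sketched as a mapping from the root cause reported by the AIOps solution to a runbook action. The problem types and actions below are invented for illustration:

```python
# Each runbook is a function taking the problem details reported by AIOps.
def restart_service(problem):
    return f"restarted {problem['entity']}"

def scale_out(problem):
    return f"scaled out {problem['entity']} by one instance"

def rollback(problem):
    return f"rolled back {problem['entity']} to previous version"

# Map detected root causes to runbook actions (illustrative names).
RUNBOOKS = {
    "process_crash": restart_service,
    "cpu_saturation": scale_out,
    "failure_rate_increase_after_deployment": rollback,
}

def remediate(problem):
    """Run the runbook matching the root cause; escalate if none matches."""
    action = RUNBOOKS.get(problem["rootCause"])
    if action is None:
        return f"escalated to on-call: {problem['rootCause']}"
    return action(problem)

print(remediate({"rootCause": "cpu_saturation", "entity": "checkout-service"}))
```

The key point is that the trigger comes from the AIOps problem notification, with its root cause and affected entity, not from a human reading an alert.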

 

In summary: AIOps solutions can help SREs build better automation in operations to ensure higher resiliency and availability of their systems. When AIOps is integrated right, it also results in better targeted notifications to groups that really need to act or need to be aware of issues.

Use Case #3 Shift-Left AIOps to enable Test Driven Operations

In the previous Use Case, we talked about a reactive approach to automating operational tasks: you wait for a problem, you learn about the root cause, and then you either try to make sure it doesn't happen again or you build automation to remediate it the next time it comes around.

This approach is comparable to engineers only writing tests for their code once users complain about features not working as expected, and not before. Fortunately, most engineering teams nowadays practice test-driven development, where the tests are created first and code can only pass if the tests are green after every change. AIOps can help us achieve something similar, which I like to call “Test Driven Operations”.

SREs know how systems should behave in production under different workloads and conditions. SLOs are commonly used to validate and report behavior in production and auto-remediation scripts are used to keep the system available.

The more proactive, “Shift-Left” approach is to test the resiliency and auto-remediation scripts before entering production. Auto-remediation scripts are also code, and that code should be treated like business code. This means there must be a clear definition of what the auto-remediation code does, e.g. restarting a service. There must also be a way to validate the intended outcome, e.g. the restart happens within a maximum allowed downtime of 10 seconds. Finally, that code must be triggered by the AIOps solution in case there is a problem with that service – so we also need to test and validate whether the AIOps solution can pick up problems in our “Test Driven Operations” environment.
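Validating that outcome is itself a testable assertion. A sketch, using simulated health-probe samples rather than a real chaos tool; the 10-second limit matches the example above:

```python
from datetime import datetime, timedelta

MAX_DOWNTIME = timedelta(seconds=10)  # the intended outcome of the remediation

# Simulated health-probe results during the chaos experiment: (timestamp, healthy)
probes = [
    (datetime(2021, 11, 5, 12, 0, 0), True),
    (datetime(2021, 11, 5, 12, 0, 2), False),   # chaos: service killed
    (datetime(2021, 11, 5, 12, 0, 8), False),
    (datetime(2021, 11, 5, 12, 0, 9), True),    # auto-remediation restarted it
]

def measure_downtime(probes):
    """Length of the longest contiguous unhealthy window."""
    worst = timedelta()
    down_since = None
    for ts, healthy in probes:
        if not healthy and down_since is None:
            down_since = ts
        elif healthy and down_since is not None:
            worst = max(worst, ts - down_since)
            down_since = None
    return worst

downtime = measure_downtime(probes)
print(downtime, "PASS" if downtime <= MAX_DOWNTIME else "FAIL")
```

In a real setup the probe data would come from the AIOps solution's availability metrics, and the assertion would gate promotion of both the service and its remediation script.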

In practical terms, this approach can be implemented using a combination of performance and chaos engineering tools in a pre-production environment that is also monitored by an AIOps solution. Keptn, the open-source project referenced earlier, supports this use case: it orchestrates the execution of load tests, injects chaos, executes and validates auto-remediation scripts, and validates the desired outcome through the AIOps integration.

Figure: AIOps integration

 

In summary: AIOps is great for ensuring healthy systems in production, but its true power comes from integrating it into the engineering process. Just as Test-Driven Development has led to better quality code, Test-Driven Operations will lead to better and more stable production systems.

Summary: How you can leverage AIOps Right

AIOps has come a long way, but the journey is not over yet. The next step is to go beyond supporting IT Operations with better alerting on production issues. AIOps has to Shift-Left and be integrated into delivery automation. AIOps has to change the way engineers and architects design and build their code. AIOps has the chance to be to SRE what Test-Driven Development was to the Agile movement: it enables SREs to start with Test Driven Operations, which ensures code is not only functionally correct, but also meets all resiliency criteria for production.

About the author

Andreas Grabner

Andreas Grabner has 20+ years of experience as a software developer, tester and architect and is an advocate for high-performing cloud-scale applications. He is a regular contributor to the DevOps community, a contributor to the CNCF Keptn project, a frequent speaker at technology conferences, and regularly publishes articles on medium.com and blog.dynatrace.com.

About Dynatrace

Dynatrace provides software intelligence to simplify cloud complexity and accelerate digital transformation. With automatic and intelligent observability at scale, our all-in-one platform delivers precise answers about the performance of applications, the underlying infrastructure and the experience of all users to enable organizations to innovate faster, collaborate more efficiently, and deliver more value with dramatically less effort. That’s why many of the world’s largest enterprises trust Dynatrace® to modernize and automate cloud operations, release better software faster, and deliver unrivaled digital experiences.

https://www.dynatrace.com

 

 

 
