State of AI applied to Quality Engineering 2021-22
Section 3.1: Inform & Measure

Chapter 5 by Digital.ai

Predicting Change Risk with Reliability

Business ●●●○○
Technical ●●○○○

Change risk assessment no longer has to require lengthy, tedious Change Approval Board meetings involving dozens of senior people. Instead, you can use AI, specifically Machine Learning, to mine the wisdom and experience captured in your historical data, automating the process of continually uncovering risk-factor insights and applying them to mitigate risk and prevent change failure.

Legacy change risk assessment and approval processes often rely on outdated rules and large committees to assess change risk and make approval decisions. Organizations face great difficulty scaling these processes to meet the demand for rapid change. This results in the following negative outcomes:

  1. High-risk changes are allowed to deploy, causing business-impacting incidents and outages.

  2. Many low-risk changes are manually assessed for risk, which delays their deployment (and therefore their benefits) and wastes the time of talented people.

AI solutions can solve these challenges by automatically assessing change risk and predicting the likelihood that changes to production will fail or succeed. They do this by mining a wide variety of historical risk factors related to change size and complexity, technology maturity, team maturity, development practices, and quality engineering practices among others. This enables organizations to expedite the flow of changes that are actually low risk, while flagging the few truly high-risk changes for review and targeted risk mitigation.
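The expedite-or-flag decision described above can be pictured as a simple threshold on a predicted failure probability. The following is a minimal sketch under stated assumptions, not the Digital.ai implementation: the `Change` class, the `route_change` function, the change IDs, and the 0.30 cut-off are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical cut-off; a real deployment would tune this against its own data.
REVIEW_THRESHOLD = 0.30


@dataclass
class Change:
    change_id: str
    failure_probability: float  # assumed output of a trained model, in [0, 1]


def route_change(change: Change, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return 'review' for high-risk changes and 'expedite' for the rest."""
    if not 0.0 <= change.failure_probability <= 1.0:
        raise ValueError("failure_probability must be between 0 and 1")
    return "review" if change.failure_probability >= threshold else "expedite"


# Made-up queue of pending changes with made-up model scores.
queue = [Change("CHG-1001", 0.04), Change("CHG-1002", 0.62), Change("CHG-1003", 0.11)]
routing = {c.change_id: route_change(c) for c in queue}
print(routing)
```

In this sketch only CHG-1002 is sent for review; the other two flow straight through. Where to set the threshold is a business decision that trades review workload against tolerated failure risk.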

Successfully applying AI to predict change risk requires the following capabilities:

  • Mining historical data from a variety of sources to calculate dozens of measurements of change risk factors
  • Using Machine Learning to find the patterns in the historical risk-factor data that distinguish failed changes from successful changes
  • Making the AI insights accessible to enable targeted, informed preventive action both at the holistic process/organization level and at the individual change level

These approaches are implemented in the Digital.ai Change Risk Prediction solution. 

Mining historical data for change risk factors

There are several categories of change risk factors and numerous ways to measure them, as illustrated in the table below.

 

Types of change risk factors and associated metrics

Risk factor type    Example metrics
Size                Requirements, tasks, artifacts, effort
Complexity          Integrations, conflicts, dependencies, participants
Quality             Peer reviews, code scans, test coverage and results, defects
Maturity            Track record, compliance, process/tool adoption, technology lifecycle
Fragility           Incidents and outages, problems, business volume and criticality, prior failure rate
Time                Overall lead time, dev/test cycle time, idle time

 

These can be drawn from many different tools that are used to develop, test, deploy, and support changes, as shown in the figure below:

Figure 2: Sources of Change Risk Factor Data

A robust data engineering capability is required to extract data from diverse data sources such as those shown in Figure 2, integrate it, and produce dozens of change risk factor indicators from the integrated data. These indicators are the inputs to the Machine Learning model.

A common question from organizations is whether they need to integrate all these different sources and types of data before they can get started. Fortunately, the answer is no: you can implement change risk prediction successfully and gain significant benefit using only data from the IT Service Management system, which is why it is starred in the figure above. For example, a large global Consumer Products company implemented change failure prediction and cut its failure rate in half using only data from its Service Management system. Over time, you can layer in data from additional sources to provide deeper root-cause insights and enhance prediction accuracy.
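As an illustration of turning Service Management records into a risk factor indicator, the sketch below derives an "assignee prior failure rate" for each change. It is a minimal example under stated assumptions: the record fields (`id`, `assignee`, `failed`), the sample data, and the function name are hypothetical, and the records are assumed to be sorted oldest first. Each change only sees the assignee's earlier changes, so the indicator does not leak the outcome it is meant to help predict.

```python
from collections import defaultdict

# Hypothetical change records, already sorted by close date (oldest first).
changes = [
    {"id": "CHG-1", "assignee": "ana", "failed": False},
    {"id": "CHG-2", "assignee": "ben", "failed": True},
    {"id": "CHG-3", "assignee": "ana", "failed": True},
    {"id": "CHG-4", "assignee": "ben", "failed": False},
    {"id": "CHG-5", "assignee": "ana", "failed": False},
]


def add_assignee_prior_failure_rate(records):
    """Attach each assignee's failure rate over their EARLIER changes only."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    enriched = []
    for rec in records:
        who = rec["assignee"]
        # None marks an assignee with no history yet.
        prior = failures[who] / totals[who] if totals[who] else None
        enriched.append({**rec, "assignee_prior_failure_rate": prior})
        # Update the running counts only after the feature has been computed.
        totals[who] += 1
        failures[who] += int(rec["failed"])
    return enriched


features = add_assignee_prior_failure_rate(changes)
for row in features:
    print(row["id"], row["assignee_prior_failure_rate"])
```

Other indicators from the table of risk factors (size, complexity, time) could be derived in the same one-pass style as more data sources are added.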

Finding the risk factors using Machine Learning

Sifting through all the risk-factor data outlined in Figure 2 to identify patterns that differentiate unsuccessful changes from successful ones would be hard to accomplish manually, even with the best Business Intelligence tools. Machine Learning automates this process of revealing the critical risk-factor insights.

Change failure prediction falls in the class of Machine Learning problems called “binary classification.” This means that the Machine Learning algorithm is learning how to classify changes into binary classes: successful versus unsuccessful. As discussed in previous chapters in this series, a variety of binary classification Machine Learning techniques are available, including Logistic Regression, Random Forests, Gradient Boosted Trees, and Support Vector Machines. It is important to experiment with different algorithms to determine which one works best for a given implementation.
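To make binary classification concrete, here is a minimal sketch of Logistic Regression, one of the techniques named above, trained with plain batch gradient descent. It is written in pure Python so that it is self-contained; a production solution would more likely use an established library. The two input features, the training data, and the hyperparameters are made up for illustration.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_failure_probability(weights, bias, row):
    return sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)


# Made-up training data: [assignee prior failure rate, scaled change size]
X = [[0.0, 0.1], [0.1, 0.2], [0.05, 0.3], [0.2, 0.2],
     [0.6, 0.8], [0.7, 0.6], [0.5, 0.9], [0.8, 0.7]]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = failed change, 0 = successful change

weights, bias = train_logistic_regression(X, y)
low = predict_failure_probability(weights, bias, [0.05, 0.1])
high = predict_failure_probability(weights, bias, [0.75, 0.8])
print(f"low-risk example: {low:.2f}, high-risk example: {high:.2f}")
```

The model outputs a probability rather than a hard label, which is what allows changes to be ranked by risk and compared against a review threshold.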

Identifying the best-performing model for a given situation requires approaches such as cross-validation and the use of a hold-out test set. Both strategies rest on the same idea: a randomly selected subset of historical changes is excluded from the model's training and then used as test cases, so that the model is evaluated on changes it has not seen.
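The two evaluation strategies can be sketched as index-splitting helpers. This is an illustrative pure-Python version under simple assumptions (random shuffling with a fixed seed, no stratification by outcome); the function names and the sample sizes are hypothetical.

```python
import random


def holdout_split(n_samples, test_fraction=0.2, seed=42):
    """Shuffle indices and reserve a fraction that training never sees."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(n_samples * test_fraction)))
    return indices[n_test:], indices[:n_test]  # (train, test)


def k_fold_splits(n_samples, k=5, seed=42):
    """Yield (train, validation) index lists; each sample is validated once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 100 historical changes: hold out 20 for the final test, then run
# 5-fold cross-validation over positions within the remaining 80.
train_idx, test_idx = holdout_split(100)
folds = list(k_fold_splits(len(train_idx), k=5))
print(len(train_idx), len(test_idx), [len(v) for _, v in folds])
```

A common pattern is to compare candidate algorithms with cross-validation on the training portion and to touch the hold-out set only once, for the final accuracy estimate. Because change failures are usually rare, a stratified split that preserves the failure ratio in every fold is often preferable to the plain random split shown here.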

A frequently asked question is whether poor data quality will result in “garbage in – garbage out”. The good news is that Machine Learning algorithms will identify inputs that have little or no correlation with change failure (perhaps owing to poor data quality) and will avoid them. Hence, they are able to sift through the good and bad data and extract the good.
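One simple way to picture this is a correlation screen. The sketch below is a deliberately simplified stand-in, not what tree-based or regularized models do internally: it computes the Pearson correlation between each input and the failure outcome and ranks inputs by strength. The feature names and values are made up; the second feature is constructed to carry no signal.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


failed = [0, 0, 0, 0, 1, 1, 1, 1]  # 1 = failed change
inputs = {
    "assignee_prior_failure_rate": [0.0, 0.1, 0.05, 0.2, 0.6, 0.7, 0.5, 0.8],
    # Hypothetical poorly maintained field with no relationship to the outcome.
    "free_text_priority_code": [3, 1, 2, 3, 3, 1, 2, 3],
}

scores = {name: pearson(values, failed) for name, values in inputs.items()}
ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, score in ranked:
    print(f"{name}: {score:+.2f}")
```

Inputs with correlation near zero contribute little to the prediction. Correlation only captures linear relationships, so it is a rough first screen rather than a substitute for the model's own feature importance.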

It is critical to retrain the Machine Learning model regularly, because risk factors are always evolving. One reason is that digital organizations continually change their processes, tooling, and resource models. Another is that organizations act to mitigate risk factors as they use the AI Change Risk Prediction solution, which makes those risk factors less significant over time. The right moment to retrain is when monitoring of the model’s accuracy in production shows that it has become less accurate than it was when originally trained and deployed.
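The monitoring rule described above can be sketched as a small check that compares recent production accuracy with the accuracy measured at deployment. The function name, the 5-point tolerance, and the minimum sample size are hypothetical choices for illustration, not values taken from the solution.

```python
def needs_retraining(baseline_accuracy, recent_outcomes, tolerance=0.05, min_samples=50):
    """Decide whether production accuracy has drifted below the baseline.

    recent_outcomes is a list of (predicted_failed, actually_failed) booleans
    for changes that have already been implemented and closed.
    """
    if len(recent_outcomes) < min_samples:
        return False  # not enough evidence yet to judge drift
    correct = sum(1 for predicted, actual in recent_outcomes if predicted == actual)
    recent_accuracy = correct / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


# Made-up monitoring window: 100 closed changes, 80 predicted correctly.
window = [(False, False)] * 75 + [(True, True)] * 5 + [(False, True)] * 20
print(needs_retraining(baseline_accuracy=0.90, recent_outcomes=window))
```

Because failed changes are typically a small minority, overall accuracy can look healthy while the model misses most failures; tracking precision and recall on the failure class alongside accuracy is a reasonable refinement.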

As an example of a risk-factor insight revealed by the Machine Learning solution, a leading Financial Services company that had extensively modernized its applications and adopted DevOps practices discovered that its remaining legacy applications were experiencing failure rates over 10%. With this insight they were able to focus attention to reduce failure rates to under 4% among these formerly overlooked applications, which resulted in avoidance of annual incident-resolution and change-rework costs totaling $1.4M.

Making the AI-ML insights actionable

The greatest benefits of change failure prevention will be achieved by taking action on two fronts:

  • Removing systemic causes of change failure
  • Mitigating risks of individual changes

To eliminate systemic causes of change failure, the machine learning solution must provide insights into how the top risk factors affect change failure rates, allowing the organization to make targeted improvements to people, processes, and technology practices to holistically mitigate these risk factors. This way, the organization is not left waiting for the Machine Learning solution to alert them that changes that are about to be implemented are high risk. Instead, they are taking action holistically to remediate the cross-cutting root causes of change failure. For instance, if machine learning informs you that newly introduced configuration items have a greater failure rate, you may utilize this information to increase quality assurance and testing requirements for updates to newer configuration items.
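A systemic insight of this kind comes down to comparing failure rates across the values of a risk factor. The sketch below does this for the configuration item example; the field names (`ci_age`, `failed`) and the records are made up for illustration.

```python
from collections import defaultdict


def failure_rate_by(records, factor):
    """Return {factor value: share of changes with that value that failed}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [failures, total]
    for rec in records:
        bucket = counts[rec[factor]]
        bucket[0] += int(rec["failed"])
        bucket[1] += 1
    return {value: failures / total for value, (failures, total) in counts.items()}


# Made-up history: changes to new versus established configuration items.
history = (
    [{"ci_age": "new", "failed": True}] * 2
    + [{"ci_age": "new", "failed": False}] * 2
    + [{"ci_age": "established", "failed": True}] * 1
    + [{"ci_age": "established", "failed": False}] * 5
)

rates = failure_rate_by(history, "ci_age")
print(rates)
```

In this made-up data, changes to new configuration items fail half the time, against one in six for established ones, which is the kind of gap that would justify stricter testing requirements for the newer items.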

When the Machine Learning solution flags new changes as high risk, it is vital that the change owners and governance teams receive specific insights into the risk factors present so they can take appropriate mitigation action. For instance, if the Machine Learning solution flags a new change as having a high failure likelihood because of the change owner's failure history, you can take actions such as evaluating the testing and implementation plan or reassigning oversight to a more experienced change owner.

For example, a leading global healthcare company discovered from Change Risk Prediction that Assignee Prior Failure Rate was a top risk factor for new changes failing. The prior track record of the individuals assigned to implement changes had a big impact on the likelihood of changes succeeding or failing. Now, when the Machine Learning solution flags a new change as high risk because the Assignee has a high prior failure rate, they take actions such as reviewing the implementation plan and assigning a more experienced change implementer to provide support. As a result of this and other insights provided by the Machine Learning solution, this organization expects to reduce their change failures by almost 50%.
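One way to surface which risk factors drive an individual prediction is to break a linear model's score into per-factor contributions. This sketch assumes a logistic regression on standardized inputs; the coefficients, the bias, and the feature names are hypothetical, and other model types would need a different explanation technique.

```python
import math

# Hypothetical coefficients of a logistic regression on standardized inputs.
WEIGHTS = {
    "assignee_prior_failure_rate": 1.8,
    "change_size": 0.9,
    "test_coverage": -1.2,  # more coverage lowers risk
}
BIAS = -2.0


def explain_change(feature_values):
    """Return the failure probability and factors ranked by risk contribution."""
    contributions = {name: WEIGHTS[name] * feature_values[name] for name in WEIGHTS}
    log_odds = BIAS + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    ranked = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return probability, ranked


# Made-up change: assignee well above average failure rate, below-average coverage.
probability, ranked = explain_change(
    {"assignee_prior_failure_rate": 1.5, "change_size": 0.5, "test_coverage": -0.5}
)
print(f"failure probability: {probability:.2f}")
for name, contribution in ranked:
    print(f"  {name}: {contribution:+.2f}")
```

Here the assignee's track record dominates the score, which would point the reviewer toward actions such as assigning a more experienced implementer rather than, say, shrinking the change.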

The figure below illustrates how these two components of change risk mitigation and change failure prevention work together, both powered by the insights provided by the Machine Learning solution.

Figure 3: Turning insights into action

The dashboard depicted in the figure below enables change owners and change governance teams to see which changes in the queue are flagged as having a higher probability of failing, as well as the specific risk factor values driving the high-risk prediction. Additionally, predictions and associated risk factors can be integrated directly into the change workflow through an API to the Change Management or Continuous Deployment system.

Figure: Example Change Failure Prediction dashboard
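For the workflow integration, a prediction and its risk factors would typically travel as a small structured message. The sketch below builds such a message as JSON; the field names and the function are hypothetical and do not describe the actual Digital.ai, Change Management, or Continuous Deployment APIs.

```python
import json


def build_prediction_payload(change_id, failure_probability, risk_factors):
    """Serialize a prediction and its top risk factors as a JSON string.

    risk_factors is a list of (name, value) pairs, most important first.
    """
    payload = {
        "change_id": change_id,
        "failure_probability": round(failure_probability, 3),
        "top_risk_factors": [
            {"name": name, "value": value} for name, value in risk_factors
        ],
    }
    return json.dumps(payload, sort_keys=True)


# Made-up prediction for a made-up change.
message = build_prediction_payload(
    "CHG-1002",
    0.6214,
    [("assignee_prior_failure_rate", 0.35), ("change_size", 42)],
)
print(message)
```

The receiving system could store these fields on the change record so that approvers see the prediction and its drivers without leaving their usual tool.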

About the author

Joe Foley

Joe Foley has more than 30 years of development and operations leadership experience with deep expertise in applying analytics and AI to drive innovation and positive business outcomes for enterprise IT organizations. Joe is currently an Insights Architect with Digital.ai and has previously held senior technology delivery leadership positions with Accenture and Aetna.

About Digital.ai

Digital.ai is an industry-leading technology company dedicated to helping Global 5000 enterprises modernize and transform their businesses to compete in today’s digital markets. Digital.ai combines leading agile, DevOps, security, testing, and analytics technologies in an advanced, AI/ML-powered platform that provides the end-to-end visibility and unprecedented insights enterprises need to accelerate business value from software investments and increase operational efficiency while reducing costs and software-related risks. Purpose-built to manage the scale and complexity of large organizations, the Digital.ai platform enables enterprises to align software orchestration to strategic outcomes and optimize their business around the flow of value for their customers. Learn more at www.digital.ai and join the conversation on Twitter.