State of AI applied to Quality Engineering 2021-22
Section 3.1: Inform & Measure

Chapter 3 by Capgemini

Code Quality Analytics

Business ●○○○○
Technical ●●●●○

Code quality can be improved by applying analytics and machine learning models to the content and metadata of code, together with data from other related repositories. While these models help us make faster decisions, they are not a substitute for human intellect.

The quality of source code is critical to the creation of software, and monitoring it continuously is an essential responsibility in any project. Over the last four decades, organizations have adopted a set of practices for this purpose. The approaches have remained constant in principle but have been enhanced through ongoing learning and adaptation to new tools and technology. The strategies for analyzing code quality that most businesses use extensively are depicted in Figure 1.

  • Pre-commit code quality procedures are those applied before each commit, i.e. thoroughly verifying software code before it is changed in a source code repository.
  • Post-commit code quality approaches are those applied after the commit.
Figure 1: code quality analysis techniques



Testing is delayed chiefly by manual impact analysis, the difficulty of identifying the right set of test cases to execute, inadequate code reviews and unit testing, and the non-deterministic nature of the system.

Opportunities to overcome the mainstream code quality technique challenges

There is considerable room for improvement to overcome current limitations, most notably:

  • The absence of a collective analysis approach, together with a lack of bandwidth and tools, often results in erroneous decision-making.
  • The benefits of software repository mining are mostly untapped. Trends and patterns are either not mined or underutilized, which prevents us from providing quick impact analysis recommendations.
  • Dynamic code analysis (DCA) technologies generate enormous amounts of data, some of which may contain significant amounts of noise. Suppressing noise and extracting valuable information from DCA output would increase the detection of exploitable vulnerabilities.
  • Subject matter experts' knowledge is restricted to a few specialized areas, and knowledge is dispersed and inconsistent across different SMEs.
  • We are not making appropriate use of available computing power or of advances in data analytics and machine learning.

Data sources around source code

Over the years, a massive quantity of data is generated, stored, and preserved within and around program code. The following list summarizes the types of data found in and surrounding source code:

  • Behavioral data (commit history)
  • Attitudinal data (code, design review)
  • Interaction data (dynamic code analysis, code check in comments, defect notes)
  • Descriptive data (code quality attributes like defect density, etc., self-declared information).

Data Sources:

  • Code Smells
  • Defect Data
  • Review Comments
  • Developer
  • Build Management System
  • Source Code Content
  • Dynamic Code Analysis Data
  • Static Code Analysis Data
  • Requirement, Use Cases
  • Unit Test Results

The list above is a possible collection of data sources centered on source code, and it offers an untapped goldmine of insights. Most code quality techniques derive their insights from the data's content. Content-specific quality analysis is language-specific, person-dependent, and time-consuming, and it provides only limited insight. Alongside content, non-content attributes generated from the metadata around code help build a holistic picture of code quality. For instance, commit logs, which contain behavioral data about source code, enable the extraction of insights such as change trends and developer familiarity with the updated code. These features (in data science, characteristics or properties are referred to as features or input variables) are simple to derive and capable of indicating quality.

Existing techniques use each of these data sources in isolation. Combining features derived from multiple repositories, referred to here as code quality features, provides recommendations with higher accuracy.

Figure 2: Features

Code quality features can be categorized into the following categories:

  • Change metrics: features that capture how a code component changes over a release, e.g. a component that receives consecutive changes late in the release cycle has a high probability of being defective.
  • Churn metrics: features that capture the magnitude of the changes made to code components, e.g. a component that undergoes extreme changes has a high probability of being defective.
  • People metrics: features that capture who changes a code component over a time interval, e.g. ‘too many cooks spoil the broth’: a code component touched by many developers has a high probability of being defective.
  • Temporal metrics: features that capture when changes are made to a code component, e.g. changes made late in the release cycle have a high probability of being defective.
  • Code quality factors: features that capture complexity and code quality violations found by static code analysis, e.g. code components with high complexity and severe violations have a high probability of being defective.
  • Code smells: features that evaluate the code smells that lead to a defective state in code components, e.g. components with a high density of severe code smells have a high probability of being defective.
  • Process metrics: features that capture process effectiveness and efficiency, e.g. process gaps or shortcuts taken during software development may lead to code quality issues.
Figure 2: Code Quality Features
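As a rough illustration of how a few of these feature categories could be derived from commit history, here is a minimal Python sketch. The `Commit` record, the field names, the sample data, and the choice of the last 25% of the release window as "late" are all assumptions made for this example; they are not taken from the solution described in this chapter.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Commit:
    """One file-level change record (hypothetical schema for illustration)."""
    path: str
    author: str
    day: date
    lines_added: int
    lines_deleted: int


def file_features(commits, release_start, release_end, late_fraction=0.25):
    """Aggregate per-file change, churn, people, and temporal metrics."""
    span_days = (release_end - release_start).days
    # Changes on or after this point count as "late in the release cycle".
    late_cutoff = release_start.toordinal() + span_days * (1 - late_fraction)

    grouped = defaultdict(list)
    for commit in commits:
        grouped[commit.path].append(commit)

    features = {}
    for path, changes in grouped.items():
        features[path] = {
            # Change metric: how often the file changed in the release.
            "change_count": len(changes),
            # Churn metric: total size of the changes.
            "churn": sum(c.lines_added + c.lines_deleted for c in changes),
            # People metric: how many different developers touched the file.
            "distinct_authors": len({c.author for c in changes}),
            # Temporal metric: how many changes landed late in the cycle.
            "late_changes": sum(c.day.toordinal() >= late_cutoff for c in changes),
        }
    return features


# Made-up sample data.
commits = [
    Commit("engine/core.c", "asha", date(2021, 1, 5), 40, 10),
    Commit("engine/core.c", "ben", date(2021, 3, 20), 200, 150),
    Commit("engine/core.c", "chen", date(2021, 3, 28), 15, 5),
    Commit("ui/menu.c", "asha", date(2021, 1, 12), 8, 2),
]
features = file_features(commits, date(2021, 1, 1), date(2021, 3, 31))
for path, values in sorted(features.items()):
    print(path, values)
```

Code quality factors, code smells, and process metrics would come from other sources (static analysis reports, review tools) and are left out of this sketch.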


Analytics and machine learning for code quality

A system based on machine learning and deep learning employs algorithms that allow a model to learn from data. The trained models are then used to make intelligent decisions automatically, based on the relationships, patterns, and knowledge identified in the data. Algorithms such as Random Forest, Decision Tree, GBM, GLM, KNN, k-means clustering, RNN, and LSTM are used to extract knowledge, patterns, and relationships. Prior experience in mining enables the identification of prevalent patterns; these learnings or patterns can then be employed as predictors of future outcomes.

Historical patterns identified through the features outlined above can aid in developing a model capable of recognizing defective and risky code components inside a source code repository.
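To make the idea concrete, the following is a minimal sketch of one of the algorithms named above, k-nearest neighbours (KNN), written in plain Python. It assumes each file is described by four hypothetical numeric features (change count, churn, distinct authors, late changes) and that past releases provide a label of 1 for files later found defective. The data is invented, and a production model would normally be built with a machine learning library and validated properly.

```python
import math


def knn_defect_probability(train_rows, train_labels, query, k=3):
    """Estimate P(defective) as the share of defective files among the
    k historical files closest to `query` in feature space."""
    if not 1 <= k <= len(train_rows):
        raise ValueError("k must be between 1 and the number of training rows")

    # Min-max scale each feature so large-valued ones (e.g. churn) do not dominate.
    columns = list(zip(*train_rows))
    lows = [min(col) for col in columns]
    spans = [(max(col) - min(col)) or 1.0 for col in columns]

    def scale(row):
        return [(value - low) / span for value, low, span in zip(row, lows, spans)]

    scaled_query = scale(query)
    distances = sorted(
        (math.dist(scale(row), scaled_query), label)
        for row, label in zip(train_rows, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return sum(nearest_labels) / k


# Hypothetical history: [change_count, churn, distinct_authors, late_changes]
train_rows = [
    [12, 900, 6, 5],
    [10, 750, 5, 4],
    [9, 600, 4, 3],
    [2, 40, 1, 0],
    [3, 60, 1, 0],
    [1, 15, 1, 0],
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = file was later found defective

print(knn_defect_probability(train_rows, train_labels, [11, 800, 5, 4]))
print(knn_defect_probability(train_rows, train_labels, [2, 50, 1, 0]))
```

The returned value is only a neighbourhood vote, not a calibrated probability; it is meant to show how historical patterns can be reused to score incoming changes.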

In the absence of an automated solution, architects and subject matter experts (SMEs) may manually analyze code quality using a set of heuristics, mental shortcuts. These heuristics can be mined manually or automatically using a heuristics mining framework and an analytical model. The objective is not to supplant SME intelligence, but rather to supplement and accelerate SME decision-making.

Code Quality Analytics

Code quality analytics is an intelligent solution based on analytics and machine learning. It uses data from multiple repositories, such as source code, code reviews, project management, defect management, requirements management, and static code analysis, to identify and forecast risky incoming source code file changes in a repository, based on more than 50 product risk feature indicators. It enables the extraction of patterns from historical data and the establishment of connections between several siloed repositories.

In code quality analytics, we:

  • Factor in data from multiple repositories such as source code, code reviews, project management, defect management, and static code analysis.
  • Establish linked data across siloed repositories.
  • Extract patterns from historical data.
  • Derive 50+ change, churn, temporal, code smell, violation, and people metrics, which are unique value propositions.
  • Implement a self-learning model that predicts the riskier code and components in a release, using analytics and machine learning.
  • Invoke the code quality analytics model on demand or on a schedule to get real-time recommendations on riskier components and files.
Figure: High level overview
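One common way to establish links between a source code repository and a defect tracker is to look for issue keys in commit messages. The sketch below assumes JIRA-style keys such as "AGL-101" and a simple dictionary shape for commits; the key pattern, field names, and sample data are illustrative assumptions, not details of the solution described here.

```python
import re
from collections import defaultdict

# JIRA-style keys such as "AGL-101"; the exact key format is an assumption.
ISSUE_KEY = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def link_defects_to_files(commits, defect_keys):
    """Map each file to the known defect keys referenced by commits that touched it.

    Also returns the commits that reference no known defect, which is a
    rough indicator of how much linkage is missing.
    """
    defects_by_file = defaultdict(set)
    without_defect = []
    for commit in commits:
        keys = set(ISSUE_KEY.findall(commit["message"])) & defect_keys
        if not keys:
            without_defect.append(commit["sha"])
            continue
        for path in commit["files"]:
            defects_by_file[path] |= keys
    return dict(defects_by_file), without_defect


# Made-up sample data.
commits = [
    {"sha": "a1", "message": "AGL-101: fix null check in parser", "files": ["src/parser.c"]},
    {"sha": "b2", "message": "Refactor logging", "files": ["src/log.c"]},
    {"sha": "c3", "message": "Fix AGL-101 and AGL-207 regression", "files": ["src/parser.c", "src/net.c"]},
    {"sha": "d4", "message": "AGL-300 add feature flag", "files": ["src/flags.c"]},
]
defect_keys = {"AGL-101", "AGL-207"}  # keys exported from the defect tracker

linked, without_defect = link_defects_to_files(commits, defect_keys)
print(linked)
print(without_defect)
```

The share of commits that end up in `without_defect` gives an early signal of the data linkage gaps discussed in the next section.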


Challenges in Code Quality Analytics

There are several significant issues associated with implementing analytics and machine learning for code quality. The difficulties encountered can be summarized as follows:

  • Data for extracting key input features is not available.
  • Data is inadequate for performing code quality analytics.
  • Data is not accessible for analysis.
  • Data quality is not sufficient for analysis.
  • Data cannot be linked because of process gaps.
  • Data is noisy, which makes outliers difficult to determine.

A mature project, with a strong release pipeline and the use of industry-standard technologies, is often free of the aforementioned difficulties.

Impact - Automated Impact Analysis

Code quality analytics enables the identification of high-risk components and files in a source code repository. Any update to the source code has an effect on the code's quality, features, and functionality. Manually analyzing the effects of incoming changes and adjusting the testing strategy accordingly is extremely time-consuming; it may delay testing or fail to provide optimal feature test coverage for the incoming release.

Analytics-assisted automated impact analysis employs trace, dependency, experiential, and empirical methodologies to identify changed and impacted areas, and then forecasts the test cases that cover them. It helps identify changes to requirements, source code, and test cases inside a continuous integration environment. Each change triggers an impact analysis model that identifies the change and the locations it affects. This allows the model to predict the test cases that span those areas and to feed continuous integration testing with a test plan. Additionally, the impact analysis model learns from the outcomes of the test plan and adjusts the proposed run time as necessary.
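The dependency-based part of this idea can be sketched in a few lines of Python: walk the reverse dependency graph from the changed components, then pick the tests that cover anything reached. The component names, the `dependents` graph, and the `coverage` map below are hypothetical, and the trace, experiential, and empirical methods mentioned above are not modelled.

```python
from collections import deque


def impacted_components(changed, dependents):
    """Breadth-first walk over the reverse dependency graph.

    `dependents[x]` lists the components that depend directly on x.
    The result includes the changed components themselves.
    """
    seen = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def select_tests(impacted, coverage):
    """Return the tests whose covered components overlap the impacted set."""
    return sorted(
        test for test, covered in coverage.items() if impacted & set(covered)
    )


# Made-up sample data.
dependents = {
    "can_driver": ["diagnostics", "telemetry"],
    "diagnostics": ["dashboard"],
    "audio": ["media_player"],
}
coverage = {
    "test_can_frames": ["can_driver"],
    "test_dashboard_alerts": ["dashboard"],
    "test_playlist": ["media_player"],
    "test_telemetry_upload": ["telemetry", "network"],
}

impacted = impacted_components(["can_driver"], dependents)
test_plan = select_tests(impacted, coverage)
print(sorted(impacted))
print(test_plan)
```

In a continuous integration setting, a function like `select_tests` would be called for each incoming change, and its output would form the test plan for that build.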

In 2018, a global automotive components manufacturer was evaluating the adoption of the open source Automotive Grade Linux codebase[1]. The traditional approach of relying only on static code analysis and unit test results as indicators of code risk was regarded as too risky. They opted for a Capgemini platform that centered on identifying code risk based on metadata (commit records).

The key challenge was that the codebase was complex, with 104K files, 12K developers contributing from all over the globe, and 600K+ commits and counting. The customer's quality engineering group wanted to automate the impact analysis so that they could determine the inherent risk. The execution approach is detailed below.

Execution Approach:

  • Data Sources Factored:
    • Git – Source code
    • Gerrit – Review comments
    • JIRA – Defects, developer profiles
    • C++ Test – Code quality analysis (compliant with MISRA standards)
  • Data collection, data preparation, and exploratory data analysis.
  • Feature engineering to derive code quality features.
  • Machine learning models to estimate the probability of each file being defective.
  • Auto classifier for risk classification.
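The last two steps can be pictured as turning each file's estimated defect probability into a risk class and ranking the files. The sketch below uses fixed thresholds of 0.4 and 0.7; these cut-offs, the function names, and the sample probabilities are assumptions for illustration only, since the source does not describe how its classifier works.

```python
def classify_risk(probability, medium_at=0.4, high_at=0.7):
    """Map a defect probability to a risk class using fixed thresholds.

    The thresholds are illustrative assumptions, not values from the case study.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if probability >= high_at:
        return "high"
    if probability >= medium_at:
        return "medium"
    return "low"


def rank_files(defect_probabilities):
    """Return (path, probability, risk) tuples, riskiest file first."""
    ordered = sorted(defect_probabilities.items(), key=lambda item: item[1], reverse=True)
    return [(path, prob, classify_risk(prob)) for path, prob in ordered]


# Made-up model output.
scores = {"src/parser.c": 0.82, "src/net.c": 0.45, "src/log.c": 0.08}
for path, prob, risk in rank_files(scores):
    print(f"{path}: {prob:.2f} ({risk})")
```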


Improved code quality can be accomplished by applying analytics and machine learning to the code's content and metadata, as well as to data from other related sources. Recommendations based on content mining alone have reached relatively low accuracy (between 20% and 30%), whereas recommendations based on code, metadata, and other relevant sources have achieved 75–90% accuracy. In general, applying analytics and machine learning to code and code-related artifacts enables enterprises to deliver higher-quality code faster. The recommendations generated by automated impact analysis improve the quality of the code in the following ways:

  • Developing and running small unit tests for the risky components and files
  • Conducting multi-layer code reviews for hot files
  • Running dynamic build verification tests
  • Focusing QA team testing on the high-risk areas
  • Increasing the execution cadence of the test cases corresponding to hot files
  • Improving quality through early detection of defects.

Recommendations generated by analytics and machine learning models aid teams in making faster decisions, but they are not a substitute for human intellect. Frequent auditing of the models' effectiveness using statistical measures such as precision, recall, and F-measure, as well as upgrading models with missing heuristics, would help ensure the models' correctness.
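For reference, the three measures mentioned above can be computed from a release's outcomes as follows. This is a minimal sketch that assumes binary labels (1 = defective, 0 = clean) for each file; the sample labels are invented.

```python
def precision_recall_f1(actual, predicted):
    """Compute precision, recall, and F-measure (F1) for binary labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or never occurs.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up audit data: which files were actually defective vs. flagged by the model.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(actual, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision shows how many flagged files were truly defective, while recall shows how many defective files the model caught; tracking both over successive releases helps reveal when a model needs to be retrained.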

Additionally, automated impact analysis enables the identification of areas that will be impacted by incoming changes and the optimization of the test plan appropriately, ensuring faster feedback and maximum coverage. It enables the company to meet time-to-market objectives while delivering the highest-quality products.

About the authors

Vivek Jaykrishnan


Vivek Jaykrishnan is an enterprise test consultant and architect with extensive experience. He has over 22 years of experience leading Verification and Validation functions, including functional, system, integration, and performance testing, in leadership positions with reputable organizations. Vivek has demonstrated success working across a variety of engagement models, including outsourced product verification and validation in service organizations, independent product verification and validation in captive units, and globally distributed development in a product company. Additionally, Vivek has extensive experience developing and implementing test strategies and driving testing in accordance with a variety of development methodologies, including continuous delivery, agile, iterative development, and the waterfall model. Vivek is passionate about incorporating cognitive intelligence into testing and is also interested in exploring the frontiers of IoT testing.

Vivek Sejpal


Vivek is a passionate data scientist with more than five years of experience in analytics and machine learning. He is currently engaged in the design and development of intelligent data-driven assets for quality engineering applications. Vivek works with customers on end-to-end engagements to demonstrate the value of various intelligent asset offerings across domains and industries. Additionally, Vivek contributes to the research and development of new intelligent assets for the purpose of resolving critical business problems through a data-driven approach that is consistent with business requirements.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organisation of 325,000 team members in nearly 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. In 2021, the Group reported global revenues of €18 billion.
