State of AI applied to Quality Engineering 2021-22
Section 3.1: Inform & Measure

Chapter 4 by Capgemini

Code analytics extended to impact-based testing

Business ●○○○○
Technical ●●●●○


Artificial Intelligence (AI) helps engineers determine which portions of code remain untested and provides insights for the testing team to build additional test cases in order to work towards 100 percent code coverage. It helps ensure that applications are not released with untested code and automatically identifies impacted functional blocks or test cases based on coverage data. The Hamming distance and k-means clustering algorithms are used, respectively, to minimize test case redundancy and to optimize the test suites, thereby increasing test case effectiveness. In the process, the test cases impacted by code changes can be identified.

In the last chapter, we discussed how applying analytics and machine learning to code and code-related artefacts helps organizations achieve quicker releases and better code quality. Code analytics gives us the ability to map test cases to code and optimize test suites based on their code coverage. In contrast to functional coverage, code coverage measurements help us identify untested lines of code. This is an efficient way of ensuring that untested code is not deployed to production, and it aids the development of new test cases that expand functional coverage. This strategy enables us to adjust the test suite so that each line of code is tested, increasing coverage prior to application release. Compared with applications with low code coverage, those with high code coverage have a lower likelihood of harboring undetected software defects.

The purpose of this experience-based chapter is to further illustrate how Artificial Intelligence (AI)/ Machine Learning (ML) can be used to enhance code analytics.

Solution Overview

The solution comprises four phases that must be completed in order.

  1. Establish the test case to code mapping
  2. Identify duplicate test cases using the Hamming distance algorithm
  3. Identify the impacted test cases based on code changes
  4. Optimize the test suites using k-means clustering

Establish the test case to code mapping 

The figure below illustrates how a functional test case is mapped to code. Generally, the traceability of requirements to test cases and defects is available. In the suggested approach, we also map the code to functional test cases, which provides us with 360-degree traceability. The code coverage information is passed into the AI/ML system to find redundancies and further optimize the test cases.

Figure: 360-degree traceability

The figure below illustrates the process of translating code to functional test cases and generating the coverage report.

The code analytics solution walks through each line of the code via package -> class -> method and assigns a unique number and Hamming distance (explained in the next section) to each of the test cases, based on their unique package/class/method combination. This helps identify the unique lines of code that were touched (or left untouched) during the execution of a test case.

The code analytics solution generates a dynamic code coverage report which is used to associate each test case id with the lines of code covered during execution. This information is then used to compute unique lines of code and overall test case effectiveness in terms of code coverage.
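As an illustration, the unique-line computation described above can be sketched as follows, assuming the dynamic coverage report has been exported as a mapping from test case id to the set of covered line identifiers. The names and data layout here are illustrative assumptions, not the actual solution's API:

```python
# Illustrative sketch (not the vendor tool): given per-test-case coverage
# as {test case id -> set of covered line ids}, keep for each test case
# only the lines that no other test case covers.

def unique_lines_per_test(coverage):
    unique = {}
    for tc, lines in coverage.items():
        covered_elsewhere = set()
        for other, other_lines in coverage.items():
            if other != tc:
                covered_elsewhere |= other_lines
        unique[tc] = lines - covered_elsewhere
    return unique

# Hypothetical coverage data for three test cases over six lines of code.
coverage = {
    "TC_1": {1, 2, 3, 4},
    "TC_2": {3, 4, 5},
    "TC_3": {6},
}
print(unique_lines_per_test(coverage))
# {'TC_1': {1, 2}, 'TC_2': {5}, 'TC_3': {6}}
```

Lines 3 and 4 are covered by both TC_1 and TC_2, so neither test case gets credit for them; only the lines a test case contributes exclusively count towards its effectiveness.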

Figure: Workflow of Functional Test Case to Code Coverage

The test suite is executed in accordance with the test strategy/plan to ensure that all functional requirements are met. Using the code analytics solution, the overall code coverage for the test suite is measured. If the code coverage does not meet the defined goals, additional test cases are created to increase it.

The effectiveness of the test cases is determined from the code coverage data. The unique lines for each test case are identified by removing the overlapping lines also covered by other test cases. Test case effectiveness metrics can be used to prioritize the test cases, and test cases with low or no effectiveness (duplicate coverage) can be eliminated to optimize the test suite.

Test case effectiveness metrics

Total LoC: 1202
Code coverage: 317 lines (26%)
Number of TCs: 8

Coverage ID  TC ID  TC Name                    Lines Covered  Unique Lines (TC)  Unique Lines (Test Set)  Test Case Effectiveness
1            TC_1   TC1_Admin_Login            94             94                 94                       100%
2            TC_2   TC2_Add_New_Bank           0              0                  0                        0%
3            TC_3   TC3_Add_New_Employee       169            150                95                       56%
4            TC_4   TC4_Emp_Change_Password    41             41                 39                       95%
5            TC_5   TC5_Admin_Change_Password  69             65                 56                       81%
6            TC_6   TC6_Find_Branch            32             32                 28                       88%
7            TC_7   TC7_Misc_Menu_Validation   32             32                 0                        0%
8            TC_8   TC8_Overall                299            276                5                        2%
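The effectiveness figures in the table above follow from dividing the unique lines a test case contributes to the test set by the total lines it covers. A minimal sketch of that computation, using a few rows of the table as a check (the function name is illustrative):

```python
# Effectiveness = unique lines contributed to the test set / lines covered,
# expressed as a whole percentage. A test case that covers nothing, or whose
# coverage is entirely duplicated by other test cases, scores 0%.

def effectiveness(lines_covered, unique_in_test_set):
    if lines_covered == 0:
        return 0
    return round(100 * unique_in_test_set / lines_covered)

# (lines covered, unique lines in test set) pairs from the table above
rows = {
    "TC_1": (94, 94),    # 100%
    "TC_3": (169, 95),   # 56%
    "TC_4": (41, 39),    # 95%
    "TC_8": (299, 5),    # 2% - almost all of its coverage is duplicated
}
for tc, (covered, unique) in rows.items():
    print(tc, f"{effectiveness(covered, unique)}%")
```

TC8_Overall illustrates why raw line counts mislead: it covers 299 lines, more than any other test case, yet only 5 of them are unique to it, so it is a prime candidate for elimination.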

Identifying duplicate test cases using the Hamming distance algorithm

Functional test case to code coverage mapping data can be passed to the Hamming distance algorithm to identify duplicate test cases. The Hamming distance between two binary strings is defined as the number of bit positions in which they differ.

Typically, the Hamming distance is applied to binary strings; here, we use it to find duplicate test cases, as Pang et al. advise in their publication. A test case tc1 is considered a duplicate of test case tc2 if the two test cases touch and cover the same lines of the code being tested. Let's assume that the code under test contains ten lines. As illustrated in Table 2, each line is represented by a column.

We assign a value of 1 at each line position that a test case touches. If the test case does not cover a line, that position is set to 0. Such a scenario is depicted in Table 2.

The numbers 1 to 10 indicate the lines of code and tc1 to tc4 indicate the four test cases. From Table 2, we see that executing test case tc1 covers lines 1, 3, 5 and 7. The results for the other test cases can be read in the same way. By assembling these 1s and 0s into binary strings, we can calculate the Hamming distance. As their binary strings show, tc1, tc2 and tc3 all cover different lines of code, whereas tc4 covers the same lines as tc1, indicating that these two test cases are identical. XOR-ing the strings for tc1 and tc2 gives 1111111000, which contains seven 1s, so their Hamming distance is 7, whereas the Hamming distance between tc1 and tc4 is 0. As a result, tc1 and tc4 are considered duplicate test cases. This method identifies all duplicate test cases in the test suite and prevents unnecessary execution effort.

  1 2 3 4 5 6 7 8 9 10
tc1 1 0 1 0 1 0 1 0 0 0
tc2 0 1 0 1 0 1 0 0 0 0
tc3 0 0 0 0 0 0 0 1 1 1
tc4 1 0 1 0 1 0 1 0 0 0

Table 2 - Illustration of using Hamming Distance to identify the duplicate test cases
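The duplicate check in Table 2 can be sketched in a few lines, using the bit strings for tc1, tc2 and tc4 from the table:

```python
# Hamming distance between two equal-length bit strings: the number of
# positions at which they differ. Distance 0 means duplicate coverage.

def hamming(a, b):
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

tc1 = "1010101000"  # covers lines 1, 3, 5, 7
tc2 = "0101010000"  # covers lines 2, 4, 6
tc4 = "1010101000"  # covers lines 1, 3, 5, 7 - same as tc1

print(hamming(tc1, tc2))  # 7  -> distinct coverage
print(hamming(tc1, tc4))  # 0  -> tc4 duplicates tc1, candidate for removal
```

A distance of 0 flags a pair as duplicates; in practice one of the two can be dropped from the suite without losing any code coverage.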


Identify the impacted test cases based on code changes

Code analytics maps the test cases to code and optimizes test suites based on their code coverage, and it identifies the impacted test cases from this mapping. Change-based impact testing can be performed using the code coverage data, since the mapping of test cases to lines of code and methods is available. Coverage metrics for the current build are baselined and compared with those of the new build to identify the impacted test cases. Only the impacted test cases need to be executed to validate the recently changed code. Change-based impact testing thus identifies the subset of the test suite required to validate recent code changes, which optimizes the test execution effort.
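The selection step can be sketched as follows, assuming the baselined mapping is available as line id -> test cases covering that line, and the diff between builds yields the set of changed line ids (both structures are illustrative assumptions):

```python
# Hypothetical change-based impact selection: pick only the test cases
# that touch at least one line changed in the new build.

def impacted_tests(line_to_tests, changed_lines):
    impacted = set()
    for line in changed_lines:
        # lines that no test covers contribute nothing (a coverage gap)
        impacted |= line_to_tests.get(line, set())
    return impacted

# Baselined mapping from the previous build (illustrative data).
mapping = {
    10: {"TC_1"},
    11: {"TC_1", "TC_3"},
    42: {"TC_2"},
}
# The new build changed lines 11 and 99; line 99 is uncovered code.
print(impacted_tests(mapping, {11, 99}))
# {'TC_1', 'TC_3'}
```

Only TC_1 and TC_3 need to run against the new build; the changed but uncovered line 99 also surfaces as a coverage gap that may warrant a new test case.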

Test suites Optimization using k-means clustering

K-means clustering is one of the most popular unsupervised learning algorithms. After choosing the number of clusters (K), we randomly select K points as the initial centroids. Each remaining point is assigned to the cluster with the closest centroid. Once all points have been assigned, the centroids are recalculated, and the procedure is repeated until the centroids of the newly formed clusters no longer change. As Pang et al. indicate in their article, we propose using k-means clustering to distinguish effective from ineffective test cases. Effective test cases are those that are impacted by the new release's changes and that, when executed, reveal the new release's flaws. The objective is to execute only the most effective test cases. Consider Table 3 below.

  1 2 3 4 5 6 7 8 9 10 P/F
tc1 1 0 1 0 1 0 1 0 0 0 P
tc2 0 1 0 1 0 1 0 0 0 0 F
tc3 0 0 0 0 0 0 0 1 1 1 P
tc4 1 0 1 0 1 0 1 0 0 0 P

Table 3 - Test cases run against 10 lines of code with Line 9 changed 

Table 3 is identical to Table 2, except for the addition of a column indicating whether the test case passed (P) or failed (F). tc2 and tc3 are automatically included in the set of effective test cases, as tc2 failed and tc3 covers the line where the change was made. This forms the initial effective set from which the centroid is determined, while the centroid of the non-effective set is initialized to 0. We take the remaining two test cases and calculate the Hamming distance between each of them and the centroids of the effective and non-effective sets. Each test case is assigned to the cluster with the shortest distance. The procedure is repeated until the centroids no longer change. As a result, we obtain two clusters of test cases, effective and ineffective. This technique aids in the optimization and prioritization of test suites by classifying test cases into effective and ineffective clusters.
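A simplified single-pass sketch of this clustering on the Table 3 data follows. It uses an L1 distance to the (possibly fractional) centroids, which reduces to the Hamming distance for binary centroids; a full k-means would re-run the assignment until the centroids stop moving. The seeding and vectors follow the table; variable names are illustrative:

```python
# One assignment pass of the two-cluster (effective / ineffective) scheme.

def distance(v, c):
    # L1 distance; equals the Hamming distance when the centroid is binary.
    return sum(abs(x - y) for x, y in zip(v, c))

def centroid(vectors):
    # Per-position mean of a cluster's coverage vectors.
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))

# Coverage vectors from Table 3 (lines 1-10).
tests = {
    "tc1": (1, 0, 1, 0, 1, 0, 1, 0, 0, 0),
    "tc2": (0, 1, 0, 1, 0, 1, 0, 0, 0, 0),
    "tc3": (0, 0, 0, 0, 0, 0, 0, 1, 1, 1),
    "tc4": (1, 0, 1, 0, 1, 0, 1, 0, 0, 0),
}

# Seed: tc2 failed and tc3 covers the changed line 9, so both start effective.
effective, ineffective = ["tc2", "tc3"], []
eff_centroid = centroid([tests[t] for t in effective])
ineff_centroid = (0,) * 10  # non-effective centroid initialized to 0

# Assign the remaining test cases to the nearer centroid.
for t in ("tc1", "tc4"):
    if distance(tests[t], eff_centroid) < distance(tests[t], ineff_centroid):
        effective.append(t)
    else:
        ineffective.append(t)

print(effective, ineffective)
# ['tc2', 'tc3'] ['tc1', 'tc4']
```

tc1 and tc4 passed and share no coverage with the changed line or the failing test, so they land in the ineffective cluster and can be deprioritized for this release.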


The following are some of the perceived benefits of code coverage-based testing and impact analysis.

  • Enable Shift Left for early defect detection and improved code quality 
    • Dynamic code analytics 
    • Functional test coverage in terms of code
  • Increase test coverage to prevent defect leakage 
  • Optimized regression test execution for cycle time reduction 
  • 360º Traceability for improved change management  
  • Focus on business critical / impacted features to reduce risk 
  • Test case deduplication based on code coverage

About the authors

Arunagiri Sakkaraipalam


Arunagiri Sakkaraipalam has over 20 years of experience in product management, project management, people management, process, quality engineering and leadership across the SDLC, in a variety of technologies and domains. He has end-to-end responsibility for conceptualizing, planning and driving product releases from initiation to closure, with a proven track record of managing multiple releases of high-profile products in the organization. He has provided the leadership, insight and vision to translate business requirements into effective business solutions. In his role as Innovation Champion, he consolidated the practice's innovations and ideas on the innovation platform and played an active part in reviewing the technology landscape, focusing on new products, tools, trends, papers and analyst reports.

Venkatesh Babu


Venkatesh Babu is a technology leader with 22+ years of experience in JEE, .NET, Mobile, Cloud, Automation, SMAC, IoT, RPA, AI/ML and Digital technologies, having architected, designed and delivered enterprise solutions for global clients. He works in the Research & Innovation Group and is passionate about emerging technologies. His focus areas include Cloud, Artificial Intelligence & Machine Learning, IoT, Gamification, Gesture Recognition, Augmented Reality, Blockchain, Big Data, Microservices, Design Thinking, Solution Architecture & Consulting, Application Integration, Product Engineering, Wearables, Crowdsourcing and technology evangelization.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organisation of 270,000 team members in nearly 50 countries. With its strong 50-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported 2020 global revenues of €16 billion.

Get the Future You Want



