State of AI applied to Quality Engineering 2021-22
Section 3.2: Inform & Interpret

Chapter 2 by Capgemini

NLP for duplicate and orphan assets

Business ●○○○○
Technical ●●●●○


In the preceding chapter, we covered the breadth of NLP approaches and tools. NLP has a wide range of practical applications in quality engineering. These include, but are not limited to, automating test procedures from natural language comments, understanding inherent code risk from commit comments, and assessing the effort distribution of engineers on an ongoing project. Given the advances in NLP and its wide use in AI applications over the last few years, it is evident that NLP can help organizations solve multiple business problems related to product quality. In this chapter, we examine several concrete use cases where NLP can help us make intelligence-based decisions rather than only data-driven decisions:

  1. Duplicate defects and test cases
  2. Orphan defects and test cases
  3. Log analysis and mining
  4. Sentiment analysis

Use case #1: Duplicate defects and test cases

Defects can be discovered at almost any point in the test process. Sogeti TMAP describes the steps a tester should perform when a defect is found. These include confirming the anomaly, reproducing it, and determining its cause, among other things. One step in this process is to determine whether the anomaly is clearly a duplicate. Numerous duplicate anomalies are an obvious indication of an ineffective testing methodology. So, while we investigate how AI can help address this issue, our recommendation is also to refocus attention on the overall test strategy.

Here we propose a technical approach for handling duplicate defects (by summary or by description) and duplicate test cases (by name or by description).

Preprocessing steps

Unstructured textual data must first be converted into a structured form before any analytics can be performed on it. The general preprocessing steps are:

  1. Lower-case the documents.
  2. Tokenize: split the text into words (word tokenization) or sentences (sentence tokenization).
  3. Remove punctuation marks.
  4. Remove numbers, if needed.
  5. Trim extra white space.
  6. Remove stop words such as the, is, a, an, and, was, there, here, be, etc.
  7. Stem or lemmatize: convert words in various forms into their root word. For example, “played” and “playing” are both stemmed to the single word “play”.
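
The steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not the authors' implementation: the stop-word list is just the short one quoted above, and `naive_stem` is a hypothetical toy suffix stripper standing in for a real stemmer or lemmatizer.

```python
import re

# Short illustrative stop-word list taken from the examples in the text;
# a real project would use a fuller list.
STOP_WORDS = {"the", "is", "a", "an", "and", "was", "there", "here", "be"}


def naive_stem(word):
    """Toy suffix stripper (assumption): a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, drop_numbers=True):
    """Apply the preprocessing steps and return a list of stemmed tokens."""
    text = text.lower()                          # step 1: lower-case
    text = re.sub(r"[^\w\s]", " ", text)         # step 3: punctuation to spaces
    if drop_numbers:
        text = re.sub(r"\d+", " ", text)         # step 4: digits to spaces
    tokens = text.split()                        # steps 2 and 5: tokenize, trim
    tokens = [t for t in tokens if t not in STOP_WORDS]  # step 6: stop words
    return [naive_stem(t) for t in tokens]       # step 7: stemming


tokens = preprocess("The user Played, and was playing 2 games!")
print(tokens)
```

The order of the steps is a design choice: punctuation and numbers are removed before splitting so that a single `split()` call both tokenizes the text and discards surplus white space.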

Data transformation

After preprocessing, we transform the data into a structured format. To do so, we compute the following metrics:

  1. Term Frequency (TF): How many times a word appears in a document. A word that appears more frequently is given more importance.
  2. Inverse Document Frequency (IDF): A measure derived from the number of documents in which a particular word appears. The most frequent words are not necessarily the most important: words like “the”, “no”, and “be” may have a high frequency but carry little meaning. IDF suppresses such words by giving a lower weight to words that appear in many documents.
  3. Term Frequency-Inverse Document Frequency (TF-IDF): The product of TF and IDF. It keeps trivial words from dominating while highlighting words that appear in few documents.
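
As a rough illustration of these three metrics, the sketch below computes TF-IDF weights by hand. The text does not give an exact formula, so the choices here are assumptions: TF is the raw count of a term in a document and IDF is log(N / df), where N is the number of documents and df is the number of documents containing the term. Libraries often apply smoothed variants of this formula.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df); terms present everywhere get weight 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append({term: tf[term] * idf[term] for term in tf})
    return vectors


# Hypothetical, already preprocessed defect summaries.
docs = [
    ["login", "button", "fail"],
    ["login", "page", "crash"],
    ["login", "button", "missing"],
]
weights = tf_idf(docs)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus “login” appears in every document, so its weight is zero, while “fail”, which appears in only one document, receives the largest weight in the first document.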

Once the unstructured data has been converted into a structured format, we compare the documents to find how similar they are. To measure the similarity between two text documents, we use cosine similarity.

$$\text{Cosine Similarity}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2} \, \sqrt{\sum_i B_i^2}}$$

 

Here A and B are the two document vectors, the sums run over their components, and the square root is taken of each sum of squares. Because TF-IDF weights are non-negative, the similarity value ranges from 0 to 1. A cosine similarity of 0 means that documents A and B have no words in common, and a cosine similarity of 1 means that A and B contain the same words in the same proportions (ignoring stop words).
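
A small worked example may make the formula concrete. The sketch below is a plain-Python version of cosine similarity over sparse vectors stored as dictionaries. The vectors `a`, `b`, and `c` are made-up weights for illustration, and returning 0 for an empty vector is a convention assumed here.

```python
import math


def cosine_similarity(a, b):
    """a, b: sparse document vectors as {term: weight} dicts."""
    # Dot product over the terms the two vectors share.
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty (all-zero) vectors
    return dot / (norm_a * norm_b)


a = {"login": 1.0, "fail": 1.0}
b = {"login": 1.0, "crash": 1.0}
c = {"report": 2.0}

print(cosine_similarity(a, b))  # one of two terms shared: close to 0.5
print(cosine_similarity(a, c))  # no shared terms: 0.0
print(cosine_similarity(a, a))  # identical vectors: close to 1.0
```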

We then find the best 3 matches for each document in this cross-product matrix. The entities used to find matches in this use case are listed below.

  • Duplicate Defects:
    • By Summary: Each defect summary is compared against all the remaining defect summaries within a project, and the best 3 matches for this summary are selected based on the cosine similarity value.
    • By Description: Each defect description is compared against all the remaining defect descriptions within a project, and the best 3 matches are selected based on the cosine similarity value.
  • Duplicate Test Cases:
    • By Name: Each test case name is compared against all the remaining test case names within a project, and the best 3 matches are chosen based on their cosine similarity value.
    • By Description: Each test case description is compared against all the remaining test case descriptions within a project, and the best 3 matches are chosen based on their cosine similarity value.
Figure: Outcome of the textual data analysis, with measured similarities
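
The selection of the best 3 matches can be sketched as follows, assuming the pairwise cosine similarity matrix has already been computed. The function name `best_matches` and the similarity values are hypothetical and only illustrate the selection step. Each document is excluded from its own candidate list, since it always matches itself with a score of 1.

```python
import heapq


def best_matches(similarity, k=3):
    """similarity: square matrix (list of rows); similarity[i][j] scores
    document i against document j. Returns, for each document, the k
    highest-scoring other documents as (index, score) pairs."""
    result = []
    for i, row in enumerate(similarity):
        # Skip the diagonal so a document is never matched with itself.
        candidates = [(j, score) for j, score in enumerate(row) if j != i]
        result.append(heapq.nlargest(k, candidates, key=lambda pair: pair[1]))
    return result


# Hypothetical similarity values for five defect summaries.
similarity = [
    [1.00, 0.82, 0.10, 0.05, 0.00],
    [0.82, 1.00, 0.12, 0.00, 0.03],
    [0.10, 0.12, 1.00, 0.64, 0.20],
    [0.05, 0.00, 0.64, 1.00, 0.15],
    [0.00, 0.03, 0.20, 0.15, 1.00],
]
matches = best_matches(similarity)
for index, top in enumerate(matches):
    print(index, top)
```

A reviewer would still need to confirm each suggested pair, since a high score indicates similar wording and not necessarily a true duplicate.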

 


Use case #2: Orphan defects and test cases

In theory, all test assets must have a correct relationship with their respective entities. For instance, all anomalies discovered during the test execution phase must be associated with the corresponding test cases, and all test cases must be associated with the corresponding requirements. However, in practice, not all assets are connected to their corresponding other entities. Orphan assets are those that lack a connection to their related entities.

In this regard, we have the following orphan artifacts:

  1. Orphan defects: defects that are not yet linked to any test case.
  2. Orphan test cases: test cases that are not linked to any requirement (ALM) or user story (Jira).

Then, using Text Analytics or Natural Language Processing (NLP), we attempt to connect these assets to their related entities:

  1. Orphan Defects: Defects to Test Case Mapping using Defect Description and Test Case Name.
  2. Orphan Test Case: Test Case to Requirement Mapping using Test Case Name and Requirement Name (ALM)/ User Story Summary (Jira).

The general approach for these use cases is similar. They all use the same text preprocessing steps, and they use the same metric for comparing two distinct text documents: cosine similarity.

The text preprocessing steps are the same as described in the previous use case.

The primary distinction is in what is being compared to what.

The entities used to find a match in these use cases are listed below.

  1. Defect to Test Case Mapping: Each defect description is compared against all the test case names within a project, and the best 3 matches for each description are selected based on their cosine similarity value.
  2. Test Case to Requirement Mapping (ALM): Each test case name is compared against all the requirement names within a project, and the best 3 matches for each test case name are selected based on their cosine similarity value.
  3. Test Case to User Story Mapping (Jira): Each test case name is compared against all the user story summaries within a project, and the best 3 matches for each test case name are selected based on their cosine similarity value.
Figure: Outcome of the textual data analysis, with measured similarities
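
The difference from the duplicate use case is that two different collections are compared. The sketch below illustrates this under stated assumptions: it uses only the standard library, tokenizes by lower-casing and splitting on white space in place of the full preprocessing, and uses a log(N / df) IDF computed over both collections together so that the weights are comparable. The function `map_orphans` and the sample texts are hypothetical.

```python
import math
from collections import Counter


def map_orphans(orphans, targets, k=3):
    """Return, for each orphan text, the k best (target_index, score) pairs."""
    orphan_tokens = [text.lower().split() for text in orphans]
    target_tokens = [text.lower().split() for text in targets]
    all_docs = orphan_tokens + target_tokens
    # IDF is computed over both collections (assumed variant: log(N / df)).
    df = Counter(term for doc in all_docs for term in set(doc))
    idf = {term: math.log(len(all_docs) / n) for term, n in df.items()}

    def vectorize(tokens):
        return {term: count * idf[term] for term, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(w * b.get(term, 0.0) for term, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    target_vecs = [vectorize(tokens) for tokens in target_tokens]
    mapping = []
    for tokens in orphan_tokens:
        vec = vectorize(tokens)
        scored = [(j, cosine(vec, tv)) for j, tv in enumerate(target_vecs)]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        mapping.append(scored[:k])
    return mapping


defect_descriptions = [
    "checkout page crashes when applying discount coupon",
    "user cannot reset password from login screen",
]
test_case_names = [
    "verify password reset from login screen",
    "verify discount coupon on checkout page",
    "verify order history export",
    "verify profile picture upload",
]
mapping = map_orphans(defect_descriptions, test_case_names)
for description, top in zip(defect_descriptions, mapping):
    print(description, "->", [(test_case_names[j], round(s, 2)) for j, s in top])
```

The same function would cover test case to requirement or user story mapping by passing test case names as the orphans and requirement names or user story summaries as the targets.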


The expert corner: libraries used in use cases #1 and #2

Python

Natural Language Toolkit (NLTK): NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.

  1. corpus(): a collection of documents.
  2. re(): cleans the text of any HTML tags and punctuation.
  3. lower(): lower-cases the documents.
  4. Tokenization(): splits a sentence into tokens.
  5. Stop-words(): stop words carry little useful information for search and matching. Words such as articles and prepositions are considered stop words and are removed.
  6. PorterStemmer(): stems the tokens to their root form. Examples: running becomes run, cookies becomes cooki, and flying becomes fly.
  7. tf_idf(): converts a collection of raw documents to a matrix of TF-IDF features.
  8. count_vect(): transforms text into a sparse matrix of n-gram counts.
  9. cosine_similarity(): computes the cosine similarity between vectors and builds a matrix.

Then we find the best 3 matches for each document in this cross product matrix corresponding to each document entity.

R Programming Language

In R, there are multiple packages available to perform these tasks, such as openNLP, tm, Quanteda, and tidytext.

Among these, we used the Quanteda package because of its simplicity, performance, and speed. The approach is broadly similar across all of these packages, apart from some differences in functionality.

In the Quanteda package, we used the following functions to perform the tasks described above:

  1. corpus(): to build a corpus of texts and their corresponding ID values.
  2. tokens(): to tokenize the sentences into words.
  3. tokens_wordstem(): to stem the tokens to their root form.
  4. dfm(): to build a document-feature matrix, with documents in rows and words/features in columns. Each row is a different document and each column represents a different word/feature.
  5. tcrossprod_simple_triplet_matrix(): the resulting data is transformed into simple triplet matrix format so that its cross product is easy to compute with this function. That is how the cosine similarity matrix is built.

Then we find the best 3 matches for each document in this cross-product matrix corresponding to each document entity.

About the authors

Vivek Jaykrishnan

Vivek Jaykrishnan is an enterprise test consultant and architect with extensive experience. He has over 22 years of experience leading Verification and Validation functions, including functional, system, integration, and performance testing, in leadership positions with reputable organizations. Vivek has demonstrated success working across a variety of engagement models, including outsourced product verification and validation in service organizations, independent product verification and validation in captive units, and globally distributed development in a product company. Additionally, Vivek has extensive experience developing and implementing test strategies and driving testing in accordance with a variety of development methodologies, including continuous delivery, agile, iterative development, and the waterfall model. Vivek is passionate about incorporating cognitive intelligence into testing and is also interested in exploring the frontiers of IoT testing.

Vivek Sejpal

Vivek is a passionate data scientist with more than five years of experience in analytics and machine learning. He is currently engaged in the design and development of intelligent data-driven assets for quality engineering applications. Vivek works with customers on end-to-end engagements to demonstrate the value of various intelligent asset offerings across domains and industries. Additionally, Vivek contributes to the research and development of new intelligent assets for the purpose of resolving critical business problems through a data-driven approach that is consistent with business requirements.

Abhinandan H Patil

Abhinandan is the author of 10 technology books and 14 scientific articles in journals. Before this, he worked in a wireless network software organization as Lead Software Engineer for close to a decade. Abhinandan was in the USA for two long stints and was instrumental in releases of Mobility Manager at Motorola USA as single point of contact for the Network Simulator Tool. His research is available as books and a thesis in IJSER, USA. His thesis, published as a book, is rated as one of the best books of all time for regression testing by BookAuthority.org, and he received the RULA Award for the same thesis in 2019. He is an active researcher in the fields of machine learning, deep learning, data science, artificial intelligence, and regression testing applied to networks, communication, and the Internet of Things. He is an active contributor to science, technology, engineering, and mathematics (STEM). He is currently working on a few undisclosed books and has recently started blogging on technology and allied areas. He received the Adarsh Vidya Saraswati Rashtriya Puraskar in 2020. Abhinandan has been a Senior IEEE member since 2013 and is a member of Smart Tribe and the Cheeky Scientists Association. He also holds a mini MBA from IBMI, Germany, and is UGC-NET qualified (2012). He is the recipient of several Bravo awards for deserving work at Motorola and is on the editorial board of a few scientific journals. Dr. Patil is an ardent reader of STEM and has a desire to contribute more to it.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organisation of 325,000 team members in nearly 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported 2021 global revenues of €18 billion.

Get the Future You Want  I  www.capgemini.com

 

 

 
