State of AI applied to Quality Engineering 2021-22
Section 3.2: Inform & Interpret

Chapter 1 by Capgemini

NLP & NLU Fundamentals

Business ●○○○○
Technical ●●●●○

The true success of NLP is in the fact that humans are deceived into believing they are communicating with other humans rather than machines.

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence that studies the interactions of computers and human language, specifically how to train computers to process and analyze massive volumes of natural language data. The goal is for a computer to "understand" the contents of documents, including the contextual nuances of the language within them. The technology can then extract accurate information and insights from the documents, as well as categorize and organize them.

Natural Language Processing can be described as “the ability of machines to understand and interpret human language the way it is written or spoken.”

NLP deals with different aspects of language, such as:

  • Phonology – the systematic organization of sounds in a language.
  • Morphology – the study of how words are formed and how they relate to one another.

Approaches to semantic analysis in NLP:

  1. Distributional: employs large-scale statistical techniques from Machine Learning and Deep Learning.
  2. Frame-Based: sentences that are syntactically distinct but semantically equivalent are represented within a single data structure (a frame) that describes a stereotyped situation.
  3. Theoretical: based on the premise that sentences refer to the real world (the sky is blue) and that sentence fragments can be combined to express the meaning of the entire sentence.
  4. Interactive Learning: takes a pragmatic approach, with the user instructing the computer step by step in an interactive learning environment.

Process of NLP

Three levels of linguistic analysis are carried out as part of NLP:

  1. Syntax: Which parts of the given text are grammatically correct?
  2. Semantics: What is the meaning of the given text?
  3. Pragmatics: What is the purpose of the text?

The mechanism of Natural Language Processing involves two successive processes:

  • Natural Language Understanding
  • Natural Language Generation

Natural Language Understanding

Natural Language Understanding (NLU) attempts to comprehend the meaning of a given text. NLU requires knowledge of the nature and structure of each word contained within a text. To comprehend structure, NLU attempts to resolve the following kinds of ambiguity in natural language:

  • Lexical ambiguity: a word has multiple meanings,
  • Syntactic ambiguity: a sentence has multiple parse trees,
  • Semantic ambiguity: a sentence has multiple meanings,
  • Anaphoric ambiguity: a word or phrase refers back to something mentioned earlier, and it is unclear what it refers to.

Next, the sense of each word is determined using lexicons (vocabularies) and a set of grammatical rules. This is complicated by the fact that different words can have similar meanings (synonymy) and a single word can have more than one meaning (polysemy).
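Resolving lexical ambiguity with a lexicon can be illustrated with a much-simplified variant of the Lesk approach, which picks the sense whose dictionary gloss overlaps most with the surrounding words. This is a minimal sketch only: the two glosses for "bank" and the stopword list are hand-written assumptions, not taken from any real lexicon.

```python
import re

# Hypothetical, hand-written sense inventory for the word "bank".
SENSES = {
    "financial institution": "an organization that holds money deposits and makes loans",
    "river edge": "the sloping land along the side of a river or stream",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "to", "on", "in", "at", "by"}


def content_words(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def disambiguate(sentence, senses=SENSES):
    """Return the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence)
    return max(senses, key=lambda s: len(context & content_words(senses[s])))


print(disambiguate("She opened an account to deposit money at the bank"))
print(disambiguate("They sat on the bank of the river and fished"))
```

Because the overlap is computed on exact word forms, "deposit" and "deposits" do not match here; normalizing word forms first would make the comparison more robust.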

As we have seen with the above use cases, several attributes/artifacts are expressed in natural language at every stage of the software engineering lifecycle, such as commit comments, code review comments, remarks in task management and defect management tools, requirements, and test case descriptions. These natural language artifacts have enormous potential if the information they contain is assessed and used properly with NLU. For instance, the natural language commit comments that developers write when checking in code can be used to infer the type of work the developer performed.
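The commit-comment example can be sketched with a simple keyword lookup. This is an illustrative stand-in for a real NLU model, and the work-type labels and keyword lists below are hypothetical; a real project would tune them or learn them from labeled commits.

```python
import re

# Hypothetical keyword lists per work type.
WORK_TYPE_KEYWORDS = {
    "bug fix": {"fix", "fixes", "fixed", "bug", "defect", "crash"},
    "feature": {"add", "adds", "added", "implement", "feature", "support"},
    "refactoring": {"refactor", "rename", "cleanup", "simplify", "restructure"},
    "test": {"test", "tests", "coverage", "assert"},
}


def classify_commit(message):
    """Return the work type whose keywords occur most often in the message."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {label: len(words & kws) for label, kws in WORK_TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    # Fall back to "other" when no keyword matched at all.
    return best if scores[best] > 0 else "other"


for msg in ["Fix crash when the config file is missing",
            "Add support for CSV export",
            "Update README"]:
    print(msg, "->", classify_commit(msg))
```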

Natural Language Generation

Natural Language Generation (NLG) is the process of automatically producing readable text, made up of meaningful phrases and sentences, from structured data. It is a difficult problem to solve. Natural language generation is commonly divided into three stages:

  • Text Planning: the primary content in the structured data is selected and ordered.
  • Sentence Planning: the selected content is grouped into sentences that represent the flow of information.
  • Realization: finally, grammatically accurate sentences are generated to represent the material.
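The three stages above can be sketched on a small structured record. This is a minimal, template-based illustration; the test-run record, the function names, and the sentence templates are all assumptions made for the example.

```python
# Hypothetical structured input: the summary of one test run.
RUN = {"suite": "checkout", "passed": 48, "failed": 2, "duration_s": 95}


def plan_text(record):
    """Text planning: choose which facts to report and in what order."""
    facts = [
        ("total", record["passed"] + record["failed"]),
        ("duration", record["duration_s"]),
        ("failed", record["failed"]),
    ]
    return record["suite"], facts


def plan_sentences(suite, facts):
    """Sentence planning: group the ordered facts into sentence-level messages."""
    values = dict(facts)
    return [
        ("ran", {"suite": suite, "count": values["total"], "seconds": values["duration"]}),
        ("failed", {"count": values["failed"]}),
    ]


def pluralize(count, noun):
    """Return e.g. '1 test' or '2 tests'."""
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def realize(plans):
    """Realization: turn each message into a grammatical sentence."""
    sentences = []
    for kind, v in plans:
        if kind == "ran":
            sentences.append(
                f"The {v['suite']} suite ran {pluralize(v['count'], 'test')} "
                f"in {pluralize(v['seconds'], 'second')}."
            )
        elif kind == "failed":
            if v["count"] == 0:
                sentences.append("All tests passed.")
            else:
                sentences.append(f"{pluralize(v['count'], 'test')} failed.")
    return " ".join(sentences)


def generate(record):
    suite, facts = plan_text(record)
    return realize(plan_sentences(suite, facts))


print(generate(RUN))
```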

Overview of NLP Techniques

  • Pattern Recognition: This technique compares log messages to those recorded in a pattern book in order to filter out log messages. In quality engineering, pattern recognition can assist in detecting frequently occurring system failure patterns, which can then be considered candidates for auto diagnosis and recovery via solution framing.

  • Text Normalization: Normalization of text messages is the process of converting disparate messages to a common format. This is done when separate log messages use distinct terminologies but originate from the same source, such as programs or operating systems. By normalizing text, we seek to reduce its variability and bring it closer to a predefined "standard." This limits the amount of varied data that the computer must process, and so increases efficiency. The purpose of normalization techniques such as stemming and lemmatization is to reduce a word's inflectional, and occasionally derivationally related, forms to a single base form. Text normalization can streamline analysis of natural language sources such as comments and descriptions for information retrieval.

  • Automated Text Classification and Tagging: Text classifiers can be used to organize, arrange, and categorize virtually any type of text, from documents, studies, and files to text found throughout the web. Text classification and tagging is the process of ordering and tagging text using a variety of keywords for subsequent analysis. In software engineering, text classification can be used to quickly and cost-effectively structure a variety of relevant text, such as documents, social media posts, chatbot conversations, and surveys. This saves time when evaluating text data and supports the automation of corporate processes and data-driven decision-making.

  • Artificial Ignorance: This technique uses Machine Learning algorithms to eliminate irrelevant log messages. It is also used to detect anomalies in system functionality and software operations, enabling engineers to generate tailored alerts in real time in the event of a server/system malfunction or security breach.
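Three of the techniques above (text normalization, pattern recognition, and artificial ignorance) can be combined in one small log-triage sketch. It rests on several assumptions: the masking rules and the pattern book are hypothetical, and a plain frequency count over a baseline of normal logs stands in for the Machine Learning step that artificial ignorance would use in practice.

```python
import re
from collections import Counter

# Text normalization: mask variable parts so equivalent messages share one form.
# Order matters: IP addresses are masked before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def normalize(message):
    """Lowercase, mask IPs and numbers, and collapse whitespace."""
    text = message.lower().strip()
    for pattern, placeholder in MASKS:
        text = pattern.sub(placeholder, text)
    return re.sub(r"\s+", " ", text)


# Pattern recognition: a hypothetical pattern book of known failure signatures.
PATTERN_BOOK = {
    "disk_full": re.compile(r"no space left on device"),
    "db_timeout": re.compile(r"database .*timed out"),
}


def recognize(message):
    """Return the names of all pattern-book entries that match the message."""
    text = normalize(message)
    return [name for name, rx in PATTERN_BOOK.items() if rx.search(text)]


# Artificial ignorance: learn what is routine from a baseline, then ignore it.
def learn_routine(baseline, min_count=2):
    """Treat any normalized message seen at least min_count times as routine."""
    counts = Counter(normalize(m) for m in baseline)
    return {t for t, c in counts.items() if c >= min_count}


def triage(messages, routine):
    """Drop routine messages; pair each remaining one with its matched patterns."""
    return [(m, recognize(m)) for m in messages if normalize(m) not in routine]


baseline = ["Heartbeat ok from node 1", "Heartbeat ok from node 2",
            "Cache refreshed in 12 ms", "Cache refreshed in 15 ms"]
incoming = ["Heartbeat ok from node 7",
            "Write failed: No space left on device",
            "Authentication failed for user 42"]
for message, patterns in triage(incoming, learn_routine(baseline)):
    print(message, "->", patterns or "unknown, needs review")
```

Messages that survive the filter but match no known pattern are the ones most likely to deserve an engineer's attention.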

Natural Language Processing Methodology

Using NLP involves the basic operations listed below:

  1. Data Collection: Collect relevant data.
  2. Segmentation & Tokenization: Break a paragraph of text down into smaller chunks such as sentences or words.
  3. Text Cleaning: Remove unnecessary elements.
  4. Vectorization & Feature Engineering: Transform the text into numerical vectors.
  5. Text Lemmatization & Stemming: Both reduce inflected words to a base form. Lemmatization considers the context and converts the word to its meaningful base form (the lemma), whereas stemming simply removes the last few characters, which often leads to incorrect meanings and spelling errors.
  6. Training Model: Train a model using ML algorithms to solve the NLP problem.
  7. Interpretation of the Result: Interpret and validate the outcomes of the trained model.
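Most of the steps above can be sketched end to end using only the Python standard library. The example below is a minimal illustration, not a production pipeline: the four labeled samples and the stopword list are invented, the model is a small multinomial Naive Bayes classifier with Laplace smoothing chosen for brevity, and step 5 (lemmatization and stemming) is omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Step 1, data collection: a tiny, hypothetical labeled data set.
SAMPLES = [
    ("Login page crashes when password is empty", "defect"),
    ("Application crashes on startup with null pointer error", "defect"),
    ("Add dark mode to the settings page", "feature request"),
    ("Please add export to CSV option", "feature request"),
]

STOPWORDS = {"the", "a", "an", "is", "on", "in", "to", "when", "and", "of"}


def tokenize(text):
    """Steps 2 and 3: split into lowercase word tokens and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train(samples):
    """Steps 4 and 6: build per-label bag-of-words counts for Naive Bayes."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Step 7: score each label and return the most probable one."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)  # log prior
        for tok in tokenize(text):
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][tok] + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(SAMPLES)
print(predict(model, "Report screen crashes with error"))
print(predict(model, "Add option to export settings"))
```

With so few samples the predictions only demonstrate the mechanics; validating a real model requires held-out data.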

Widely used NLP Libraries

There are many libraries, packages, and tools available on the market, each with its own pros and cons. Python is currently the language with the widest selection of compatible libraries. The table below gives a summarized view of the features of some widely used libraries. Most of them provide the basic NLP operations mentioned earlier. Each NLP library was built with certain objectives in mind, so a single library might not provide a solution for everything.

NLP libraries and their features

NLTK
  • Well known & robust
  • High coverage of variety of NLP tasks
  • Supports multiple languages
  • No integrated word vectors

spaCy
  • Very fast NLP framework
  • Optimized functions for various NLP tasks
  • Neural Networks support for training NLP models
  • Low language support

Scikit-learn NLP toolkit
  • Very effective and widely used
  • Well documented
  • No Neural Network support

Gensim
  • Primary use for unsupervised text modeling
  • Supports Deep Learning
  • Processes large and streaming data sets

Challenges in Natural Language Processing

In Quality Engineering, the scope for applying NLP techniques to natural language data depends on multiple factors, which need to be examined and evaluated before implementation. Some key challenges faced during implementation are listed below:

Data Challenges:

  • The main challenge is information overload, which makes it difficult to access a specific, important piece of information within vast datasets. Semantic and contextual understanding is essential for summarization systems, but it is also challenging because of data quality and usability issues. Identifying the context of interaction among entities and objects is likewise a crucial task, especially with high-dimensional, heterogeneous, complex, and poor-quality data.
  • Data ambiguities add challenges to contextual understanding. Semantics are important for finding the relationships among entities and objects. Extracting entities and objects from text and visual data cannot provide accurate information unless the context and semantics of their interaction are identified.
  • The next challenge is the extraction of relevant and correct information from unstructured or semi-structured data using Information Extraction (IE) techniques. It is necessary to understand the capabilities and limitations of existing IE techniques for data pre-processing, data extraction and transformation, and representation of vast volumes of multidimensional unstructured data. High efficiency and accuracy are very important for these IE systems. However, the complexity of big and real-time data brings challenges for Machine Learning based approaches: dimensionality of data, scalability, distributed computing, adaptability, and usability. Effectively handling sparse, imbalanced, and high-dimensional datasets is complex.

Text related challenges:

  • Large repositories of textual data are generated from diverse sources. ML and NLP have emerged as the most potent and most widely used technologies for analyzing text, and text classification remains the most popular technique. Text classification can be Multi-Label (MLC) or Multi-Class (MCC). In MCC, every instance is assigned exactly one class label, whereas MLC assigns multiple labels to a single instance.
  • Solving MLC problems requires an understanding of multi-label data pre-processing for big data analysis. MLC can become very complicated because of the characteristics of real-world data, such as a high-dimensional label space, label dependency, uncertainty, drift, incompleteness, and imbalance. Data reduction for high-dimensional datasets and classification of multi-instance data are also challenging tasks.
  • Then there are the issues posed by language translation. The main challenge is not in translating words, but in understanding the meaning of sentences well enough to provide an accurate translation. Each text comes with different words and requires specific language skills. Choosing the right words for the context and purpose of the content is more complicated still.
  • A language may not have an exact match for a certain action or object that exists in another language. Idiomatic expressions explain something by way of unique examples or figures of speech. Most importantly, the meaning of such phrases cannot be predicted from the literal definitions of the words they contain.
  • The standard challenge for all new tools is processing, storage, and maintenance. Building NLP pipelines is a complex process that involves pre-processing, sentence splitting, tokenization, part-of-speech (POS) tagging, stemming and lemmatization, and the numerical representation of words. NLP requires high-end machines to build models from large and heterogeneous data sources.
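The difference between multi-class and multi-label classification mentioned in the list above comes down to how per-label scores are turned into a decision. The sketch below assumes a classifier has already produced a score for each label; the labels, the scores, and the 0.5 threshold are all hypothetical.

```python
# Hypothetical per-label scores for one defect report.
SCORES = {"ui": 0.72, "performance": 0.64, "security": 0.08}


def multi_class(scores):
    """Multi-class: exactly one label, the highest-scoring one."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Multi-label: every label whose score reaches the threshold."""
    return sorted(label for label, s in scores.items() if s >= threshold)


print(multi_class(SCORES))
print(multi_label(SCORES))
```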

Example: NLP for Log Analysis & Mining

A log is a collection of messages from various network devices and hardware in chronological order. They are generated automatically in response to the occurrence of events and include natural language notes. Logs can be routed to files on hard disks or sent as a stream of messages over the network to a log collector. Logs enable the process of monitoring and maintaining hardware/software performance, parameter tweaking, software and system emergency and recovery, and application and infrastructure optimization. 

Natural Language Processing techniques are widely used in log analysis and log mining:

  • Log analysis is the process of extracting information from logs, taking into account the differing syntax and semantics of the messages in the log files, and interpreting them in the context of the application. This allows log files from various sources to be compared for anomaly detection and for finding correlations.
  • Log mining, also known as Log Knowledge Discovery, is the process of extracting patterns and correlations from logs in order to uncover knowledge and predict any anomalies indicated by the log messages.

To convert log messages into structured form, techniques such as tokenization, stemming, lemmatization, and parsing are used. Once structured logs are available, log analysis and log mining are used to extract relevant information and knowledge from the data.
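Converting a raw log line into structured form can be sketched with a regular expression and a simple tokenizer. The line layout assumed here (date, time, level, component, message) is hypothetical; real log formats vary and need their own patterns.

```python
import re

# Assumed line layout: "<date> <time> <LEVEL> <component>: <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<component>[\w.-]+): "
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Tokenize the free-text message for later analysis or mining.
    record["tokens"] = re.findall(r"[a-z0-9_.]+", record["message"].lower())
    return record


print(parse_line("2022-01-15 10:32:07 ERROR payment-service: Timeout calling gateway after 30s"))
```

Lines that return None do not follow the assumed layout and can be collected separately for inspection.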

About the authors

Vivek Sejpal

Vivek is a passionate data scientist with more than five years of experience in analytics and machine learning. He is currently engaged in the design and development of intelligent data-driven assets for quality engineering applications. Vivek works with customers on end-to-end engagements to demonstrate the value of various Intelligent asset offerings across domains/industries. Additionally, Vivek contributes to the research and development of new intelligent assets for the purpose of resolving critical business problems through a data-driven approach that is consistent with business requirements.

Vivek Jaykrishnan

Vivek Jaykrishnan is an enterprise test consultant and architect with extensive experience. He has over 22 years of experience leading Verification and Validation functions, including functional, system, integration, and performance testing, in leadership positions with reputable organizations. Vivek has demonstrated success working across a variety of engagement models, including outsourced product verification and validation in service organizations, independent product verification and validation in captive units, and globally distributed development in a product company. Additionally, Vivek has extensive experience developing and implementing test strategies and driving testing in accordance with a variety of development methodologies, including continuous delivery, agile, iterative development, and the waterfall model. Vivek is passionate about incorporating cognitive intelligence into testing and is also interested in exploring the frontiers of IoT testing.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organisation of 325,000 team members in nearly 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported in 2021 global revenues of €18 billion.

Get the Future You Want  I  www.capgemini.com

 

 

 
