State of AI applied to Quality Engineering 2021-22
Section 3.2: Inform & Interpret

Chapter 4 by Capgemini

NLP for downstream recommendations

Business ●○○○○
Technical ●●●●○

Understanding the impact of change is critical, but the subject-matter expert (SME) knowledge needed to assess it is expensive in time and effort. Customers therefore seek to automate the extraction of knowledge from structured and unstructured documents.

Human communication and knowledge are captured in documents. In corporate settings these documents come in all shapes and sizes, and almost all of them are unstructured text. Software code is written only after documents outlining the changes have been created. The cycle begins with a requirements document, which is then turned into a technical design document. The design is implemented as code and verified by test cases (another unstructured text document).

Considerable expertise is required at each stage of document creation. The figure below shows the documents view of the process (the names of these documents vary with the software lifecycle used).

Figure: Documents view of the process


In this article, we focus on two use cases that we implemented using NLP/NLU[1], both of which greatly improved the quality of the process. Each use case relies on understanding the input documents in order to make suggestions for downstream tasks.



Use case: Address matching

Insurance companies receive customer data from multiple source systems. The company is required by law to maintain a unique customer ID for each customer, even when that customer holds multiple accounts and products. Creating a single ID per customer requires automated address matching. The problem is that addresses are not stored consistently across source systems and a lot of data is missing: one system may store Arizona, another AZ, and in a third the state is missing altogether. To complicate matters, many city buildings house multiple customers. One building in NY has 1500+ customers of this insurance company, so the only differentiating factor is the PO Box number or door number.

Regular fuzzy matching techniques do not perform well here and are also orders of magnitude slower when dealing with millions of addresses. NLP models have become very powerful since the introduction of word embeddings, which in essence create a numeric proxy for the input word. This proxy is a vector, and it carries context when it is produced by model architectures such as BERT. We trained sentence embedding models to create a vector for each input address and then clustered the vectors with auto-encoders. This gave a very high degree of automation that would not have been possible with manual effort or traditional matching techniques.
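The embed-then-cluster idea can be illustrated with a much simpler stand-in. The sketch below is not the model described above: it assumes character-trigram counts in place of a trained sentence embedding, and a greedy cosine-similarity grouping in place of auto-encoder clustering. The function names, the sample addresses, and the 0.8 threshold are all hypothetical.

```python
import math
from collections import Counter


def embed(address: str) -> Counter:
    """Turn an address into a sparse vector of character-trigram counts.

    Stand-in for a learned sentence embedding; nothing is trained here.
    """
    # Lower-case, drop commas, and collapse whitespace before taking trigrams.
    text = " " + " ".join(address.lower().replace(",", " ").split()) + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(n * b[gram] for gram, n in a.items())
    norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
        sum(n * n for n in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(addresses: list[str], threshold: float = 0.8) -> list[list[str]]:
    """Greedy clustering: an address joins the first cluster whose seed it resembles."""
    clusters: list[tuple[Counter, list[str]]] = []
    for addr in addresses:
        vec = embed(addr)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(addr)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append((vec, [addr]))
    return [members for _, members in clusters]


# Hypothetical records from different source systems.
records = [
    "12 Main St Apt 4, Phoenix AZ",
    "12 main st apt 4   phoenix az",
    "PO Box 77, Albany NY",
]
groups = cluster(records)
# The first two records fall into one group; the PO Box address stands alone.
```

A trigram vector only captures surface similarity, so it would not connect Arizona with AZ. Closing that gap is what a trained, context-aware embedding is for.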

The figure below shows a sample of the automated mapping created by the tool. The tool also extracts the rules applied at each stage, which can be modified manually.

The same solution could be extended to a cable operator that wanted unique IDs for a customer base that frequently moves home. For example, a customer who moves to a new location will be shown as a new customer. Here again, name matching is not enough and other attributes need to be considered. Using the same technique as before, we trained models that outperformed the existing manual process of curating addresses.

Use case: Impact analyzer

Moving from a business requirements document (BRD) to a technical design document requires SME knowledge. SMEs are not easily available, which causes considerable delay in completing the design document and in turn affects timelines and productivity.

We used NLP extraction techniques to pull from the BRD the items relevant to report building, and then filtered that text down to the relevant terms. From these terms, we developed a recommender algorithm that uses the historical mapping of BRDs to technical documents to propose the appropriate rules and data elements from the data lake. The rules themselves were produced with a support-confidence framework.

A sample of rule extraction based on the support-confidence framework is given below. SPMF, an open-source tool, is used for rule extraction.
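To make the two measures concrete, here is a minimal hand-rolled sketch in Python. It does not use SPMF or reproduce its interface, and the transactions and element names are hypothetical. It only shows what support (how often an itemset occurs) and confidence (how often the consequent accompanies the antecedent) compute.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base


# Hypothetical history: each set pairs the terms found in one BRD line
# with the data-lake elements that were finally used for it.
history = [
    {"term:spend", "term:euro", "lake:card_spend_eur"},
    {"term:spend", "lake:card_spend_eur"},
    {"term:spend", "term:date", "lake:card_spend_eur", "lake:posting_date"},
    {"term:date", "lake:settlement_date"},
]

spend_rule = confidence(history, {"term:spend"}, {"lake:card_spend_eur"})  # 1.0
date_rule = confidence(history, {"term:date"}, {"lake:posting_date"})      # 0.5
```

In this toy history the spend term always maps to the same element, so the rule is safe to suggest, while the date term splits across elements and yields a weaker rule.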




Once the rules are extracted and tested for correctness, the system continuously learns from how the underlying transactions change (have the suggested rules been implemented, or does the actual implementation differ?). This in turn changes the confidence score of each rule, and with it the recommendation for the next cycle.
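One way to picture this feedback loop is a rule that keeps running counts and recomputes its confidence after each cycle. This is a sketch under assumptions: the `RuleStats` class, its fields, and the starting counts are hypothetical and do not describe the actual system.

```python
from dataclasses import dataclass


@dataclass
class RuleStats:
    """Running counts behind one rule's confidence score (hypothetical structure)."""

    antecedent_count: int  # times the rule's antecedent was observed
    joint_count: int       # times the suggested consequent was actually implemented

    @property
    def confidence(self) -> float:
        if self.antecedent_count == 0:
            return 0.0
        return self.joint_count / self.antecedent_count

    def record(self, implemented_as_suggested: bool) -> None:
        """Fold one new transaction into the counts."""
        self.antecedent_count += 1
        if implemented_as_suggested:
            self.joint_count += 1


rule = RuleStats(antecedent_count=4, joint_count=3)
before = rule.confidence   # 3 of 4 suggestions were implemented
rule.record(False)         # the implementation differed from the suggestion
after = rule.confidence    # 3 of 5, so the rule ranks lower next cycle
```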


For example, in the Business Requirements Document (BRD) for developing reports, the text given was “Spend in Euro for the selected data range”. There was only one spend item for the card in the underlying data lake, so it was easy to map and suggest from the historical rules. But another item, “Select Date”, was impossible to map: there are hundreds of date elements within the data lake and all of them are heavily used in reports. The tool therefore suggests that this date element be mapped manually.

Association rule engines work well only when there is reasonable “support”, the first part of the support-confidence framework. Newer data elements created in the system are unlikely to have enough support, so we had to combine association rule mining with rare event mining and apply filters on the creation date of the elements identified.
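The combination can be sketched as a two-tier filter: rules with enough support pass as usual, while rules on recently created elements are held to a lower bar. This is an assumption about how such a filter might look, not the implemented logic. The function name, the thresholds, and the 90-day window are hypothetical.

```python
from datetime import date


def classify_rule(
    rule_support: float,
    element_created: date,
    today: date,
    min_support: float = 0.05,       # usual support threshold (hypothetical)
    rare_min_support: float = 0.01,  # relaxed threshold for new elements (hypothetical)
    new_within_days: int = 90,       # what counts as "recently created" (hypothetical)
) -> str:
    """Decide whether a rule is kept, kept as a rare case, or dropped."""
    if rule_support >= min_support:
        return "frequent"
    is_new = (today - element_created).days <= new_within_days
    if is_new and rule_support >= rare_min_support:
        return "rare-new"
    return "rejected"


today = date(2021, 6, 1)
established = classify_rule(0.20, date(2020, 1, 1), today)  # "frequent"
fresh = classify_rule(0.02, date(2021, 5, 1), today)        # "rare-new"
stale = classify_rule(0.02, date(2020, 1, 1), today)        # "rejected"
```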

In other words, these solutions do not depend entirely on machine learning or deep learning. Elements of plain business logic are combined with them to give good results to users.

About the author

Rajeswaran Viswanathan

Rajeswaran Viswanathan is the head of the AI Center of Excellence in India. He has published many papers and articles. He is the author of proportion, a comprehensive R package for inference on single binomial proportion and Bayesian computations, which is most widely used for categorical data analysis. He has a passion for teaching and mentoring the next generation of data scientists.

About Capgemini

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organisation of 325,000 team members in nearly 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported in 2021 global revenues of €18 billion.
