State of Artificial Intelligence Applied to Quality Engineering 2021-2022
Section 5: Manage data

Chapter 4 by Sogeti & Capgemini

Use case: a synthetic data platform for quality engineering

Business ●●●●○
Technical ●○○○○

This chapter discusses several real-world examples of how a synthetic data generation platform works and the resulting significant benefits.

Synthetic refers to something that is not real or natural, and the same holds true for data. Synthetic data is not a new concept; it has existed in quality assurance and testing for a long time. Using traditional tools, you could create your own rule-based, production-like data for a variety of testing purposes.

The primary disadvantage of rule-based synthetic data is that it may not accurately represent real-world data: defined rules cannot possibly cover all use cases and business behaviors. To be considered effective, a system must pass testing under real-world scenarios, and for those scenarios rigorous testing with real production data is necessary. This, however, is no longer permitted under GDPR. As a solution, businesses began anonymizing datasets for quality assurance and testing, since global regulations no longer consider anonymized data to be personal data.

For a financial institution, where data privacy is critical and strict restrictions govern how actual data may be used, it is challenging to test applications with high-quality data and obtain the best results. Several significant obstacles include the following:

  1. The firm cannot use production data for testing, as doing so violates privacy regulations
  2. Production data cannot be transferred within the company due to the sensitivity of the information and privacy regulations
  3. Heavy penalties are imposed for any violation of data privacy

To address the aforementioned issues, Sogeti has developed a deep learning-based platform named ADA (Artificial Data Amplifier).

ADA is an artificial intelligence solution that generates synthetic data that has the appearance and feel of real data. By providing complete access to data without jeopardizing customer trust, compliance, or privacy and security, ADA enables analytics and software development.

ADA is scalable, which means that it can generate large amounts of data from a relatively small dataset. ADA maps real-world data and generates synthetic data that retains all relationships and characteristics.

ADA is capable of producing data in a variety of formats, including tabular data (databases), images, and free text.

How does it work?

ADA is a six-step process. The first four steps are part of the implementation, during which Sogeti's ADA experts train and deploy an AI model; the final two assist end users in requesting data.

Figure: ADA six-step process

Step 1 – Extract a real dataset – ADA requires high-quality, production-like data in order to train the AI model on the data's characteristics. This step will extract and clean up the data.
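The extraction and clean-up in Step 1 might look like the following minimal sketch (pandas). The column names `customer_id`, `amount`, and `country` are invented examples, not ADA's actual schema:

```python
import pandas as pd

# Illustrative clean-up of a production-like extract: deduplicate, drop rows
# with a missing key, impute numeric gaps, and normalize text values.
def extract_and_clean(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.copy()
    df = df.drop_duplicates()                                   # exact duplicates
    df = df.dropna(subset=["customer_id"])                      # rows without a key
    df["amount"] = df["amount"].fillna(df["amount"].median())   # impute numerics
    df["country"] = df["country"].str.strip().str.upper()       # normalize text
    return df.reset_index(drop=True)

raw = pd.DataFrame({
    "customer_id": [1, 1, 2, None, 3],
    "amount": [100.0, 100.0, None, 50.0, 250.0],
    "country": [" se", " se", "US", "US", "fr"],
})
clean = extract_and_clean(raw)
print(len(clean))  # 3: one duplicate and one key-less row removed
```

A real extraction would of course pull from the production store rather than an in-memory frame, but the shape of the step is the same.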

Step 2 – Ingest to ADA – The extracted data will be ingested into the ADA framework, which will then learn from it. Various accelerators will be used to pre-process the data and make it trainable during this stage. This step will also create a mapping of the data's referential integrity.

The term "referential integrity" refers to the dependencies and relationships of data across databases, tables, and applications.
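A referential-integrity mapping can be captured as a record of which column in a child table references which key in a parent table. The sketch below is an illustrative analogue of such a mapping and a check against it; the structure and names are assumptions, not ADA's internal format:

```python
# Hypothetical mapping: the "orders" table's customer_id column references the
# "customers" table's primary key.
orders = {
    "table": "orders",
    "primary_key": "order_id",
    "foreign_keys": [
        {"column": "customer_id", "references": ("customers", "customer_id")}
    ],
}

def check_integrity(parent_rows, child_rows, mapping):
    """Verify every foreign-key value in the child exists in the parent."""
    fk_col = mapping["column"]
    _, pk_col = mapping["references"]
    parent_keys = {row[pk_col] for row in parent_rows}
    return all(row[fk_col] in parent_keys for row in child_rows)

parents = [{"customer_id": 1}, {"customer_id": 2}]
children = [{"order_id": 10, "customer_id": 1},
            {"order_id": 11, "customer_id": 2}]
print(check_integrity(parents, children, orders["foreign_keys"][0]))  # True
```

Generated synthetic tables must satisfy the same mapping, so capturing it before training is what lets cross-table relationships survive into the synthetic data.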

Step 3 – Scrub or Mask – This is an optional step in which the data is scrubbed or masked if it is too sensitive to train with in its current state.

Step 4 – Train ADA – This stage utilizes the mapping and training-set data to train an ADA framework model. The model gains knowledge of the data's characteristics and relationships during this stage by iteratively traversing the data. This stage results in a trained model that is completely knowledgeable about the data (data types, formats, rules, nature of the data, relationships, etc.).
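ADA's actual model is a proprietary deep-learning framework, but the idea behind Step 4 (learning the data's joint characteristics so they can later be reproduced) can be illustrated with a much simpler stand-in: a Gaussian mixture fitted to a small correlated dataset. The column names and distribution are invented for the example:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Build a small "training set" with two correlated numeric columns.
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, size=500)
spend = 0.3 * income + rng.normal(0, 1_000, size=500)  # correlated with income
real = np.column_stack([income, spend])

# Fit a generative model to the joint distribution. This plays the role of the
# "trained model" in the ADA process; it is NOT ADA's real architecture.
model = GaussianMixture(n_components=3, random_state=0).fit(real)

# Sampling from the fitted model yields synthetic rows that preserve the
# income/spend correlation without copying any real row.
synthetic, _ = model.sample(500)
print(synthetic.shape)  # (500, 2)
```

The point of the exercise is the same as in ADA: once the model is fitted, new data can be drawn from it indefinitely without touching the real data again.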

Step 5 – Generate Data – At this stage, an interface will be used to connect to the trained model, and end users will be able to request data by providing the model with a few simple parameters. The model will generate the data and perform the necessary operations to store it.
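The request step can be sketched as a thin function over a trained sampler. Here the sampler is a trivial stand-in for the trained ADA model, and the parameter names (`rows`, `seed`, the record fields) are illustrative assumptions rather than ADA's real API:

```python
import random

# Stand-in for a trained model: returns one synthetic record per call.
def make_sampler(categories, seed=42):
    rnd = random.Random(seed)
    def sample_row():
        return {"country": rnd.choice(categories),
                "amount": round(rnd.uniform(10, 500), 2)}
    return sample_row

def generate_data(sampler, rows: int):
    """Generate `rows` synthetic records on demand, as an end user would
    request through the UI."""
    return [sampler() for _ in range(rows)]

sampler = make_sampler(["SE", "US", "FR"])
batch = generate_data(sampler, rows=5)
print(len(batch))  # 5
```

In a real deployment the UI passes these parameters to an API, which invokes the trained model instead of a random sampler.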

Step 6 – Push data – This step is used to push data to a database or shared drive, or to an API or other mode of data transfer. This step is completely customizable to the client's specifications.
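The push step can be as simple as bulk-inserting the generated records into the client's target store. The sketch below uses an in-memory SQLite database purely as a self-contained stand-in for whatever database, shared drive, or API the client specifies:

```python
import sqlite3

# Hypothetical generated records (country, amount); field names are invented.
records = [("SE", 120.5), ("US", 80.0), ("FR", 300.25)]

# Push to the target store. ":memory:" stands in for the client's database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE synthetic_orders (country TEXT, amount REAL)")
conn.executemany("INSERT INTO synthetic_orders VALUES (?, ?)", records)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM synthetic_orders").fetchone()[0]
print(count)  # 3
```

Swapping the connection string (or replacing the insert with a file write or an API call) is what makes this step customizable to the client's specifications.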

The trained model can be deployed both in the cloud and on-premises. The entire technology stack is open source and requires no licenses.

Where can we use ADA?

  1. Testing and Development
    1. Generate production-quality data for a higher grade of quality engineering.
    2. Generate testing data for ML models and accelerate the Dev and QA process.

  2. GDPR Compliance
    1. Synthesize sensitive client information.
    2. Synthesize any other PII.
    3. Make data sharing within the company secure.

  3. Need for More Data
    1. Most AI implementations need more data to train the model; ADA can generate synthetic data to enable them.
    2. Performance testing.

The Solution

The ideal situation is to have ADA produce such data on demand for quality engineers in real time.

Consider an ecosystem with numerous production data warehouses and data marts containing real data. Two different scenarios are possible to support testing:

  1. Creating data in the lower environment –
    1. This can be time-consuming.
    2. Data is created from randomly chosen values.
    3. Not all data scenarios are covered.
    4. Not production quality.
    5. Costly.

  2. Copying data from upper environments –
    1. Time- and resource-consuming; it sometimes takes weeks for data to be made available.
    2. Masking is imposed for data security.
    3. Masking is not a 100% secure solution and can be reverse-engineered.
    4. Costly.

Figure: ADA, the synthetic model

With ADA in place, the synthetic model is trained once on real-world data. Having learned the data's characteristics, the generating system never needs to look at the real data again; it is self-sufficient for generating data on demand.

Once the trained model is in place, quality engineers can request relevant data for testing purposes by passing a few parameters to the system via a user interface (UI), powered by an API that connects to the trained model. Data resembling a large volume of production data can be generated quickly (in minutes), helping engineers mitigate data issues.

The question that always arises with synthetic data is how much we can trust it. We can compare the synthetic data's similarity to the real data using the latest statistical comparison methodologies. Numerous reporting parameters, such as the correlation matrix, detectability score, and duplicity score, help determine the quality of the synthetic data in comparison to the actual data. The synthetic data quality report demonstrates that a machine-generated dataset is suitable for validating the quality of other applications.
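Two of the metrics mentioned here, the correlation matrix and a detectability score, can be approximated with standard tooling. In this illustrative sketch, detectability is framed as the cross-validated accuracy of a classifier trying to tell real rows from synthetic ones: accuracy near 0.5 means the synthetic data is hard to distinguish. The exact scores ADA reports are not public, so this is an analogue, not the platform's actual report:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Two datasets drawn from the same correlated distribution stand in for a
# "real" table and a high-quality "synthetic" copy of it.
rng = np.random.default_rng(1)
real = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=400)
synthetic = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=400)

# Correlation-matrix comparison: a small maximum gap means the synthetic data
# reproduces the real data's correlation structure.
corr_gap = np.abs(np.corrcoef(real.T) - np.corrcoef(synthetic.T)).max()

# Detectability: cross-validated accuracy of a real-vs-synthetic classifier.
X = np.vstack([real, synthetic])
y = np.array([0] * len(real) + [1] * len(synthetic))
detectability = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

print(round(corr_gap, 3))       # close to 0
print(round(detectability, 2))  # close to 0.5
```

A detectability score well above 0.5 would indicate that the classifier finds systematic differences, i.e. that the synthetic data does not yet faithfully mimic the real data.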

Use Case #1: US-based tax and audit firm

We implemented ADA as a synthetic data generation solution for one of the largest tax and audit firms in the United States. The framework was implemented across their landscape for over 50 applications utilizing a variety of technologies. Several significant benefits accrued as a result of this implementation:

  1. Can be used to test applications while remaining regulatory compliant,
  2. Can be used to perform performance testing on scalable generated data,
  3. Capable of generating data on demand,
  4. High-quality data enables more accurate testing.

Use Case #2: Swedish government agency

A large Swedish government agency was attempting to incorporate artificial intelligence into its daily operations. The agency's data included highly personalized and extremely sensitive information, necessitating the adoption of stringent security measures and conducting ethical reviews in advance of any work being performed.

This demonstrated that substituting synthetic data for real data opens up a plethora of opportunities for leveraging data to drive business value.

The ADA platform generated tabular, image, and unstructured text data that could be used in conjunction with or in place of the original data. By utilizing synthetic data, it became possible to make data available to the public while maintaining confidentiality.

To learn more about the Artificial Data Amplifier, please refer to https://www.sogeti.com/ada

About the author

Mark Oost

Mark has over 10 years of experience in AI and analytics. Before joining the group, Mark was responsible for AI within Sogeti Netherlands, where he led the development of the team and business as well as the AI strategy. Before that, he worked as a (lead) data scientist for Sogeti, Teradata, and Experian, serving international clients across multiple markets on technologies around AI, deep learning, and machine learning.

Arun Sahu

An experienced technology specialist with a focus on AI and data analytics, Arun leads the AI CoE team in Sogeti India and is part of the Sogeti global AI team. He has designed and developed various AI solutions and offerings with his team. His responsibilities include working with AI leads from various countries to help with sales, proposals, pre-sales, solutioning, and implementation.

About Sogeti & Capgemini

Part of the Capgemini Group, Sogeti operates in more than 100 locations globally. Working closely with clients and partners to take full advantage of the opportunities of technology, Sogeti combines agility and speed of implementation to tailor innovative future-focused solutions in Digital Assurance and Testing, Cloud and Cybersecurity, all fueled by AI and automation. With its hands-on ‘value in the making’ approach and passion for technology, Sogeti helps organizations implement their digital journeys at speed.

Visit us at www.sogeti.com

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization of 270,000 team members in nearly 50 countries. With its strong 50-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported 2020 global revenues of €16 billion.
Get the Future You Want | www.capgemini.com