State of Artificial Intelligence applied to Quality Engineering 2021-2022
Section 5: Manage data

Chapter 1 by Sogeti & Capgemini

The rise of synthetic data

Business ●●●○○
Technical ●●○○○


As the word suggests, synthetic refers to something that is not real or natural, and the same holds true for data. Synthetic data is not a new concept; it has existed in quality assurance and testing for decades.

Using traditional tools, you could create your own rule-based data that mimics production data and can be used for a variety of testing purposes.
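Such a rule-based generator can be sketched in a few lines of Python; the field names, rules, and value ranges below are invented purely for illustration:

```python
import random

random.seed(42)  # make the generated dataset reproducible

# Hypothetical rules describing what "production-like" customer records look like.
FIRST_NAMES = ["Alice", "Bob", "Chen", "Dana"]
COUNTRIES = ["NL", "FR", "IN", "US"]

def generate_customer() -> dict:
    """Generate one synthetic customer record from fixed, hand-written rules."""
    return {
        "name": random.choice(FIRST_NAMES),
        "age": random.randint(18, 90),                    # rule: adults only
        "country": random.choice(COUNTRIES),
        "balance": round(random.uniform(0, 10_000), 2),   # rule: non-negative balance
    }

customers = [generate_customer() for _ in range(1000)]
```

Every rule must be written by hand, which is exactly why this approach struggles to reproduce the full variety of real production data.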

The primary disadvantage of rule-based synthetic data is that it may not accurately represent real-world data: hand-defined rules cannot cover every use case and business behaviour. To be considered effective, a system must pass its tests in real-world scenarios, and for those scenarios rigorous testing with real production data is necessary. This, however, is no longer permitted under GDPR. As a solution, businesses began anonymizing datasets for quality assurance and testing, since global regulations no longer consider anonymized data to be personal data.

Anonymization, however, is not the best method for securing data. Because the underlying data is real, merely masked, the biggest risk is that individuals can be re-identified within the dataset. This is now occurring: numerous studies have demonstrated that individuals can be re-identified in datasets that were deemed anonymous.

For instance, in 2016, journalists were able to re-identify politicians in the internet browsing histories of three million German citizens, and beyond their identities, they were also able to obtain the politicians' complete medical history, education, and sexual preferences. In another example, the Australian Department of Health released an anonymized dataset of medical records; within six weeks, researchers were able to decode it all. A third example concerns hospital discharge data from the United States, which was re-identified by joining in additional basic demographic characteristics, allowing citizens' complete diagnostic codes to be extracted.

All of these studies have demonstrated to the world that data anonymization is not the way forward. What, then, is the solution? Do we revert to traditional rule-based synthetic data in place of anonymized data, and accept the harm to quality assurance and testing performance? The irony is that, using AI, we can also create synthetic datasets that look exactly like your original data, and this chapter will introduce you to the concepts of synthetic data. In the first part, we discuss how to use synthetic data in place of anonymized data; in the second, we discuss testing with fully synthetic data and the rise of more complex synthetic systems.

Generative AI

In 2014, machine learning researcher Ian Goodfellow and his colleagues authored the first paper on Generative Adversarial Networks. In it, he describes the GAN architecture, which is, in a nutshell, a stack of two neural networks: a generator and a discriminator. One network attempts to generate images, while the other attempts to determine which images are real and which are not. For instance, if we want to generate dogs, the system must first learn what a dog is. The generator begins synthesizing dogs with the learned characteristics of dogs, and the discriminator's task is to determine whether the generated images are of dogs. The initial iterations are very easy to distinguish, and this information (in the form of gradients on the network weights) is then returned to the generator, instructing it on how to trick the discriminator. This is a continuous feedback loop that runs until the generator produces images that are virtually indistinguishable from real ones, allowing for the creation of extremely realistic images of, say, dogs.

Although the initial GAN architecture was immature and could only generate images of very low quality, enhancements followed quickly. For instance, DCGAN, published in 2015, improved both image quality and image generation. Other aspects of the architecture are improving as well, and generated images are becoming more realistic by the day. However, images are not the only application; a large portion of the data we use in quality assurance is not in image format but in tabular format. How can we transfer GANs trained on images to tabular data? To begin, when we look at images mathematically, we only see pixels and the RGB codes that describe them, which means an image is itself a table of numbers. Researchers therefore began investigating how these GAN architectures might be applied to traditional tabular data. Tabular data, however, imposes its own constraints. The first issue is that we have many distinct data types: not just integers, but also decimals, categories, text, dates, times, names, and so forth. Second, the distributions of these variables are quite dissimilar. Third, many tables contain categorical variables with numerous distinct values, such as names; even if you one-hot encode them, you end up with an enormous number of columns, making it impossible to directly apply a traditional GAN designed for image data to tabular data.
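The third constraint, column explosion under one-hot encoding, is easy to demonstrate; the toy columns below are invented for illustration:

```python
# One-hot encoding turns each distinct category into its own column.
# A column with thousands of distinct values (e.g. names) therefore
# explodes into thousands of mostly-zero columns.

def one_hot(values):
    """Return (column_names, encoded_rows) for a categorical column."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# A small column: 3 distinct values -> 3 columns.
cols, rows = one_hot(["red", "green", "red", "blue"])
print(len(cols))  # 3

# A high-cardinality column: 5,000 distinct names -> 5,000 columns.
names = [f"name_{i}" for i in range(5000)]
cols_big, _ = one_hot(names)
print(len(cols_big))  # 5000
```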

To address the problem of synthesizing tabular data, one of the first GANs for tabular data, TGAN, was introduced. TGAN performs extensive preprocessing before the data can be used in a GAN network; for instance, categorical variables are one-hot encoded. While TGAN performs admirably and frequently outperforms more traditional AI methods of creating synthetic data, such as Bayesian networks, a performance gap between real data and TGAN-generated synthetic data remains.

CTGAN is another GAN model, with significant improvements over TGAN. Among the enhancements is the ability to handle multiple modes; discrete columns, for example, can be extremely unbalanced, which makes modeling difficult. As previously stated, both traditional statistical methods and deep neural networks are prone to misfit such data, which is why CTGAN outperforms Bayesian models on at least 87.5 percent of benchmark datasets, bringing us closer and closer to truly usable synthetic data that looks and feels like your original data.

There are still open issues in the modeling components of GANs. Generating single-table data is relatively simple, but when scaling up across multiple datasets, you must also preserve referential integrity, the connections between different tables and systems. This remains a significant challenge for current GAN architectures, partly because of GANs' high resource requirements, so for these connections we continue to rely on other, more traditional types of modeling.

Another issue with synthetic data in general concerns fields such as names, addresses, and bank account numbers. Why is this a problem? You cannot simply generate an arbitrary bank account number for quality assurance and testing purposes: the number must still be valid. We can address these strictly rule-based columns with the traditional synthetic data techniques we have used for quite some time, adding fictitious names and fictitious bank accounts that we know will work. By combining AI with the traditional use of synthetic data, we can create realistic synthetic datasets and synthetic databases, which is a relatively new and largely unexplored area. We are currently starting small, but as we see the work our clients are doing with larger datasets and databases, we can start scaling to multiple systems as well. This is how GANs can be used in your quality assurance and testing practice: these new systems will alter how test data is generated and used.
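As a sketch of the rule-based side, the snippet below generates random identifiers that pass a checksum. The Luhn check (used by payment card numbers) stands in here for whatever validity rule a real bank account format, such as IBAN, would actually impose:

```python
import random

random.seed(7)  # reproducible output

def luhn_check_digit(digits: str) -> str:
    """Compute the Luhn check digit for a digit string."""
    total = 0
    # Walk right to left; double every second digit, subtracting 9 if it exceeds 9.
    # Positions with i % 2 == 0 are the ones that get doubled once the
    # check digit is appended to the right.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

def is_luhn_valid(number: str) -> bool:
    """Verify a full number (payload + check digit) against the Luhn rule."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def fake_account_number() -> str:
    """Generate a random 16-digit number that passes the Luhn check."""
    body = "".join(str(random.randint(0, 9)) for _ in range(15))
    return body + luhn_check_digit(body)

numbers = [fake_account_number() for _ in range(100)]
```

The generated numbers are fictitious but structurally valid, so downstream systems that validate the checksum will accept them.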

How GANs work

GANs are algorithmic architectures that employ two neural networks in opposition to one another (hence the term "adversarial") in order to generate new, synthetic instances of data that can pass for real data.
To understand GANs, one must first understand how generative algorithms work, which can be accomplished by contrasting them with discriminative algorithms. Discriminative algorithms attempt to classify input data; that is, they predict a label or category to which an instance of data belongs based on its features.

For instance, given all the words in an email, a discriminative algorithm can predict whether the email is spam. Spam is one of the labels, and the bag of words extracted from the email is the input data.
Discriminative algorithms thus associate features with labels; they are only interested in that correlation. One way to think about generative algorithms is that they do the opposite: rather than attempting to predict a label given a set of features, they attempt to predict features given a label.
Using the same example, a generative algorithm attempts to answer the following question: How likely are these features, assuming this email is spam?
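The two directions can be made concrete with a tiny invented corpus: a discriminative model estimates P(spam | feature), while a generative model estimates P(feature | spam). Here the only feature is whether the email contains the word "free":

```python
# Tiny invented corpus: (contains_"free", is_spam) pairs.
emails = [
    (True, True), (True, True), (True, False), (True, False),
    (False, True), (False, False), (False, False),
]

spam = [e for e in emails if e[1]]
with_free = [e for e in emails if e[0]]

# Discriminative direction: given the feature, how likely is the label?
p_spam_given_free = sum(1 for e in with_free if e[1]) / len(with_free)

# Generative direction: given the label, how likely is the feature?
p_free_given_spam = sum(1 for e in spam if e[0]) / len(spam)

print(round(p_spam_given_free, 3))  # 0.5
print(round(p_free_given_spam, 3))  # 0.667
```

The two quantities are genuinely different questions about the same data, which is why the two families of algorithms are useful for different tasks.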

GANs are composed of two neural networks:

  1. Generator – generates new instances of data.
  2. Discriminator – decides whether each instance of data it reviews belongs to the actual training dataset or not.

Consider the generation of data as an example. The Generator creates the fake data instance, which is then fed into the discriminator alongside data from the real, ground-truth dataset.

The discriminator accepts both real and fake data and returns a probability, a number between 0 and 1, with 1 indicating authenticity and 0 indicating falsification.
When the generator can consistently fool the discriminator, we have a good generator model.
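This feedback loop can be sketched end to end in a toy, invented setting: the "real data" clusters around 4.0, the generator is a single parameter, the discriminator is a one-feature logistic classifier, and the gradients are written out by hand:

```python
import math
import random

random.seed(0)

REAL_MEAN = 4.0   # the "real data" distribution (invented toy target)
lr = 0.05         # learning rate for both players
theta = 0.0       # generator: always outputs the value theta
a, b = 1.0, 0.0   # discriminator: D(x) = sigmoid(a*x + b)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

for _ in range(3000):
    x_real = random.gauss(REAL_MEAN, 0.25)  # one real sample
    g = theta                               # one fake sample

    # Discriminator step: push D(real) toward 1 and D(fake) toward 0.
    d_real, d_fake = sigmoid(a * x_real + b), sigmoid(a * g + b)
    grad_a = -(1 - d_real) * x_real + d_fake * g
    grad_b = -(1 - d_real) + d_fake
    a -= lr * grad_a
    b -= lr * grad_b

    # Generator step: push D(fake) toward 1, i.e. fool the discriminator
    # (non-saturating loss -log D(g), so the gradient never vanishes).
    d_fake = sigmoid(a * g + b)
    theta -= lr * (-(1 - d_fake) * a)

print(round(theta, 2))  # theta has moved from 0.0 toward the real mean of 4.0
```

Real GANs replace each single parameter with a deep network and each sample with a minibatch, but the alternating update pattern is the same.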

Figure: The generator model

Now that we understand how GANs work, let's look at how synthetic data can aid in real-time testing. As previously stated in the chapter on the importance of quality data, we require high-quality production data in order to produce the best QA results. We define production quality data as the range of possible data variances in a real-world situation, the data's characteristics, and the data's relationship to other data within the same ecosystem, among other things.

Four key benefits to quality engineering

  1. Generate a Risk-Free and Scalable Data Set – Because the data is synthetic and cannot be reverse engineered, GANs enable GDPR-compliant data sharing within and across organizations.

  2. Exponential opportunities – The generated dataset is of sufficient quality to be used for a variety of other purposes as well; i.e., it is interchangeable with real data and can be used to unlock and accelerate numerous complex AI solutions.

  3. High-quality and rapid data generation – Using GANs, you can generate a large volume of data in a matter of minutes.

  4. ETL and database agnostic – Because the framework is ETL and database agnostic, it can be used to train and generate synthetic data with any database and infrastructure. Several of the implementations we've completed included Oracle, SQL, SAP, MongoDB, and mainframe.

In chapter 4 of this section, we will demonstrate how to use GANs to create a synthetic data platform.

About the author

Mark Oost

Mark has over 10 years of experience in AI and analytics. Before joining the group, Mark was responsible for AI within Sogeti Netherlands, where he led the development of the team and business as well as the AI strategy. Before that, he worked as a (lead) data scientist for Sogeti, Teradata, and Experian, serving international clients across multiple markets on technologies around AI, deep learning, and machine learning.

Arun Sahu

An experienced technology specialist focusing on AI and data analytics, Arun leads the AI CoE team at Sogeti India and is part of the Sogeti global AI team. He has designed and developed various AI solutions and offerings with his team. His responsibilities include working with AI leads from various countries to support sales, proposals, pre-sales, solutioning, and implementation.

About Sogeti & Capgemini

Part of the Capgemini Group, Sogeti operates in more than 100 locations globally. Working closely with clients and partners to take full advantage of the opportunities of technology, Sogeti combines agility and speed of implementation to tailor innovative future-focused solutions in Digital Assurance and Testing, Cloud and Cybersecurity, all fueled by AI and automation. With its hands-on ‘value in the making’ approach and passion for technology, Sogeti helps organizations implement their digital journeys at speed.

Visit us at www.sogeti.com

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization of 270,000 team members in nearly 50 countries. With its strong 50-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported global revenues of €16 billion in 2020.
Get the Future You Want | www.capgemini.com