State of Artificial Intelligence applied to Quality Engineering 2021-2022
Section 5: Manage data

Chapter 2 by Sauce Labs

An overview of synthetic data generation methods

Business ●●●●○
Technical ●○○○○


The use of AI and ML, as well as the focus and investment in new test data generation capabilities, has the potential to be transformative in the field of software testing.

Test data requirements for testing modern applications have become extremely complex due to the high level of dependency on datasets across applications.

Synthetic data is information that is artificially manufactured rather than generated by real-world events. Synthetic data is created algorithmically, and it is used as a stand-in for test datasets of production or operational data, to validate mathematical models and, increasingly, to train machine learning models.

Synthetic test data generators to date have focused on simpler test data generation needs. To build a synthetic test data generator that produces internally consistent data and works across complex application scenarios, we need the following: (1) the ability to recursively analyze production datasets to generate relational data models that are internally consistent; (2) the ability to create large volumes of synthetic data from the recursively generated data model and sample datasets; and (3) the ability to generate positive, negative, and noisy data while supporting privacy compliance rules (e.g., GDPR).
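
As a concrete illustration of requirement (3), the short Python sketch below generates positive, negative, and noisy variants of a single email field. It is a minimal sketch with illustrative rules and names, not a complete privacy or compliance framework.

```python
# A minimal sketch of requirement (3): positive, negative, and noisy variants
# for a single "email" field. The variant rules are illustrative assumptions,
# not a complete privacy or compliance framework.
import random

def positive_variant() -> str:
    """A well-formed value the application should accept."""
    return f"user{random.randint(1, 999)}@example.com"

def negative_variant() -> str:
    """A malformed value the application should reject cleanly."""
    return random.choice(["not-an-email", "@example.com", ""])

def noisy_variant() -> str:
    """A near-valid value with realistic noise (whitespace, casing, stray characters)."""
    base = positive_variant()
    return random.choice([f"  {base} ", base.upper(), base.replace("@", "@@")])

if __name__ == "__main__":
    print(positive_variant(), "|", negative_variant(), "|", noisy_variant())
```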

Introduction to Synthetic Test Data Generation

Companies increasingly need to address the challenge of keeping up with the accelerated pace of development as the bar simultaneously continues to rise for higher-quality code and absolute data privacy. A massive change is underway in the form of synthetic test data that can be generated on-demand, as an alternative to the traditional approach of subsetting and masking production test data. To better understand the differences, we outline the two common approaches to test data generation below.

 

Production test data

Production test data is a copy of a production database that has been masked, or obfuscated, and subsetted to represent a portion of the database that is relevant to a test case. Production test data is frequently accompanied by a test data management (TDM) system to prepare, control and use the data.

Commercial TDM systems can be expensive, costing upwards of hundreds of thousands of dollars for a typical enterprise deployment. Many organizations have chosen to develop their own in-house TDM systems and processes to save money and to provide a solution that more precisely meets their needs.

Synthetic test data

Synthetic test data does not use any actual data from the production database. It is machine-generated data, built from the data model and from patterns identified automatically in a sample of the production data. For the purpose of this article, we’ll assume synthetic test data is generated automatically by a synthetic test data generation engine.

Synthetic test data engines generate synthetic test data on-demand and according to a test data scenario that represents the needs of a particular test case. Synthetic test data generation eliminates the need for traditional TDM functions, such as masking and subsetting, because test data can be generated on-demand and without sensitive customer information.

As a result, test data generation (TDG) systems can be decentralized and operate through a self-service model.
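
As a minimal illustration of on-demand generation, the Python sketch below uses the open-source Faker library to produce plausible customer records for a hypothetical test data scenario; the schema and field names are assumptions for illustration, not the API of any particular engine.

```python
# A minimal sketch of on-demand synthetic record generation for a hypothetical
# "customer" test data scenario, using the open-source Faker library. The schema
# and field names are assumptions, not the API of any particular engine.
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible data across test runs

def generate_customer() -> dict:
    """Generate one plausible customer record containing no production data."""
    first, last = fake.first_name(), fake.last_name()
    return {
        "customer_id": fake.uuid4(),
        "first_name": first,
        "last_name": last,
        # Derive the email from the generated name so the record is internally consistent.
        "email": f"{first}.{last}@example.com".lower(),
        "signup_date": fake.date_between(start_date="-2y", end_date="today").isoformat(),
    }

if __name__ == "__main__":
    # Generate exactly as many records as the test scenario calls for.
    for record in (generate_customer() for _ in range(5)):
        print(record)
```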

Synthetic test data generation: Broad spectrum of choices and approaches

In this chapter, we will outline the broad spectrum of approaches associated with generating synthetic test data.

Sample Data

“Sample data,” as the name suggests, acts as a placeholder for real data during development and testing. Using sample data ensures that all of the data fields in a database are occupied, so that a program doesn’t return an error when querying the database.

Sample data is usually nothing more than records generated on the fly by developers in a hurry. However, it can also refer to more carefully designed data, e.g., data crafted to produce a given response or to test a particular software feature.

This is not ideal for large-scale testing, because sample data can lead to unintended impacts and bugs unless it has been thoroughly evaluated and documented.

 

Rule-based Data

Rule-based data is similar to sample data, but it is generated more consistently, usually by an automated system, and on a larger scale. The purpose of rule-based data is to simulate real-world data in a controlled, rule-driven manner. For example, test data can be generated for fields corresponding to a person’s first and last names, testing what happens for inputs such as:

  • Very long or very short strings
  • Strings with accented or special characters
  • Blank strings
  • Strings with numerical values
  • Strings with reserved words (e.g., “NULL”)

Rule-based data can be useful throughout the software development process, from coding to testing and quality assurance. In most cases, rule-based data is generated in-house using various rules, scripts, and/or open-source data generation libraries.
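
As a minimal sketch of this approach, the Python example below encodes the name-field edge cases listed above as explicit rules; the rule names and sample values are illustrative assumptions.

```python
# A minimal sketch of rule-based generation for the name-field edge cases listed
# above. The rules and sample values are illustrative assumptions.
import random

RULES = {
    "very_long": lambda: "A" * 256,
    "very_short": lambda: "B",
    "accented": lambda: random.choice(["Zoë", "José", "Łukasz", "Nguyễn"]),
    "blank": lambda: "",
    "numeric": lambda: str(random.randint(0, 99999)),
    "reserved_word": lambda: random.choice(["NULL", "TRUE", "SELECT"]),
}

def generate_name_inputs(per_rule: int = 3) -> list[dict]:
    """Produce labelled first/last name inputs, one batch per rule."""
    cases = []
    for rule_name, make_value in RULES.items():
        for _ in range(per_rule):
            cases.append({"rule": rule_name, "first_name": make_value(), "last_name": make_value()})
    return cases

if __name__ == "__main__":
    for case in generate_name_inputs(per_rule=1):
        print(case)
```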

The more complex the software and production data, the harder it is to mock data that retains referential integrity across tables.

 

Anonymized Data

Real data that has been altered or masked is called “anonymized”. The purpose is to retain the character of the real dataset while allowing you to use it without exposing personally identifiable information (PII) or protected health information (PHI).

More specifically, anonymization involves replacing real data with altered content via one-to-one data transformation. This new content often preserves the utility of the original data (e.g. real names are replaced with fictitious names), but it can also simply scramble or null out the original data with random characters.
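
A minimal Python sketch of such a one-to-one transformation is shown below, assuming a toy record layout with illustrative field names and using the open-source Faker library. A real name maps deterministically to the same fictitious name, so joins across tables still line up, while free-text fields are simply nulled out.

```python
# A minimal sketch of one-to-one anonymization on a toy record layout, using the
# open-source Faker library. Field names are illustrative assumptions.
import hashlib
from faker import Faker

def pseudonymize_name(real_name: str) -> str:
    """Map a real name to a stable fictitious one (same input, same output)."""
    # Seed a Faker instance from a hash of the original value so the mapping is
    # consistent everywhere without storing a lookup table of real data.
    seed = int(hashlib.sha256(real_name.encode()).hexdigest(), 16) % (2**32)
    fake = Faker()
    fake.seed_instance(seed)
    return fake.name()

def anonymize(record: dict) -> dict:
    return {
        "customer_id": record["customer_id"],       # non-identifying key, kept as-is
        "name": pseudonymize_name(record["name"]),  # one-to-one transformation
        "notes": None,                              # scramble or null out free text
    }

if __name__ == "__main__":
    original = {"customer_id": 17, "name": "Jane Smith", "notes": "Called about invoice 123"}
    print(anonymize(original))
    print(anonymize(original))  # identical output: the mapping is deterministic
```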

Data anonymization can be performed in-house but it places an immense responsibility on the in-house development team to ensure that it’s done safely. Increasingly, companies are turning to third-party vendors to provide proven software solutions that offer privacy guarantees.

Subset Data


Subsetting creates a dataset sized specifically to your needs, environments, and simulations, without unnecessary data, and it is often constructed to target specific bugs or use cases. Data subsetting also helps reduce the database’s footprint, improving speed during development and testing.

Data subsetting must be handled with care to form a consistent, coherent subset that is representative of the larger dataset without becoming unwieldy. Given the growing complexity of today’s ecosystems, this is fast becoming a significant challenge. At the same time, demand for subsetting is increasing with the shift toward microservices and the need for developers to have data that fits on their laptops so they can work in local environments.

On its own, subsetting does not protect the data contained in the subset; it simply minimizes the amount of data at risk of exposure.
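
The pandas sketch below illustrates the basic idea on a toy customers/orders schema: sample a fraction of customers, then keep only the orders that reference them, so nothing in the subset is orphaned. The table and column names are assumptions for illustration.

```python
# A minimal sketch of referentially consistent subsetting with pandas on a toy
# customers/orders schema. Table and column names are illustrative assumptions.
import pandas as pd

def subset(customers: pd.DataFrame, orders: pd.DataFrame,
           fraction: float = 0.1, seed: int = 7):
    """Sample a fraction of customers, then keep only the orders that reference them."""
    customer_subset = customers.sample(frac=fraction, random_state=seed)
    # Preserve referential integrity: every order in the subset points at a
    # customer that is also in the subset.
    order_subset = orders[orders["customer_id"].isin(customer_subset["customer_id"])]
    return customer_subset, order_subset

if __name__ == "__main__":
    customers = pd.DataFrame({"customer_id": range(1, 101), "region": ["EU", "US"] * 50})
    orders = pd.DataFrame({"order_id": range(1, 501),
                           "customer_id": [i % 100 + 1 for i in range(500)]})
    c, o = subset(customers, orders)
    orphans = (~o["customer_id"].isin(c["customer_id"])).sum()
    print(f"{len(c)} customers, {len(o)} orders, {orphans} orphaned orders")
```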


Large Volume Test Data


Automated production of generated data in large batches is critical for specific testing needs within the test automation approach. It is most useful when you care less about the data itself and more about the volume and velocity of data as it passes through the software. This is particularly useful in simulating traffic spikes or user onboarding campaigns.

For example, data bursts can be used in the following testing and QA scenarios:

Performance testing: Testing various aspects of a system’s performance, e.g. its speed, reliability, scalability, response time, stability, and resource usage (including CPU, memory, and storage).

Load testing: Stress testing a system by seeing how its performance changes under increasingly heavy levels of load (e.g. increasing the number of simultaneous users or requests).

High availability testing: Databases should always be tested to handle bursts in incoming requests during high-traffic scenarios, but testing for high-data scenarios is also important.
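
A minimal Python sketch of this kind of high-volume generation is shown below; the event fields and the commented-out delivery hook are hypothetical, since the point here is volume and velocity rather than field content.

```python
# A minimal sketch of high-volume synthetic event generation for load-style
# testing. The event fields and the commented-out delivery hook are hypothetical.
import itertools
import random
import time

def event_stream():
    """Yield an endless stream of lightweight synthetic events."""
    for i in itertools.count():
        yield {"event_id": i, "user_id": random.randint(1, 10_000), "ts": time.time()}

def generate_burst(events_per_second: int, seconds: int):
    """Emit fixed-size batches to simulate a traffic spike against the system under test."""
    stream = event_stream()
    for _ in range(seconds):
        batch = list(itertools.islice(stream, events_per_second))
        # send_to_system_under_test(batch)  # hypothetical hook for the load driver
        yield len(batch)

if __name__ == "__main__":
    total = sum(generate_burst(events_per_second=50_000, seconds=3))
    print(f"generated {total} synthetic events")
```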

The Future of Synthetic Test Data Generation

Imagine a scenario where, at the click of a button, the test analyst can generate synthetic test data for all testing needs across the entire enterprise application landscape. For instance, you could instantly generate all the test data required to test any enterprise process across a landscape of custom and packaged applications.

This is the future we envision for synthetic test data generation. To realize it, we require three critical AI-driven capabilities for test data generation:

  1. Grammar-based synthetic test data generation
    1. Data objects (focused on the standard data elements of every application)
    2. Defined rulesets
      1. Rulesets are defined by users
      2. Machine learning can be incorporated to generate and modify rulesets based on real-world observations

  2. Complex relational synthetic test data generation
    1. The ability to have generators that create data for multiple input fields simultaneously (e.g., an internally consistent address/city/state/zip generator; a minimal sketch follows this list), which goes against the typical 1:1 mapping of input field to behavior action used so far.
    2. The ability to create “pick one from a set” generators from the list of option values associated with a pick list.
    3. The ability to adapt data generation based on values read from a website.
    4. The ability to automatically select a generator for a given data field, for example, automatically detecting that a field is requesting an address and connecting an address generator.

  3. The Synthetic Enterprise
    1. A combination of the above two building blocks
    2. Pre-generated synthetic datasets for common enterprise processes such as Quote to Cash, Procure to Pay, and Hire to Retire
    3. Synthetic datasets (millions of rows) for common industries (e.g., retail, telecom, industrials)
    4. Synthetic data coverage across global markets
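
To make items 2.1 and 2.2 above concrete, the short Python sketch below hand-codes an internally consistent address generator and a “pick one from a set” generator; the reference locations and option values are illustrative stand-ins for what a real engine would learn from data.

```python
# A minimal sketch of items 2.1 and 2.2: an internally consistent address
# generator and a "pick one from a set" generator. The reference locations and
# option values are illustrative stand-ins for what a real engine would learn.
import random

# City/state/zip triples that agree with one another (illustrative values only).
LOCATIONS = [
    {"city": "Austin", "state": "TX", "zip": "73301"},
    {"city": "Denver", "state": "CO", "zip": "80202"},
    {"city": "Boston", "state": "MA", "zip": "02108"},
]

def address_generator() -> dict:
    """Generate street, city, state, and zip together so the fields stay consistent."""
    location = random.choice(LOCATIONS)
    return {"street": f"{random.randint(1, 9999)} Main St", **location}

def pick_one(options: list[str]) -> str:
    """A 'pick one from a set' generator built from a pick list's option values."""
    return random.choice(options)

if __name__ == "__main__":
    print(address_generator())
    print(pick_one(["Bronze", "Silver", "Gold", "Platinum"]))
```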

The use of AI and ML, as well as the focus and investment in new test data generation capabilities, has the potential to be transformative in the field of software testing. The ability of AI to improve itself using synthetic data makes it a uniquely powerful technology. Synthesizing data is the key to enhancing the quality and quantity of robust training data for advanced models and simulations.

The opportunity for synthetic data will extend beyond its use in current AI applications to industries including agriculture, autonomous vehicles, healthcare, robotics, and more. Soon, a highly accurate digital mirror of reality will exist, built efficiently using synthetic data.

About the authors

Jim Whitehead

Jim Whitehead is Chief Scientist with Sauce Labs, and a Professor of Computational Media with the University of California, Santa Cruz. He brings over 25 years of experience as a software engineering and artificial intelligence researcher to his role at Sauce Labs. In software engineering, he performed early research using machine learning techniques to predict whether commits are buggy (just in time bug prediction). His computer games research focuses on artificial intelligence techniques for creating computer game content (procedural content generation). The unique synergies between computer games and software engineering research drive many research insights at Sauce Labs.

Ram Shanmugam

Ram Shanmugam currently leads the low-code automation business at Sauce Labs. Previously, he was the founder and CEO of AutonomIQ, an AI-based low-code automation platform that was acquired by Sauce Labs. Before founding AutonomIQ, Ram was the co-founder, CEO, and President of appOrbit, a venture-backed leader in Kubernetes and cloud orchestration, and he has held technology and product leadership roles at companies such as Cisco, HP, and SunGard. In addition to his professional experience, Ram is an active IEEE contributor in the area of AI and machine learning research and is recognized as a technology pioneer by the World Economic Forum.

About Sauce Labs

Sauce Labs is the company enterprises trust to deliver digital confidence. More than 3 billion tests have been run on the Sauce Labs Continuous Testing Cloud, the most comprehensive and trusted testing platform in the world.

Sauce Labs delivers a 360-degree view of a customer’s application experience, helping businesses improve the quality of their user experience by ensuring that web and mobile applications look, function, and perform exactly as they should on every browser, OS, and device, every single time.

Sauce Labs enables organizations to increase revenue and grow their digital business by creating new routes to market, protecting their brand from the risks of a poor user experience, and delivering better products to market, faster.

Visit us at saucelabs.com

 

 
