In this chapter, we outline the broad spectrum of approaches to generating synthetic test data.
“Sample data,” as the name suggests, acts as a placeholder for real data during development and testing. Using sample data ensures that all of the data fields in a database are occupied, so that a program doesn’t return an error when querying the database.
Sample data is usually nothing more than data or records generated on the fly by developers in a hurry. However, sample data can also refer to more carefully designed test inputs, e.g., inputs meant to produce a given response or exercise a particular software feature.
Ad hoc sample data is not ideal for large-scale testing because it can lead to unintended side effects and bugs unless it has been thoroughly evaluated and documented.
Rule-based data is similar to sample data, but it is generated more consistently, usually by an automated system, and on a larger scale. The purpose of rule-based data is to simulate real-world data in a controlled, rule-driven manner. For example, test data can be generated for fields corresponding to a person’s first and last names, testing what happens for inputs such as:
- Very long or very short strings
- Strings with accented or special characters
- Blank strings
- Strings with numerical values
- Strings with reserved words (e.g., “NULL”)
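The input categories above can be sketched as a small rule-based generator. This is a minimal illustration, not a reference implementation: the function name `name_edge_cases`, its parameters, and the specific example values are all assumptions chosen for this sketch.

```python
import random
import string


def name_edge_cases(seed: int = 0, long_length: int = 256) -> dict[str, str]:
    """Return one example name value per rule (hypothetical helper).

    A fixed seed keeps the generated values reproducible between test runs.
    """
    rng = random.Random(seed)
    return {
        # Very long and very short strings
        "very_long": "".join(rng.choices(string.ascii_letters, k=long_length)),
        "very_short": rng.choice(string.ascii_letters),
        # Accented and special characters
        "accented": "Zoë Müller",
        "special_chars": "O'Brien-Smith;",
        # Blank string
        "blank": "",
        # Numerical values
        "numeric": "12345",
        # Reserved word
        "reserved_word": "NULL",
    }


if __name__ == "__main__":
    for rule, value in name_edge_cases().items():
        print(f"{rule}: {value!r}")
```

Each generated value can then be fed to the form, API, or database layer under test to see how the name fields handle it.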
Rule-based data can be useful throughout the software development process, from coding to testing and quality assurance. In most cases, rule-based data is generated in-house using various rules, scripts, and/or open-source data generation libraries.
The more complex the software and production data, the harder it is to mock data that retains referential integrity across tables.
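To make the referential-integrity concern concrete, here is a minimal sketch with two hypothetical tables, `customers` and `orders`. The table names, field names, and helper functions are assumptions for illustration. The key idea is that child rows draw their foreign keys only from parent rows that were actually generated.

```python
import random


def make_customers(n: int) -> list[dict]:
    """Generate parent rows with unique primary keys (hypothetical schema)."""
    return [{"customer_id": i, "name": f"Customer {i}"} for i in range(1, n + 1)]


def make_orders(n: int, customers: list[dict], rng: random.Random) -> list[dict]:
    """Generate child rows whose customer_id always points at an existing customer."""
    valid_ids = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": i,
            "customer_id": rng.choice(valid_ids),  # foreign key drawn from real parents
            "amount_cents": rng.randint(100, 100_000),
        }
        for i in range(1, n + 1)
    ]


if __name__ == "__main__":
    rng = random.Random(42)
    customers = make_customers(5)
    orders = make_orders(20, customers, rng)
    known_ids = {c["customer_id"] for c in customers}
    # Every order must reference a customer that exists.
    print(all(o["customer_id"] in known_ids for o in orders))
```

With two tables this is straightforward. The difficulty described above comes from repeating the same discipline across many tables, with generation ordered so that parents always exist before their children.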
Real data that has been altered or masked is called “anonymized.” The purpose is to retain the character of the real dataset while allowing you to use it without exposing personally identifiable information (PII) or protected health information (PHI).
More specifically, anonymization involves replacing real data with altered content via one-to-one data transformation. This new content often preserves the utility of the original data (e.g., real names are replaced with fictitious names), but it can also simply scramble the original data with random characters or null it out.
Data anonymization can be performed in-house, but doing so places an immense responsibility on the development team to ensure that it’s done safely. Increasingly, companies are turning to third-party vendors to provide proven software solutions that offer privacy guarantees.
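One way to picture a one-to-one transformation is a lookup that assigns each distinct real name its own fictitious name and reuses that assignment every time the name appears. The sketch below assumes a tiny hard-coded pool of pseudonyms and a hypothetical `NameMasker` class. It illustrates the mapping idea only and offers no privacy guarantee. In particular, the mapping table itself would be as sensitive as the original data.

```python
from itertools import product

# Small illustrative pool of fictitious name parts (assumed values).
FIRST_NAMES = ["Alex", "Blake", "Casey", "Devon"]
LAST_NAMES = ["Ash", "Birch", "Cedar", "Dale"]


class NameMasker:
    """Map each distinct real name to a distinct fictitious name (hypothetical helper)."""

    def __init__(self) -> None:
        self._mapping: dict[str, str] = {}
        # Each pseudonym is handed out at most once, keeping the mapping one-to-one.
        self._pool = (f"{first} {last}" for first, last in product(FIRST_NAMES, LAST_NAMES))

    def mask(self, real_name: str) -> str:
        if real_name not in self._mapping:
            try:
                self._mapping[real_name] = next(self._pool)
            except StopIteration:
                raise RuntimeError("pseudonym pool exhausted") from None
        return self._mapping[real_name]


if __name__ == "__main__":
    masker = NameMasker()
    for name in ["Jane Roe", "John Doe", "Jane Roe"]:
        print(masker.mask(name))
```

Because repeated values map to the same replacement, joins and group-by operations on the masked column behave as they would on the original column.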