State of Artificial Intelligence applied to Quality Engineering 2021-2022
Section 6: Perform

Chapter 1 by Tricentis

Representative performance tests based on real-life production traffic

Business ●●○○○
Technical ●●●○○


AI is unquestionably a technology that can assist us in performance engineering. In this chapter, we examine how to leverage AI to assess production traffic as a data source, and more.

Software development and delivery cycles are becoming shorter and more frequent. Application environments are more complex than ever, with SaaS, microservices, legacy monolithic applications, and packaged enterprise applications such as SAP all highly connected and coexisting. And performance has become critical: users and customers expect a seamless digital experience on any device, at any time, and from any location, making application performance a key differentiator in the competitive landscape. Nevertheless, far too frequently, validating performance via realistic testing has become a bottleneck.

Performance engineering (of which performance testing is an integral part) has traditionally been a practice that requires numerous manual steps and hard-won expertise with a steep learning curve:

  • Coordinating with business experts and gathering data to build the strategy that tests the right things, under the right load
  • Actually building robust and maintainable test scripts
  • Configuring the workload of each test
  • Running tests, either manually or programmatically
  • And, of course, analyzing results and working with software engineering teams to resolve identified issues

While modern solutions do leverage automation to accelerate many aspects of performance and load testing, the approach to answering the fundamental question “What is a representative test that reflects real-world behavior and traffic?” has not changed in the past 25 years. It continues to follow the legacy process established by the market's pioneers. Because it is a highly manual and time-consuming endeavor, the majority of performance tests do not accurately reflect real-world conditions.

To provide testers with representative synthetic performance scenario elements as fast as possible, a solution needs to synthesize and mimic as closely as possible the behaviors of real users on production systems. This could be accomplished manually by collecting and digesting the various available datasets, but AI/ML can automate the analysis of production traffic to scope the scenario elements that will be used in testing.

Tricentis NeoLoad has developed a prototype solution, codenamed RACHAEL, which we are validating with various enterprises. It leverages AI to assess production traffic as a data source; detect different populations and their load curves, pacing, and think times; isolate signal from noise; extract variables, dynamic parameters, and pre-identified user paths; and more.

The problem with the legacy approach to representativeness

Performance testing is, of course, all about understanding how applications and systems behave under realistic conditions. What does performance look like when various numbers of users are doing different things concurrently? Can the application handle peak loads and traffic spikes (Black Friday, fiscal year-end, product launches)? What are users actually doing most often — what test scenarios represent what’s happening in real life?

But the methodology for figuring out what is “realistic” — determining representativeness — has remained basically the same for decades. It is still a highly manual, time-consuming, and inexact process that depends heavily on the knowledge held by a handful of experts, on educated guesses, and on painstaking analysis of disparate data sources stitched together by hand.


We interviewed businesses of various sizes and industries, with varying degrees of performance engineering sophistication, and discovered that a systematic representativeness assessment between performance tests and the load mix/usage in production is performed rarely, if ever.

Today, most performance tests are built from:

  • Usage scenarios that re-create what are thought to be typical user paths. Enterprises define these scenarios in a variety of ways, from relying on their performance experts' experience and tribal knowledge, to leveraging insights from the application's business experts, to a more methodical but labor-intensive approach that involves collecting data from various sources (logs, JavaScript trackers such as Google Analytics, and APM tools) and manually analyzing it for common user paths (a rough sketch of this kind of log analysis follows this list).

  • Scenarios built on top of converted functional test plans. Repurposing functional tests as performance tests has become simple and straightforward. For example, turning Tricentis Tosca functional tests into NeoLoad performance tests can be accomplished with just one click. While this significantly reduces the time required to write the performance test script — and Tosca even uses AI-powered smart impact analysis (LiveCompare) to prioritize what to test functionally — we're still left with the question of what constitutes a representative (real-world) load in production.

  • Empirical load or stress test requirements. Tests are built based on quantified requirements set forth in SLAs. Often these goals appear to be precisely defined metrics — “the system must be able to handle 500 transactions per second” — but where do these benchmarks come from? What if users are performing other transactions at the same time? Can the system still handle 500 transactions per second? Empirical load requirements do not guarantee that the system will be able to handle the load in production, because they do not account for other factors or for how users actually use the system.
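
To make the first of these approaches concrete, here is a rough sketch, in Python, of the kind of log analysis teams perform by hand to find common user paths. The log format, column order, and file name are assumptions made purely for illustration.

    # Illustrative sketch only: the log format (timestamp, session_id, url per row)
    # and file name are assumptions, not an artifact of any real NeoLoad workflow.
    import csv
    from collections import Counter, defaultdict

    def most_common_paths(log_file, top_n=5):
        """Group requests by session and count the most frequent page sequences."""
        sessions = defaultdict(list)
        with open(log_file, newline="") as f:
            for timestamp, session_id, url in csv.reader(f):
                sessions[session_id].append((timestamp, url))

        paths = Counter()
        for hits in sessions.values():
            # ISO-8601 timestamps sort lexicographically, so a string sort orders by time.
            ordered = [url for _, url in sorted(hits)]
            paths[" > ".join(ordered)] += 1          # one path per session
        return paths.most_common(top_n)

    # e.g. most_common_paths("access_log.csv") might return
    # [("/login > /search > /product > /checkout", 812), ...]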
Figure: General process

 

Our approach: Overarching principles

When Tricentis NeoLoad evaluated the potential for an AI-augmented solution, we examined various facets of performance engineering (performance test modeling, design/scripting, test execution, and analysis) to determine where AI could be most beneficial. We wanted to concentrate our efforts on the areas that would have the greatest impact on performance engineers. Where could AI assist testing teams in eliminating manual work to enable performance testing to run more quickly while maintaining a higher level of quality?

NeoLoad’s existing capabilities (low-code/no-code test design, automated script updates, dynamic infrastructure, native integration with OpenShift and Kubernetes-based orchestrators) already make the design/scripting and test execution parts of the process very fast and easy. And a primary objective of our AI approach was to carve out more time for performance engineers to spend on analysis (and/or be able to handle more tests in less time). That’s where human experts provide the most value, where they can do things that a machine simply cannot. Which brings up an important point: AI is not intended to replace human performance engineers but rather to augment their capabilities. AI enables “the machine” to do what it does best — absorb huge amounts of information and make sense of it faster and more thoroughly than is humanly possible. Let AI handle the grunt work so humans can do the brain work. Which brings us to performance modelling.


Performance (or workload) modelling defines how and what to test to ensure a realistic test scenario that properly covers risk. It is the most critical but, as discussed, most painstaking and time-consuming part of designing the test. Enterprises have told us that in a typical non-CI/CD project, about 70% of a performance testing schedule is consumed by getting a representative working scenario. What if we could cut that time in half?

Consequently, we approached development of our AI-assisted performance modelling prototype to empower testers to accomplish two overarching goals:

  • Trust the test. The AI must guarantee that synthetic scenario behaviors are as representative as possible.

  • Design the test faster. The AI must automatically provide the scenario elements (user paths, load policies, dynamic parameters) that today are the most labor-intensive (and therefore costliest) to build.
Figure: Relation between accuracy and trust

 

We broke down these two high-level goals into a few smaller objectives:

  • Synthesize actual user behavior on production systems into actionable performance scenario elements
  • Easily put in the hands of all users — even testing “beginners” — information that is today accessible only to experts (think time, pacing)
  • Reduce time spent on collecting and digesting the different available datasets
  • Automate the analysis of production traffic to precisely scope the scenario elements to be used in testing

Further, we felt it was crucial that the AI solution be self-explanatory. Nobody should need to learn AI or become some kind of AI expert to use it. It should be simple and straightforward so even performance testing non-experts can get value from the solution. A solution that saves time down the road but has a high learning curve won’t work. Performance teams already have too little time to get everything that’s demanded of them done.

Right now, in the prototype validation phase, RACHAEL is self-hosted and self-controlled (primarily to address security concerns of the validating enterprises), but we foresee it becoming fully integrated into day-to-day performance engineering workflows. It would be a SaaS-based centralized system acting as an engine to fuel performance testing automation. An example would be the way Tricentis uses its Vision AI to power analysis directly within its Tosca functional testing platform.

We also envision the solution becoming contributive. The analysis engine will “learn” from user corrections. For instance, if a performance engineer determines that something should be classified as a user-input variable rather than a dynamic parameter, the AI will absorb that correction going forward, becoming smarter over time and further enriching the analysis.

A final issue we addressed when talking with the enterprises who are validating the solution revolves around business value. The technical people were naturally concerned about how intrusive the AI would be to their systems. What we seek to validate is that the level of value provided far outweighs the introduction of AI into existing systems. For context, we brought up application performance monitoring (APM) as a comparison. While APM does intrude somewhat into existing systems, the value delivered makes that intrusion more than worth it. RACHAEL provides similar business value yet is even less intrusive than APM.

In sum, we developed a prototype to leverage data and AI-based feedback from production to quickly get representative synthetic performance scenario elements with a secure, unobtrusive solution.

RACHAEL under the hood

In building out RACHAEL, we wanted the solution to be able to deliver automatically the crucial definitions, or elements, needed for representative performance modelling:

  • The load testing scenarios that would need to be designed
  • Different populations and their typical user path
  • The think time that would need to be applied
  • The pacing
  • The load policy — the typical curve it follows

In any AI project, the most important element is the quality of your data. The data needs to have enough details to apply the right algorithm to parse, group the traffic, and generate the user flows identified in the production traffic.
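
As a simplified illustration of what “grouping the traffic” can mean, the sketch below clusters reassembled sessions into populations based on their request mix. The feature choice, the use of k-means, and the number of populations are assumptions made for the example; they do not describe the algorithm RACHAEL actually uses.

    # Minimal sketch, assuming sessions have already been reassembled from raw traffic.
    # Feature choice (request mix per URL pattern) and k-means are illustrative assumptions.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction import DictVectorizer

    def group_into_populations(sessions, n_populations=3):
        """sessions: list of dicts mapping URL pattern -> request count for one session."""
        features = DictVectorizer(sparse=False).fit_transform(sessions)  # one row per session
        # Normalize so populations are defined by what users do, not by how much they do it.
        features = features / np.maximum(features.sum(axis=1, keepdims=True), 1)
        labels = KMeans(n_clusters=n_populations, n_init=10, random_state=0).fit_predict(features)
        return labels  # labels[i] is the population assigned to sessions[i]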

Nowadays there is a ton of data available that can help create a realistic model of real-world scenarios, but it’s spread across multiple data sources (logs, APM, marketing tracking systems, etc.). Sometimes engineers have access to this data, sometimes not. Different scenarios call for different data elements. Either way, it can take weeks to retrieve and cobble together the data that “fits” the scenario you are testing.

Protocol-based load testing requires the exact HTTP traffic in order to generate a script with the following level of detail (a sketch of such a request record follows the list):

  • Hostname
  • Port
  • URL
  • URL parameters
  • Body of POST requests
  • HTTP headers
  • Cookies
  • Session ID
  • Etc.
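
The sketch below shows one way to represent this level of detail for each recorded request. The field names are illustrative and do not reflect NeoLoad's internal data model.

    # Illustrative record of a captured request; field names are assumptions for this sketch.
    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class RecordedRequest:
        hostname: str                              # e.g. "shop.example.com"
        port: int                                  # e.g. 443
        url: str                                   # path, e.g. "/cart/add"
        url_params: Dict[str, str] = field(default_factory=dict)
        body: Optional[str] = None                 # body of POST requests
        headers: Dict[str, str] = field(default_factory=dict)
        cookies: Dict[str, str] = field(default_factory=dict)
        session_id: Optional[str] = None           # correlates requests into one user session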

For various reasons (storage, data privacy, security), most data sources collect only some of the data necessary to create a representative performance model. That’s the biggest challenge: to get the right data source that has enough details. How to extract user traffic data from production for a specific period of time to generate use cases for the test? How best to capture and leverage user experience data to define precise settings — load policy, think time, pacing — for the test? How to identify realistic locations and actual network conditions?

In a nutshell: how could we use AI to turn actual network traffic into a data source?

Figure: RACHAEL mirroring production data

 

NeoLoad evaluated several different AI approaches to see what best fit our goals. All AI solutions are really a combination of different techniques, technologies, and methodologies — there is no universal magical system. An approach that met our overarching-principles criteria was traffic mirroring.

Traffic mirroring is a technique whereby live production traffic is copied and sent to two places: the original production servers and a testing environment. Cloud providers like AWS and Azure already offer this option, and we validated this technology with large-scale users. Essentially, it allows you to work with real-life traffic without real-life impact.
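
For illustration, the sketch below configures AWS VPC Traffic Mirroring through boto3, the kind of facility referred to above. The resource IDs are placeholders, filter rules would still need to be added before any packets are actually copied, and this code is not part of RACHAEL itself.

    # Rough sketch of AWS VPC Traffic Mirroring via boto3; all resource IDs are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="eu-west-1")

    # Destination for the mirrored packets (e.g. the ENI of a capture/analysis instance).
    target = ec2.create_traffic_mirror_target(
        NetworkInterfaceId="eni-0123456789abcdef0",
        Description="Capture target for traffic analysis",
    )

    # Which traffic to copy; rules (ports, directions, CIDRs) must be added to this filter.
    mirror_filter = ec2.create_traffic_mirror_filter(
        Description="Mirror application traffic",
    )

    # Attach the mirror session to the production instance's network interface.
    session = ec2.create_traffic_mirror_session(
        NetworkInterfaceId="eni-0fedcba9876543210",  # production ENI (placeholder)
        TrafficMirrorTargetId=target["TrafficMirrorTarget"]["TrafficMirrorTargetId"],
        TrafficMirrorFilterId=mirror_filter["TrafficMirrorFilter"]["TrafficMirrorFilterId"],
        SessionNumber=1,
        Description="Duplicate production traffic for offline analysis",
    )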

With RACHAEL, we use a small, non-intrusive mechanism that duplicates (mirrors) the traffic that is going in and out of the application in production. The traffic is then stored and analyzed for a representative period of time (say, 8 hours for a workday) to extract the most valuable elements needed for your scenario. You get data on a wide range of usages — how different populations are actually using applications. For each such population, you can determine a representative synthetic user path that, say, 80% of users follow. You can also establish a realistic and accurate load policy (load curve) for each population. Do they spike very quickly from the beginning of the time frame? Or do they go up to a point and then plateau? Mirroring actual production traffic also reveals other important properties such as the think time between each action or the pacing between each session.
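
The sketch below shows how such properties could be derived once the mirrored traffic has been reassembled into sessions; it is a simplified illustration under that assumption, not the analysis RACHAEL performs.

    # Simplified sketch: assumes mirrored traffic has been reassembled into sessions.
    from collections import Counter
    from statistics import median

    def think_times(action_timestamps):
        """Gaps (in seconds) between consecutive actions within one session."""
        ts = sorted(action_timestamps)
        return [b - a for a, b in zip(ts, ts[1:])]

    def pacing(session_start_times):
        """Typical delay between the starts of successive sessions of the same user."""
        ts = sorted(session_start_times)
        gaps = [b - a for a, b in zip(ts, ts[1:])]
        return median(gaps) if gaps else None

    def load_curve(sessions, bucket_seconds=300):
        """Concurrent sessions per 5-minute bucket: the ramp-up shape of a population."""
        curve = Counter()
        for start, end in sessions:                # (start, end) in epoch seconds
            bucket = int(start // bucket_seconds)
            while bucket * bucket_seconds < end:
                curve[bucket * bucket_seconds] += 1
                bucket += 1
        return sorted(curve.items())               # [(bucket_start, concurrent_sessions), ...]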

Where AI can save a lot of time and headache is in defining variables and dynamic parameters. While modern performance engineering solutions such as NeoLoad make it significantly easier to manage parameters with dynamic values through advanced correlation and frameworks, determining which parameters are dynamic and generating data for these variables remain some of the most time-consuming tasks. Not only does AI accelerate the analysis, but mirroring production traffic captures data from the system under test that may be close to your production environment. With RACHAEL, a dataset is provided for each variable, so the effort of variabilization is drastically reduced. But you don’t have to use these datasets. If test data and production data are not close (e.g., healthcare, insurance), you can simply forgo the AI-generated datasets.
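
As a simplified illustration of the idea, the sketch below flags a parameter as dynamic when its value varies across recorded sessions and collects the observed values as a candidate dataset. It is only meant to convey the principle, not RACHAEL's actual classifier.

    # Sketch only: flags parameters whose values vary across sessions as dynamic.
    from collections import defaultdict

    def classify_parameters(sessions):
        """sessions: list of lists of (param_name, value) pairs, one inner list per session."""
        observed = defaultdict(set)
        for requests in sessions:
            for name, value in requests:
                observed[name].add(value)

        # A value that never changes is a constant; anything else is a candidate dynamic
        # parameter, and its observed values double as a starting dataset for variabilization.
        dynamic = {name: sorted(values) for name, values in observed.items() if len(values) > 1}
        static = {name: next(iter(values)) for name, values in observed.items() if len(values) == 1}
        return dynamic, static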

In sum, leveraging production traffic as a data source allows performance testers to extract with precision the crucial elements needed to build realistic, representative scenarios:

  • Variables and dynamic parameters, with auto-generated datasets
  • Representative user populations and their typical user paths in the application
  • Real-life properties such as the think time between each action or the pacing between each session
  • How to ramp up these populations to reflect the real production load curve

RACHAEL in action

Tricentis NeoLoad and its partners are currently deploying the RACHAEL prototype in several large companies across different industries for validation purposes. We have already validated the AI in controlled lab environments with increasingly complex scenarios. Now we want to validate the prototype at enterprise scale in real-world situations, with businesses that have a broad range of performance engineering activities.

At each company, we take a two-step approach: we deploy RACHAEL in a pre-production environment, tweaking the AI as needed and building trust with local teams, then deploy it in a production environment to get the representative elements for performance tests. Typical example use cases would be:

  • A hospitality company could use RACHAEL to understand API integrations. All the different booking sites (Expedia, Kayak, Booking.com, etc.) use APIs to pull information about the company’s hotel rooms — descriptions, photos, prices — and each API would correspond to a specific population with its own load curve and behaviors.
  • An online examination company could use RACHAEL to characterize different populations. First, the students taking the test would use one set of application features for 2 hours. Then a second population — the “teachers” — would access a different set of features to grade the tests. RACHAEL would detect two different populations with very different behaviors (because they’re using two completely different feature sets). The company could automatically see two different load curves, with different user paths and other elements such as pacing between sessions and think time between actions.
  • A national video streaming platform could use RACHAEL to define the load curve for prime-time viewing traffic. Specifically, the company could observe a 47X increase in traffic at 8:00 PM: a huge spike as everyone connects, then 15 minutes of browsing the catalog services before moving on to the video streaming features.

Once again, these are examples of possible use cases, but they give a good idea of the value RACHAEL results could provide and how it could cut the initial design effort for performance engineering projects in half.

Summary

Performance testing is simply simulating realistic traffic against a system to measure how it would behave under load.  So, if the test does not accurately simulate real-world conditions — if it is not representative — you will lack confidence in the system's performance in production. After all, an unrealistic test yields unrealistic results.

But pulling together the data that enables engineers to create a realistic test can take weeks. Today’s development/delivery cycles are shorter than that. Tricentis NeoLoad has built a prototype solution that utilizes traffic mirroring, machine learning, and other AI techniques to leverage live production traffic as a data source. The best way to ensure that the test mirrors real-life situations is to base it on real-life data.

The Tricentis NeoLoad prototype sits at the crossroads of technological enablement and organizational enablement. AI technology helps create more-realistic performance tests at today’s faster DevOps speed.

Figure: Prototype solution

 

About the author

Patrice Albaret


Patrice is the lead product manager of NeoLoad. He is a pragmatic and technical product manager with over 19 years of experience in technology, working at the intersection of technology, market needs, and business strategy. His background enables him to work collaboratively and honestly with software, data science, marketing, and executive teams to deliver high-value products in Agile environments.

Specialties include B2B SaaS, open source, innovation management, infrastructure and cloud computing, and urban decision support systems.

About Tricentis

Tricentis is the global leader in enterprise continuous testing, widely credited for reinventing software testing for DevOps, cloud, and enterprise applications. The Tricentis AI-powered, continuous testing platform provides a new and fundamentally different way to perform software testing. An approach that’s totally automated, fully codeless, and intelligently driven by AI. It addresses both agile development and complex enterprise apps, enabling enterprises to accelerate their digital transformation by dramatically increasing software release speed, reducing costs, and improving software quality. Tricentis has been widely recognized as the leader by all major industry analysts, including being named the leader in Gartner’s Magic Quadrant five years in a row. Tricentis has more than 1,800 customers, including the largest brands in the world, such as McKesson, Accenture, Nationwide Insurance, Allianz, Telstra, Moet-Hennessy-Louis Vuitton, and Vodafone.

Visit us at www.tricentis.com

 

 

 
