State of AI applied to Quality Engineering 2021-22
Section 9: Trust AI

Chapter 5 by BranchKey

Keep test criteria current using Federated Learning

Business ○○○○○
Technical ●●●●●


The key premise of federated learning is that models constructed on discrete data can be aggregated into a single global model. This article will walk through an example case showing the benefits federated learning brings to software testing.

Many tests rely on simulating expected user data or behaviour, either for pre-deployment testing or, more recently, for training machine learning models. These simulations and statistics are never as accurate as the real thing. We're going to look at how an emerging privacy-focused technology, Federated Learning, can be used to keep test criteria current, whether those criteria are static or dynamic. Today, we'll focus on a whole-system test model, zooming in on integrated components. We'll train our model on test data, deploy it to a production system for fine-tuning, and use Federated Learning to close the feedback loop on a machine learning system in real time.

For those new to Federated Learning, the key premise is that models constructed on discrete data can be aggregated into a single global model. Each individual model encodes critical information about the system it models. Once that information has been captured, the raw data is no longer required by the model and may be discarded. We can then combine these learned representations with models built on other datasets. Federated Learning is a privacy-by-design technology, bringing the models to the data rather than the other way around. This makes leaks of potentially sensitive data much less likely, especially at scale, because an attacker must compromise many separate locations rather than one central database.
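To make this premise concrete, here is a minimal sketch of the aggregation step in the style of federated averaging (FedAvg), assuming each local model can be flattened into a NumPy parameter vector. The function name and the sample-count weighting are illustrative assumptions, not BranchKey's actual API:

```python
import numpy as np

def federated_average(models, sample_counts):
    """Aggregate local model parameters into one global model,
    weighting each contribution by how much local data it saw.
    Only parameters are exchanged; raw data never leaves a site."""
    weights = np.asarray(sample_counts, dtype=float)
    stacked = np.stack([np.asarray(m, dtype=float) for m in models])
    return np.average(stacked, axis=0, weights=weights)

# Two sites trained on private data; only their parameters travel.
global_model = federated_average([[1.0, 2.0], [3.0, 4.0]], sample_counts=[1, 1])
```

The key property to notice: the aggregator only ever sees parameter vectors and counts, never the underlying observations.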

We’re especially interested in using Federated Learning to analyse our running services and learn their load profiles. This is a multi-fold problem: learning system behaviour requires a lot of data, to cover as much of the optimisation space as possible, while on the other hand we cannot compromise on security. How do we learn the failure cases of our services? How do we keep our monitoring stack up to date with a low level of human intervention? And how do we put security and quality at the front of this challenge? This article will walk through an example case to introduce anyone interested in accelerating automation in test.

Example Setting

Consider the following scenario: we want to analyze the load profile of our service pods, which were developed and deployed on our development Kubernetes cluster. The aim is to develop a strategy for their production deployment, reducing infrastructure costs and avoiding service interruptions. For this example, assume the code is well written and has a high level of test coverage, so the individual applications we accept are safe to run within reasonable constraints. The following questions arise before running the system tests:

  1. How do we know our load profile, and how will the various service pods interact under load?
  2. How do we identify and account for edge cases when they occur?
  3. In a dynamically changing system whose pods share underlying nodes, how do we optimize pod placement?
  4. How do we keep these system models up-to-date across different deployments or even different companies’ tech stacks?

We’re looking for a method to measure these variables, model them, test pre-deployment, then deploy the model alongside our production system, and most importantly, keep the production system updating our model with real data.



Building the service wrapper and data model

Let’s take a prototypical service for discussion. Redis is a widely used technology that may appear in many different systems, but those organisations may not want to share exactly what they’re using it for. We’re going to measure a few key variables here against time:

  1. Resource usage: CPU load and memory footprint
  2. Service load: Requests per Second (RPS), Request size
  3. Service response: Response status, Response time

Our service wrapper exposes these variables internally on the cluster for our model to use as input. This helps us in building our data model. To do this, we have created an interface that allows us to clean up our variables, define their units and ranges, and convert some qualitative descriptors such as service faults to categorical discrete numbers. These are fairly broad standardisations that can be applied to a large number of additional microservices.
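A minimal sketch of what such a standardisation interface might look like, assuming a handful of illustrative units and fault categories. The field names and the `FAULT_CODES` mapping are hypothetical, not the wrapper's actual schema:

```python
# Hypothetical mapping of qualitative service faults to categorical codes.
FAULT_CODES = {"ok": 0, "timeout": 1, "connection_refused": 2, "oom_kill": 3}

def standardise(raw: dict) -> dict:
    """Clean one raw metrics reading into the shared data model:
    fixed units, bounded ranges, and categorical fault codes."""
    clamp = lambda v, lo, hi: max(lo, min(hi, v))
    return {
        "cpu_load": clamp(float(raw["cpu_load"]), 0.0, 1.0),      # fraction of a core
        "mem_mb": max(0.0, float(raw["mem_bytes"]) / 1_048_576),  # bytes -> MB
        "rps": max(0.0, float(raw["rps"])),                       # requests per second
        "req_kb": max(0.0, float(raw["req_bytes"]) / 1024),       # request size in KB
        "resp_ms": max(0.0, float(raw["resp_seconds"]) * 1000),   # seconds -> ms
        # Unknown fault strings fall into a catch-all category.
        "status": FAULT_CODES.get(raw.get("status", "ok"), len(FAULT_CODES)),
    }
```

Because the output schema is service-agnostic, the same interface can be reused across many other microservices, as the text notes.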



Using our data model to build the dataset

We need to build our machine learning model on test data, profile the service, and deploy to our production environment for fine-tuning. We’ll put a reference load on each Redis pod of 20 requests per second, with a request size of 1 MB, to get our baseline. We can then read the data model variables output by our service wrapper (we have already controlled for RPS and request size). This gives us an output something like this:


Let’s run our test suite as normal, permute the input variables, and subject the staging system to some load to construct a training dataset. Using this dataset we construct a machine learning model for a chosen target, say response time or service outage (model choice isn’t the focus of this article, so we won’t dive into it here).
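As an illustration of this step, here is a minimal sketch that permutes the controlled inputs and fits a predictor for response time. The least-squares linear fit is only a stand-in, since the text deliberately leaves the model choice open; `measure` is a placeholder for whatever reads the target from the wrapper under load:

```python
import numpy as np

def build_dataset(measure, rps_levels, size_levels):
    """Permute the controlled inputs (RPS, request size in MB) and
    record the measured target (e.g. response time) for each combo."""
    X, y = [], []
    for rps in rps_levels:
        for size in size_levels:
            X.append([rps, size, 1.0])  # trailing 1.0 is a bias term
            y.append(measure(rps, size))
    return np.array(X), np.array(y)

def fit_response_model(X, y):
    """Least-squares linear predictor: a simple stand-in for
    whichever model the team actually chooses."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef
```

With a synthetic `measure`, the fit recovers the underlying coefficients, which is enough to sanity-check the pipeline before pointing it at real staging load.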

Up to now nothing special has happened, we simply wrapped our services in a metrics extractor, standardized our data model, and built some predictor for the given target. What makes this interesting is when we deploy to production and federate the system in order to improve the models.

Learning on Production

Now that we've migrated our service model from testing to production, we're experiencing false-positive alerts for production scenarios the models were never trained on. Our architecture follows a database-per-service pattern, with numerous Redis pods associated with independent services. This means that in a healthy condition, we can have between 5 and 10 distinct Redis deployments managing data formats, request sizes, and load patterns that are rather dissimilar. As a result, the number of valid system profiles for our production system is substantially larger than for our testing environment. This is why we need to use machine learning on production data: because modeling every possible outcome in test cases is frequently impractical, we rely on the real thing to determine our test requirements.

Each of these independent deployments has its own service wrapper and, depending on the use case, may have its own model. This is not essential in a single-cluster situation, since we transmit our metrics to our time-series Prometheus database first and model from there. However, when these clusters are cross-regional, or when a staging cluster is used to update a production model, or when clusters belong to separate organizations that do not wish to exchange raw metrics, we model per cluster and aggregate only the learnings. For this example, we've placed a model on each service and trained a model from each, resulting in the following:



Each service started from a generic model and trained an adjusted model on its individual load pattern (something like transfer learning). Our issue now is that the deployed replicas of our model have diverged from one another in their search for a valid solution to their local challenge. What is needed is to close this feedback loop, allowing the independent models to collaborate and federate into a global model that can be used by any service, without disclosing their data.
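One way to picture the local adjustment step: starting from the generic global parameters, each service takes a few gradient steps on its own data. This is a minimal sketch assuming, purely for illustration, a plain linear model and squared-error loss; the per-service models in practice may be quite different:

```python
import numpy as np

def fine_tune(global_coef, X_local, y_local, lr=0.05, epochs=500):
    """Adapt a generic global linear model to one service's own load
    pattern with a few gradient-descent steps (transfer-learning
    flavour): the global model is the starting point, local data
    pulls it toward the local optimum."""
    coef = np.array(global_coef, dtype=float)
    n = len(y_local)
    for _ in range(epochs):
        grad = 2.0 * X_local.T @ (X_local @ coef - y_local) / n
        coef -= lr * grad
    return coef
```

Each replica runs this against its private data, which is exactly why the replicas drift apart: every local optimum is a valid solution to a slightly different problem.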

Federating to build robustness

Multiple models are trained independently on distinct datasets, and continually aggregated into a single global model that outperforms the individual models yet has never seen any data. In the image below, we create a set of tests that covers a substantial fraction of the universe of possible outcomes (left); each color on the right of the illustration reflects a space discovered by an individual model. We're interested in merging all of these individual models into a unified model in order to have a better understanding of our system's overall health profile. This should show how we're going to increase the dynamic nature of our load profiles; let's delve deeper.



We now have a clear idea of the service we wish to profile. This Redis service is replicated ten times across our cluster, each handling a slightly different load profile and exhibiting slightly different behavior. However, because the replicas are the same Docker image, the fundamental profile of each Redis service should be comparable, despite the fact that they are used differently. Our model has been deployed to each of these services. Each deployment includes a service wrapper embedded within the pod and a lightweight monitoring model that is currently learning the profile of the pod. At the heart of our cluster is an aggregator to which each monitoring model sends parameters and some meta-data about the service version, the model version, the node ID, and the number of parameter adjustments, among other things. This aggregator is integrating divergent load profiles to create a global load profile for that service.
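A minimal sketch of one aggregation round as described above, assuming each monitoring model submits a parameter vector plus metadata. The field names, the version check, and the update-count weighting are illustrative assumptions, not the aggregator's real protocol:

```python
import numpy as np

def aggregate_round(submissions, expected_model_version):
    """Combine one round of monitor submissions into a global model.
    Each submission carries parameters plus metadata (service version,
    model version, node id, number of parameter adjustments, ...);
    submissions for a different model version are excluded, and the
    rest are averaged, weighted by how many updates they represent."""
    valid = [s for s in submissions if s["model_version"] == expected_model_version]
    if not valid:
        raise ValueError(f"no submissions for model version {expected_model_version}")
    weights = np.array([s["n_updates"] for s in valid], dtype=float)
    params = np.stack([np.asarray(s["params"], dtype=float) for s in valid])
    return np.average(params, axis=0, weights=weights)
```

The metadata matters operationally: it lets the aggregator refuse stale or mismatched submissions before they can pollute the global load profile.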

The final step

The aggregator has integrated these models into a single global model that describes the system's (almost) complete set of legitimate outcomes. The final stage in the procedure is to deploy this modified model. To begin, we notify each model deployment that an aggregated update is available; each model will then retrieve and conduct an inference test on a small sample of locally stored data to verify that the model has improved (this inference test can also be a transfer step whereby the global model is first adjusted back to local data before deployment). Now that we have a statistic to report to the central aggregator about the performance of the updates, and with a few more checks in place to avoid disrupting uptime or degrading performance, we can push the model update into the service monitor and continue the process.
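The acceptance check might look something like this minimal sketch, assuming a linear model and mean-squared-error loss purely for illustration; the real inference test (and the optional transfer step back to local data) would depend on the actual model family:

```python
import numpy as np

def mse(model, X, y):
    """Mean squared error of a linear model on a holdout set."""
    return float(np.mean((X @ np.asarray(model) - y) ** 2))

def should_deploy(current, candidate, X_hold, y_hold):
    """Inference test on a small sample of locally stored data:
    accept the aggregated update only if it does not degrade local
    performance, and produce the statistic that is reported back to
    the central aggregator."""
    cur_loss = mse(current, X_hold, y_hold)
    cand_loss = mse(candidate, X_hold, y_hold)
    accept = cand_loss <= cur_loss
    report = {"current_loss": cur_loss, "candidate_loss": cand_loss,
              "accepted": accept}
    return accept, report
```

In production this gate would sit alongside the other checks mentioned above (uptime and performance guards) before the update is pushed into the service monitor.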

This is a continually updated monitoring system that is trained using private service data that is never shared between locations, but the models are integrated to create a global model that can be reused in many locations. In certain circumstances, we may execute staging load tests that provide an updated consumption profile for a healthy service, while in others, we can capture production failures to replicate in staging and learn about the edge case's constraints. We are not looking to overlook outliers; rather, we are looking to include them into our models. Our models should be able to detect correlations between control variables and error in order to provide evidence of causality. This provides an explanation for why the system level models are triggering and provides us with alarms for the next time this zone is entered.


In summary, we demonstrated a wrapper for our services that monitors and extracts metrics. These services can be deployed across clusters, within organizations, between regions, in staging/test/production environments, or even across different organizations willing to collaborate in order to reduce wasted resources replicating failure models and boost up-time. Each of these services, regardless of where they are deployed, has a model that is tailored to the service's unique load profile. Following that, without ever disclosing the purpose for which the service was used or the data that passed through it, we can extract an anonymised model of system behavior and federate it with others to gain a better understanding of these services under pressure. This is used to forecast the placement of pods on multi-node clusters and to configure service alarms for use in scaling plans (such as limiting RPS or pre-emptively replicating pods to serve demand). All of this is accomplished dynamically through the use of a collection of relatively simple machine learning models that are federated and collaborate to solve a common problem.

This article has offered an insight into BranchKey’s method for addressing the 4 questions outlined above and we’d love to hear your opinion.

  1. How do we know our service load profile?
    We learn our load profile through service wrappers collecting data from simulated and real load.
  2. How do we identify and account for edge cases when they occur?
    We catch outliers, incorporate them, and use them to learn more about our systems, turning testing into not only a preventative exercise but an exploratory one.
  3. How do we optimize pod placement?
    We can predict load profiles of services and select which node-groups these pods belong to.
  4. How do we keep these system models up-to-date?
    These systems constantly update each other in their federation, with automated checks, alarms, audit trails, and manual intervention to control quality.

About the author

Diarmuid Kelly

Diarmuid Kelly is Co-Founder & Head of Product at BranchKey. He holds a Master's degree in Artificial Intelligence from the University of Groningen, the Netherlands, specialising in distributed optimisation in multi-agent systems. With experience as a test automation engineer and distributed systems engineer, he focuses on scaling optimisation systems to edge-based compute, with an emphasis on security and efficiency. His passion is building large-scale IT systems under constraints that solve everyday problems.

About BranchKey

BranchKey is a provider of Federated Learning-as-a-service. We believe in building a world where machines can collaborate securely, improve their learnings from data, and keep you, the data owner, in control of your data. To achieve this we offer a federated learning platform that allows teams of machine learners to process data on the edge and collaborate to find the optimal solution. BranchKey’s sector focus is multi-disciplinary.

Visit us at