In summary, we demonstrated a wrapper for our services that monitors them and extracts metrics. These services can be deployed across clusters, within organizations, between regions, in staging/test/production environments, or even across different organizations willing to collaborate to avoid wasting resources duplicating failure models and to improve uptime. Each service, regardless of where it is deployed, has a model tailored to its unique load profile. From each service we can then extract an anonymised model of system behavior, without ever disclosing what the service was used for or the data that passed through it, and federate it with others to better understand how these services behave under pressure. The federated models are used to forecast pod placement on multi-node clusters and to configure service alarms that feed scaling plans (such as limiting RPS or pre-emptively replicating pods to meet demand). All of this happens dynamically, through a collection of relatively simple machine learning models that are federated and collaborate to solve a common problem.
This article has offered an insight into BranchKey’s method for addressing the four questions outlined above, and we’d love to hear your opinion.
- How do we know our service load profile?
We learn our load profile through service wrappers that collect metrics under both simulated and real load.
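
To make this concrete, here is a minimal sketch of such a wrapper in Python. The names (`MetricsWrapper`, `observe`, `snapshot`) are illustrative assumptions, not BranchKey's actual API; the point is that only summary statistics ever leave the service, never payloads.

```python
import time
from collections import defaultdict

class MetricsWrapper:
    """Hypothetical service wrapper that records per-endpoint
    request counts and latencies to build a load profile."""

    def __init__(self):
        self.latencies = defaultdict(list)  # endpoint -> observed latencies (s)
        self.requests = defaultdict(int)    # endpoint -> request count

    def observe(self, endpoint, handler):
        """Wrap a request handler so every call feeds the load profile."""
        def wrapped(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                self.latencies[endpoint].append(time.perf_counter() - start)
                self.requests[endpoint] += 1
        return wrapped

    def snapshot(self):
        """Aggregate raw observations into an anonymised profile:
        summary statistics only, no request data."""
        return {
            ep: {
                "request_count": self.requests[ep],
                "p50_latency": sorted(ls)[len(ls) // 2],
                "mean_latency": sum(ls) / len(ls),
            }
            for ep, ls in self.latencies.items() if ls
        }

# Usage: metrics = MetricsWrapper(); handler = metrics.observe("/login", login_handler)
```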
- How do we identify and account for edge cases when they occur?
We catch outliers, incorporate them, and use them to learn more about our systems, turning testing into an exploratory exercise as well as a preventative one.
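
As a sketch of the idea, a simple z-score rule can stand in for whatever detector is used in production; `flag_outliers` and the `profile` dict below are hypothetical, not a real schema.

```python
import statistics

def flag_outliers(samples, threshold=3.0):
    """Flag load observations far from the running mean of the profile."""
    if len(samples) < 2:
        return []
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    return [x for x in samples if sigma and abs(x - mu) / sigma > threshold]

def incorporate(profile, outliers):
    """Rather than discarding edge cases, fold them back into the
    training data so the next model revision has seen them."""
    profile.setdefault("edge_cases", []).extend(outliers)
    profile["needs_retrain"] = bool(outliers)
    return profile
```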
- How do we optimize pod placement?
We predict the load profiles of services and use them to select which node groups their pods belong to.
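
As a rough illustration, a placement rule might map a predicted profile to a node group and render the decision as a Kubernetes `nodeSelector` fragment. The thresholds and group names below are invented for the example.

```python
def select_node_group(predicted):
    """Toy placement rule: route pods to node groups by predicted
    resource pressure. Thresholds are illustrative assumptions."""
    if predicted["peak_rps"] > 1000 or predicted["p99_latency"] > 0.5:
        return "high-capacity"   # isolate heavy, latency-sensitive services
    if predicted["cpu_mean"] < 0.2:
        return "burstable"       # cheaper nodes for mostly idle services
    return "general"

def pod_node_selector(service, predicted):
    """Render the decision as a partial Kubernetes pod spec."""
    return {
        "metadata": {"labels": {"app": service}},
        "spec": {"nodeSelector": {"node-group": select_node_group(predicted)}},
    }

# A service forecast to spike at 1.5k RPS lands on high-capacity nodes.
print(pod_node_selector("auth-api", {"peak_rps": 1500, "p99_latency": 0.3, "cpu_mean": 0.6}))
```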
- How do we keep these system models up-to-date?
These models constantly update one another through the federation, with automated checks, alarms, audit trails, and manual intervention to control quality.
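
To illustrate one such federation round with quality control, here is a hedged sketch: plain federated averaging with a drift check against a robust (median) centre, where rejections are logged as an audit trail. The model format and tolerance are assumptions for illustration, not BranchKey's actual pipeline.

```python
import logging
import statistics

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("federation-audit")

def federated_average(local_models, tolerance=0.5):
    """One federation round: average parameter vectors from peers,
    rejecting contributions that drift too far from the group median
    (an automated quality check) and logging an audit trail."""
    n = len(local_models[0])
    median = [statistics.median(m[i] for m in local_models) for i in range(n)]

    accepted = []
    for idx, m in enumerate(local_models):
        drift = max(abs(m[i] - median[i]) for i in range(n))
        if drift > tolerance:
            # Alarm fires; a human can inspect the audit trail.
            log.warning("model %d rejected: drift %.3f exceeds %.2f", idx, drift, tolerance)
            continue
        accepted.append(m)

    if not accepted:
        raise RuntimeError("all contributions rejected; manual intervention required")
    log.info("round complete: %d/%d models accepted", len(accepted), len(local_models))
    return [sum(m[i] for m in accepted) / len(accepted) for i in range(n)]

# Example round with three peers; the third is an obvious outlier and is dropped.
print(federated_average([[0.9, 1.1], [1.0, 1.0], [5.0, -3.0]]))
```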