Use Case #3: Shift-Left AIOps to Enable Test-Driven Operations
In the previous use case, we talked about a reactive approach to automating operational tasks. You wait for a problem, you learn about the root cause, and then you either make sure it doesn’t happen again or you automate the remediation for the next time the problem comes around.
This approach can be compared to engineers writing tests for their code only after users complain about features not working as expected, and not before. Fortunately, most engineering teams nowadays practice test-driven development, where the tests are created first and a code change is only accepted if the tests are green. AIOps can help us achieve something similar that I like to call “Test-Driven Operations”.
SREs know how systems should behave in production under different workloads and conditions. Service level objectives (SLOs) are commonly used to validate and report behavior in production, and auto-remediation scripts are used to keep the system available.
The more proactive, “Shift-Left” approach is to test resiliency and auto-remediation scripts before entering production. Auto-remediation scripts are also code, and that code should be treated the same way as business code. This means that there must be a clear definition of what the auto-remediation code does, e.g., restarting a service. There must also be a way to validate the intended outcome, e.g., the restart completes with a maximum allowed downtime of 10 seconds. Finally, that code must be triggered by the AIOps solution when there is a problem with that service, so we also need to test and validate that the AIOps solution can pick up problems in our “Test-Driven Operations” environment.
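The three requirements above (a defined action, a validated outcome, and a detected problem) can be written down as an ordinary test. The following is only a minimal sketch in Python: `FakeService`, `detect_problem`, `remediate`, and `run_remediation_test` are hypothetical names invented for illustration, the service and its clock are simulated, and `detect_problem` merely stands in for whatever signal a real AIOps solution would raise.

```python
from dataclasses import dataclass

# Assumption: the allowed downtime from the example in the text (10 seconds).
MAX_DOWNTIME_SECONDS = 10.0


@dataclass
class FakeService:
    """Hypothetical stand-in for a service running in pre-production."""
    healthy: bool = True
    restart_duration: float = 4.0  # simulated seconds a restart takes
    downtime: float = 0.0          # accumulated simulated downtime

    def inject_failure(self) -> None:
        # Simulates the fault that a chaos tool would introduce.
        self.healthy = False

    def restart(self) -> None:
        # A restart costs `restart_duration` seconds of simulated downtime.
        self.downtime += self.restart_duration
        self.healthy = True


def detect_problem(service: FakeService) -> bool:
    """Stand-in for the AIOps solution reporting a problem on the service."""
    return not service.healthy


def remediate(service: FakeService) -> None:
    """The auto-remediation under test: restart the service."""
    service.restart()


def run_remediation_test(service: FakeService) -> dict:
    """Break the service, then check detection, recovery, and downtime."""
    service.inject_failure()
    detected = detect_problem(service)
    if detected:
        remediate(service)
    within_budget = service.downtime <= MAX_DOWNTIME_SECONDS
    return {
        "detected": detected,
        "recovered": service.healthy,
        "downtime": service.downtime,
        "passed": detected and service.healthy and within_budget,
    }


if __name__ == "__main__":
    print(run_remediation_test(FakeService()))
    print(run_remediation_test(FakeService(restart_duration=15.0)))
```

In this sketch a restart that takes 4 simulated seconds passes, while one that takes 15 seconds fails the downtime check even though the service recovers. That second case is exactly the kind of remediation you would want to catch before production.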
In practical terms, this approach can be implemented using a combination of performance testing and chaos engineering tools in a pre-production environment that is also monitored by an AIOps solution. Keptn, the open-source project referenced earlier, supports this use case. Keptn orchestrates the execution of load tests, injects chaos, executes the auto-remediation scripts, and validates the desired outcome through the AIOps integration.
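To make the order of these steps concrete, here is a rough sketch of such a sequence in plain Python. It is not Keptn's API or configuration format: `run_sequence`, the stage functions, and the context keys are all hypothetical, and each stage only simulates what a real load-testing tool, chaos tool, remediation script, or AIOps evaluation would do.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a name plus a function that reads/updates a shared context
# and returns True on success. All names here are hypothetical.
Stage = Tuple[str, Callable[[Dict], bool]]


def run_sequence(stages: List[Stage], context: Dict) -> List[Tuple[str, bool]]:
    """Run stages in order and stop at the first one that fails."""
    results = []
    for name, step in stages:
        ok = step(context)
        results.append((name, ok))
        if not ok:
            break
    return results


def load_test(ctx: Dict) -> bool:
    # Simulated: a real setup would start a load-testing tool here.
    ctx["load_running"] = True
    return True


def inject_chaos(ctx: Dict) -> bool:
    # Simulated: a real setup would break the service with a chaos tool.
    ctx["service_healthy"] = False
    return True


def remediate(ctx: Dict) -> bool:
    # Simulated: the auto-remediation restarts the service in 6 seconds.
    ctx["service_healthy"] = True
    ctx["downtime_seconds"] = 6
    return True


def evaluate(ctx: Dict) -> bool:
    # Simulated: a real setup would query the monitoring/AIOps solution.
    healthy = ctx.get("service_healthy", False)
    downtime = ctx.get("downtime_seconds", float("inf"))
    return healthy and downtime <= ctx["max_downtime_seconds"]


STAGES: List[Stage] = [
    ("load_test", load_test),
    ("inject_chaos", inject_chaos),
    ("remediate", remediate),
    ("evaluate", evaluate),
]

if __name__ == "__main__":
    for name, ok in run_sequence(STAGES, {"max_downtime_seconds": 10}):
        print(f"{name}: {'ok' if ok else 'FAILED'}")
```

The final `evaluate` stage acts as the quality gate: with a 10-second budget the simulated run passes, and tightening `max_downtime_seconds` to 5 makes the same run fail at the last stage.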
In summary: AIOps is great for ensuring healthy systems in production, but its true power comes from integrating it into the engineering process. Just as test-driven development has led to better quality code, Test-Driven Operations will lead to better and more stable production systems.