A question of bias and why we need to change our approach to testing in an AI World
It’s been a long time since I sat down and watched 2001: A Space Odyssey, but a few months ago, that’s just what I did. If you haven’t seen the film, which was released back in 1968, there’s a stand-out character: HAL, a lovable, if somewhat murderous, AI that runs the spaceship. HAL is an acronym for heuristic and algorithmic, which in the words of the film’s co-writer and director Stanley Kubrick were “the two methods of computer programming”.
Watching the film again several years after I last saw it, I approached it with my software testing head on. I had good reason for doing so. I am interested in the explosive growth in the use of AIs across all sorts of industries.
Not a day goes by without something new appearing. I have a few Amazon Alexas at home and, although Alexa and other home agents still have some way to go before they start reaching the level of HAL’s complexity, ultimately, it’s the task of developers and testers to make them work. But how we achieve this needs to change. As AIs proliferate we, as testers, have to test AI systems as part of our normal workload, as well as deal with tools that employ AIs to support the testing process.
Translating data into action
To return briefly to HAL, ‘he’ wasn't evil. But he had been given some conflicting instructions that had left him with no option but to try and kill the spaceship crew. He'd been trained to relay information accurately to the crew, but also to withhold information from them. Killing the crew was his solution to resolving the problem.
Clearly, that’s taking the singular focus of AIs to the extreme. Nonetheless it serves to illustrate the clear goals many systems/AIs tend to have, such as to win at chess, recognize a picture, or determine a person’s gender. They use complex computational algorithms to translate vast volumes of data into an action. But what if there is a bias in the initial data set? What impact will that have on the outcomes?
Let me explain. A recent research paper from MIT and Stanford University discusses the findings of a study of three facial-analysis programs. The study found that all three programs demonstrated both skin-type and gender bias. The data sets on which the programs based their actions had a clear bias towards white males. Now, translate that into AI used in manufacturing, say in the development of a safety feature for a connected car. What if the original data on which the feature is developed and tested has a bias? How confident would you be of its efficacy in terms of keeping you safe?
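One practical way for a tester to look for this kind of bias is to stop reporting a single overall accuracy figure and break the results down by group instead. The following is a minimal sketch in Python, not the method used in the study: the function names, the group labels and the sample records are all invented for illustration.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {group: fraction of correct predictions}.

    Each record is a (group, predicted, actual) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        if predicted == actual:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(records):
    """Difference in accuracy between the best- and worst-served group."""
    scores = accuracy_by_group(records)
    return max(scores.values()) - min(scores.values())


# Invented illustrative data, not taken from any study.
sample = [
    ("group_a", "yes", "yes"),
    ("group_a", "no", "no"),
    ("group_a", "yes", "yes"),
    ("group_a", "no", "no"),
    ("group_b", "yes", "no"),
    ("group_b", "no", "no"),
]

print(accuracy_by_group(sample))
print(accuracy_gap(sample))
```

In this made-up sample, five of the six predictions are correct, which looks respectable as one overall number, yet every error falls on the smaller group. A test could fail the build when the gap between groups exceeds a threshold the team has agreed on; where that threshold sits is a judgment call for the project, not something the code can decide.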
Measuring the unexpected
I’m not scaremongering here. Rather, I am setting out my stall for testers to adopt a new approach: one that accommodates bias. That’s because we have to adapt to make sense of our new, digital world. Even before we had testers, we had the mantra “what are the expected results?”. It's been fundamental to testing for many, many years. No expected results, no defect!
Now we have to change that perception. We have to move away from measuring what we expect because, with AIs, you are likely to get results you didn’t expect. This is something we've started to see at Sogeti, both with some of the systems we have been testing for our customers and in our own tools and engines that we've been working on for the last few years. Even with a test data set you think you know well, you can get unexpected results.
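When there is no single expected result to compare against, one option is to test properties that any acceptable answer must satisfy. This is sometimes called property-based or metamorphic testing. The sketch below is only an illustration under stated assumptions: `risk_score` is a hypothetical stand-in for a trained model, and the two properties checked are examples I have chosen, not rules from any particular system.

```python
def risk_score(age, previous_claims):
    """Hypothetical stand-in for a trained model; returns a score from 0 to 1."""
    raw = 0.1 + 0.15 * previous_claims + (0.2 if age < 25 else 0.0)
    return min(raw, 1.0)


def score_stays_in_range(model, cases):
    """Property 1: every score is a valid value between 0 and 1."""
    return all(0.0 <= model(age, claims) <= 1.0 for age, claims in cases)


def more_claims_never_lowers_score(model, cases):
    """Property 2: adding one more claim must not reduce the risk score."""
    return all(
        model(age, claims + 1) >= model(age, claims) for age, claims in cases
    )


# Invented test inputs: (age, previous_claims).
cases = [(18, 0), (24, 3), (25, 0), (40, 2), (70, 10)]

print(score_stays_in_range(risk_score, cases))
print(more_claims_never_lowers_score(risk_score, cases))
```

Neither check says what the score for a given person should be. They only say what would be unreasonable, which is often as much as a tester can state up front for an AI system. A model that rewarded extra claims with a lower score would fail the second check even though no expected value was ever written down.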
I saw this recently in a long-running program tracking defects, tests and feature changes. In this particular instance, the project owners had started prioritizing features based on the original program assumptions, rather than updating them with lessons learned and changes in the operating model. So, they brought in more defects which, while genuine, weren't errors that would be found on day one. But as these defects were fixed, the fixes started to destabilize the rest of the code base. My point here is that there was a bias towards the original program assumptions, and this changed the way the program progressed. If you put bad or incorrect data into the first-generation test, the error carries over and is magnified in the second generation.
Applying cognitive thinking
So, what should we do as testers in this new world of AI: a world where it is easy to get the wrong result; and where one wrong decision based on those results could kill your business?
We need to be aware of our own bias, or the bias of the design. We need to be able to challenge the AI, to test it and ask why it is giving the result we’re seeing. Just because there is no expected result to compare against doesn’t mean there is no defect. Testers are gradually getting involved. Last summer, I asked people in my office if they’d been involved in developing or testing an AI. Just one person said yes. Six months later, I asked again, and six people said yes.
Three key takeaways
It is still early days. We are likely to end up taking some lessons from big data, and some from other engineering practices. AI is clever, but it is a long way off being intelligent. As humans we can, and must, apply critical thinking and not assume a result is 100% correct. AI is still largely driving yes and no decisions (yes, insure this person; no, don’t insure them), whereas human testers can apply generic patterns and think cognitively. What’s clear to me is that we are all on an exciting journey.
More tips can be found in the NEW book 'Testing in the Digital Age' (available 1st June 2018)