State of AI applied to Quality Engineering 2021-22
Section 9: Trust AI

Chapter 1 by Sogeti

Quality engineering applied to artificial intelligence

Business ●○○○○
Technical ●●●●○

To accelerate the rate of AI adoption amongst the public, corporations must learn to identify and mitigate potential ethical risks effectively. This means having a quality framework in place to guide the development of machine learning solutions to instill trust and integrity among relevant stakeholders and end-users.

As artificial intelligence (AI) continues to advance and become an industry standard for performing and automating everyday tasks, the barrier to widespread AI development and adoption is no longer the technology itself or the skillset required; it is the ‘human’ aspect of it all: ethics, morality, transparency, and governance. Corporations must now learn to identify and mitigate potential risks (such as a discriminatory model) effectively, so that biased, unfair machine learning (ML) models are not put into production[1]. A quality framework that guides the development of ML solutions is therefore crucial to instill trust and integrity among relevant stakeholders.

This chapter introduces a Sogeti developed Quality AI Framework (QAIF) to test ML and AI algorithms in all phases of the AI development cycle. The framework provides a practical and standardized way of working that outputs trustworthy AI.


Testing AI is non-traditional

In the executive introduction of this research, we stated that the advancement of artificial intelligence raises two fascinating questions about our field:

  • Testing with AI: How do we use artificial intelligence to make quality validation smarter? This is all about leveraging ML and analytical solutions to accelerate the testing & development pipeline.
  • Testing of AI: How do we easily and effectively validate AI solutions? This is all about validating the data, algorithm, outcome, and ethical considerations.

The time has come to approach the latter, which is what this chapter will focus on. In 2019, the EU released a set of guidelines for building trustworthy AI[2] that include ethical pillars such as fairness, transparency, and robustness, in addition to accountability, privacy, traceability and lawfulness. However, the guidelines did not include a practical implementation of these principles.

To understand how to create practical ethics tests and quality control checks to test AI, we must first understand how AI is developed and implemented.


From idea to operations: understanding the AI lifecycle

The process of developing and implementing an AI solution always starts with the business case. A business problem is defined and scoped and then turned into design requirements for data scientists – this is called the Business Understanding phase. Next, training and test data are collected in the Data Understanding phase. In the Data Preparation phase, the data is analyzed, sampled, and pre-processed for the chosen AI model. Then the Model Development phase can begin. In this phase, the AI model is trained, tuned, and evaluated. In the Model Evaluation phase, the model is checked against the business requirements; once it meets them, it can be tested and deployed. In the Model Deployment phase, the AI system goes into production and its performance is monitored. These phases are iterative.

Testing at each phase of the AI lifecycle

Inspired by the CRISP-DM Framework[3], QAIF is a cohesive, generic framework that can be tailored to any AI solution. It is designed to help product managers and business owners identify and mitigate potential risks at each stage of the AI lifecycle. The framework is governed by the EU ethics principles: Fairness, Transparency, Accountability, Traceability and Robustness. We add a gate to each phase – just like in traditional software testing – to ensure certain quality control checks are completed. The following paragraphs describe the quality control tests that should be conducted at each phase of the AI lifecycle to ensure quality and ethical adherence.

Figure: The QAIF overview.

[3] Shearer C., The CRISP-DM model: the new blueprint for data mining, J Data Warehousing (2000); 5:13–22.

Business understanding (Gate 1)

In this phase, stakeholders, product requirement specifications, technical design specifications, performance metrics and ethical/legal compliance requirements are identified and agreed upon so that the development process can be initiated. To comply with ethics standards:

Under the traceability principle, lay the foundation for an MLOps based AI project[4], ensuring model and data version control in subsequent phases. MLflow[5] is a useful package to consider for this principle as it provides support for “packaging data science code in a reusable and reproducible way”.

Under the fairness principle, define fairness metrics and detection methods if the business problem applies to sensitive groups. The fairlearn[6] package provides support for bias mitigation methods and metrics for a model fairness assessment which can be used in the Model Development phase.

Under the privacy principle, consider the possible ethical breaches before starting the project. Focus on data sources and privacy concerns and enable mitigation methods in the Data Preparation phase.

After these quality control checks are agreed upon, review best practices and next steps with all the relevant stakeholders.


Data Understanding (Gate 2)

This phase brings specifications from the first gate together with domain knowledge and experience, to understand inherent biases, assumptions, and privacy concerns in the collected data.

Under the traceability principle, data version control[7] is used to set a documented specification that records where the data is located, what kind of source it comes from, who is responsible for it, and whether there are any privacy or quality concerns. To accomplish this, consider using the DVC Python package[8], which is built on git version control.
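The documented specification described above can be pictured as a small manifest stored next to each data file. The sketch below uses only the Python standard library and is not DVC's API; the field names (`source`, `owner`, `concerns`) and the sample file are hypothetical and only illustrate what such a record could contain.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(path, source, owner, concerns):
    """Describe one data file: location, content hash, origin, owner, concerns."""
    return {
        "location": str(path),
        "sha256": file_sha256(path),      # changes whenever the content changes
        "size_bytes": path.stat().st_size,
        "source": source,                 # hypothetical field names
        "owner": owner,
        "concerns": concerns,
    }


def demo():
    """Write a tiny sample file plus its manifest into a temporary folder."""
    with tempfile.TemporaryDirectory() as tmp:
        data_file = Path(tmp) / "customers.csv"
        data_file.write_bytes(b"id,age\n1,34\n2,51\n")
        manifest = build_manifest(
            data_file,
            source="CRM export",
            owner="data-engineering",
            concerns=["contains age"],
        )
        manifest_file = Path(tmp) / "customers.manifest.json"
        manifest_file.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
        return manifest


manifest = demo()
```

Committing such a manifest to git alongside the code gives a reviewer a traceable link between a model version and the exact data it was trained on.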

Under the fairness principle, data collection methods are reviewed and checked for data adequacy (e.g. missing data for certain groups) and bias to prepare for deploying specific mitigation techniques in the Data Preparation phase.

Under the privacy principle, focus on assessing the data sources for personally identifiable information (PII) and consider using synthetic data to mitigate GDPR risk. For this, the Sogeti Artificial Data Amplifier[9] (ADA) can be used.
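The data adequacy check mentioned under the fairness principle (missing data for certain groups) could look roughly like the sketch below. It is a minimal illustration with made-up records; the column names `gender`, `income` and `age` are assumptions, not part of the QAIF.

```python
from collections import defaultdict


def missing_rate_by_group(records, group_key, fields):
    """Return {group: fraction of missing values} over the given fields."""
    missing = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record.get(group_key, "unknown")
        for field in fields:
            total[group] += 1
            # Treat None and empty strings as missing values.
            if record.get(field) in (None, ""):
                missing[group] += 1
    return {group: missing[group] / total[group] for group in total}


# Hypothetical sample data for illustration only.
records = [
    {"gender": "F", "income": 42000, "age": 31},
    {"gender": "F", "income": None, "age": 45},
    {"gender": "M", "income": 39000, "age": 29},
    {"gender": "M", "income": 51000, "age": 38},
]

rates = missing_rate_by_group(records, "gender", ["income", "age"])
```

A large gap between groups in this report is a signal to plan imputation or additional data collection in the Data Preparation phase.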

End this phase by completing a data audit on available data sources and corresponding responsibilities of the team.


Data Preparation (Gate 3)

The data engineering team, domain experts and model developers play crucial roles in the Data Preparation phase. This phase is defined by tasks like data mining, a data quality assessment with exploratory data analysis (EDA), and training data construction.

Under the traceability principle, data version control is once again executed, together with setting up a data pipeline to ensure complete transparency into the data preparation steps. For this, the TensorFlow input pipeline[10] API can be used to “build complex input pipelines from simple, reusable pieces; handle large amounts of data, read from different data formats, and perform complex transformations”.

Under the fairness principle, we deploy bias mitigation methods to ensure our training data is “judgement free” so our model will be as well. We can mitigate bias in the data by re-weighting the minority group, oversampling the minority group or under-sampling the majority group. Imputation methods can also be used to reconstruct missing data to ensure the dataset is representative[11].
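As an illustration of re-weighting, the sketch below assigns each training sample a weight so that, after weighting, the label is statistically independent of the group. This is one common reweighing scheme and an assumption on our part; it is not necessarily the variant used inside the QAIF, and the groups and labels are invented.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Weight each sample by expected / observed frequency of its (group, label) pair."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    pair_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * pair_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def weighted_positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    positive = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    total = sum(w for g, w in zip(groups, weights) if g == group)
    return positive / total


# Hypothetical data: group "A" has a higher positive rate than group "B".
groups = ["A", "A", "A", "A", "B", "B"]
labels = [1, 1, 1, 0, 1, 0]

weights = reweighing_weights(groups, labels)
rate_a = weighted_positive_rate(groups, labels, weights, "A")
rate_b = weighted_positive_rate(groups, labels, weights, "B")
```

Many training APIs accept such per-sample weights, which makes this approach less invasive than oversampling or under-sampling because no rows are duplicated or dropped.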

Under the privacy principle, we generate synthetic data to mitigate privacy risks, or to boost our training data if insufficient data was flagged as a limitation in the previous phase.

Provide an EDA report on the training data to a technical auditor for approval before moving to the next phase.


Model Development (Gate 4) 

The AI Model Development phase starts with high quality training data. Model developers have the main responsibility in this phase - ensuring that the AI model they are developing is suited for the application and works with the data prepared in phases 2 and 3. To be sure of this, performance metrics are drawn from the model and presented to the stakeholders. Furthermore, we test the model performance and functionality on the most granular level.

Under the traceability and auditability principle, use MLflow for model versioning and to log output. Split repositories and pipelines into development, acceptance, and production. Use a git version control system to push a new version of the code to different environments. Add required reviewers in each step to ensure traceability.

Under the fairness principle, assess model adequacy and model bias through adversarial debiasing[12] with generative adversarial networks (GANs), ensuring equal outcomes for all groups. IBM’s Trusted AI[13] is a Python package that can be used here.

Under the robustness principle, test the model performance[14] on the most granular level and provide the accuracy scores, area under the curve, F1 score, confusion matrix, and mean squared and mean absolute errors. Extend code coverage by unit testing your code[15] using the Python unittest framework[16].
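The sketch below shows how a few of these metrics can be computed from the confusion matrix and covered by a unit test with the standard `unittest` module. The function names and sample predictions are illustrative assumptions; in practice an established metrics library would normally be used.

```python
import unittest


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1; undefined ratios become 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


class ClassificationMetricsTest(unittest.TestCase):
    def test_known_example(self):
        metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
        self.assertAlmostEqual(metrics["accuracy"], 0.6)
        self.assertAlmostEqual(metrics["precision"], 2 / 3)
        self.assertAlmostEqual(metrics["recall"], 2 / 3)
        self.assertAlmostEqual(metrics["f1"], 2 / 3)

    def test_no_positive_predictions(self):
        metrics = classification_metrics([1, 0], [0, 0])
        self.assertEqual(metrics["f1"], 0.0)


# Run the test case programmatically so the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClassificationMetricsTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Testing edge cases such as "no positive predictions" is what catches division-by-zero errors before they reach a model quality report.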

Finally, implement explainable AI (XAI) techniques like LIME[17] and SHAP[18] to understand model predictions. This adds transparency and interpretability to the model. XAI aims to mimic model behavior at a global and/or local level to help explain how the model came to its decision. SHAP (SHapley Additive exPlanations) is a method based on a game theory approach to explain individual predictions. LIME is model-agnostic and provides local model interpretability: it perturbs a single data sample by tweaking the feature values and observes the resulting impact on the output. The output is a list of explanations, reflecting the contribution of each feature to the prediction of a data sample. It also helps determine which feature changes will have the most impact on the prediction. For business owners and regulators, this explainable layer is especially important in understanding what is behind the ‘black-box’ model and prediction.
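To make the game theory idea behind SHAP concrete, the sketch below computes exact Shapley values by brute force for a tiny model. It does not use the SHAP library, which relies on much more efficient approximations. The `toy_model`, the instance and the all-zero baseline are hypothetical, and treating "absent" features as taking the baseline value is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(instance)
    features = range(n)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in features]
        return predict(mixed)

    contributions = []
    for i in features:
        others = [j for j in features if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            # Probability weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions


def toy_model(x):
    """Hypothetical scoring model with one interaction term."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


instance = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the prediction for the instance and the prediction for the baseline, and the interaction term is split evenly between the two features involved.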

Complete this gate by providing a model quality report to a technical reviewer for approval.


Model Evaluation (Gate 5)

As the model has already been validated on the most granular level in the previous phase, this gate’s tasks focus on ensuring that the model is transparent and works according to the business ethical considerations set in the Business Understanding phase. As the most important phase in the QAIF, it ensures that the AI model is fair and understandable. Stakeholders include testers, developers, and the legal team.

Under the fairness principle, assess if the model is biased through metrics like[19]:

  • Statistical Parity Difference: The difference in the rate of favorable outcomes received by the minority group compared to the majority group.
  • Equal Opportunity Difference: The difference of true positive rates between minority and majority groups.
  • Average Odds Difference: The average of the differences in false positive rate and true positive rate between minority and majority groups.
  • Disparate impact: The ratio of the rate of favorable outcomes for minority groups compared to majority groups.
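
The four metrics above can be computed directly from labels, predictions and group membership. The sketch below is a minimal plain-Python version with invented data; the group names `min` and `maj` are placeholders, and the sign convention (minority minus majority) is an assumption that should match whichever fairness toolkit is used in practice.

```python
def rate(values):
    """Mean of a list of 0/1 values, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


def group_rates(y_true, y_pred, groups, group):
    """Return (selection rate, true positive rate, false positive rate) for one group."""
    idx = [i for i, g in enumerate(groups) if g == group]
    selection = rate([y_pred[i] for i in idx])
    tpr = rate([y_pred[i] for i in idx if y_true[i] == 1])
    fpr = rate([y_pred[i] for i in idx if y_true[i] == 0])
    return selection, tpr, fpr


def fairness_metrics(y_true, y_pred, groups, minority, majority):
    """Compute the four bias metrics, always as minority relative to majority."""
    sel_min, tpr_min, fpr_min = group_rates(y_true, y_pred, groups, minority)
    sel_maj, tpr_maj, fpr_maj = group_rates(y_true, y_pred, groups, majority)
    return {
        "statistical_parity_difference": sel_min - sel_maj,
        "equal_opportunity_difference": tpr_min - tpr_maj,
        "average_odds_difference": 0.5 * ((fpr_min - fpr_maj) + (tpr_min - tpr_maj)),
        "disparate_impact": sel_min / sel_maj if sel_maj else float("inf"),
    }


# Hypothetical evaluation data: 1 is the favorable outcome.
groups = ["min"] * 4 + ["maj"] * 4
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

report = fairness_metrics(y_true, y_pred, groups, minority="min", majority="maj")
```

For the difference metrics a value of 0 indicates parity, and for disparate impact a value of 1 indicates parity; here every metric shows the minority group being treated less favorably.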

Under the robustness principle, execute user acceptance tests using the XAI outputs that were implemented in the previous phase. Furthermore, execute metamorphic and adversarial tests[20] to ensure the model is robust enough to be deployed. Metamorphic tests aim to assess model impact by transforming the inputs of the model and then testing the model with the augmented inputs. Adversarial testing aims to generate adversarial attacks to stress test the model.
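
A metamorphic test does not need a known correct output; it only checks that a transformed input changes the output in an expected way. The sketch below checks two such relations against a stand-in model. `toy_score`, `leaky_score`, the field names and the chosen relations are all hypothetical examples, not part of the QAIF.

```python
def toy_score(applicant):
    """Hypothetical model standing in for the system under test."""
    return 0.00001 * applicant["income"] - 0.02 * applicant["open_loans"] + 0.3


def leaky_score(applicant):
    """Deliberately flawed variant whose score depends on the applicant ID."""
    return toy_score(applicant) + 0.0001 * applicant["applicant_id"]


def check_monotonic_in_income(model, applicants, raise_by=5000):
    """Relation 1: raising income must not lower the score."""
    violations = []
    for applicant in applicants:
        follow_up = dict(applicant, income=applicant["income"] + raise_by)
        if model(follow_up) < model(applicant):
            violations.append(applicant)
    return violations


def check_invariant_to_id(model, applicants):
    """Relation 2: changing the applicant ID must not change the score."""
    violations = []
    for applicant in applicants:
        follow_up = dict(applicant, applicant_id=applicant["applicant_id"] + 1000)
        if model(follow_up) != model(applicant):
            violations.append(applicant)
    return violations


applicants = [
    {"applicant_id": 1, "income": 30000, "open_loans": 2},
    {"applicant_id": 2, "income": 55000, "open_loans": 0},
    {"applicant_id": 3, "income": 80000, "open_loans": 5},
]

monotonic_violations = check_monotonic_in_income(toy_score, applicants)
id_violations = check_invariant_to_id(toy_score, applicants)
leaky_violations = check_invariant_to_id(leaky_score, applicants)
```

The flawed model fails the invariance relation for every applicant, which is the kind of finding that belongs in the test report for this gate.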

Provide a test report to a technical reviewer and an auditor for approval before moving to the last gate.


Model Deployment (Gate 6)

Once we have a transparent and understandable model, we can enter the final phase with a focus on monitoring, real-world model performance and maintenance.

To ensure robustness, fairness and transparency, a monitoring dashboard should be set up to track model performance in production. The production performance metrics include:

  • Model performance metrics from Model Development.
  • Bias metrics from Model Evaluation.
  • Drift detection metrics like concept drift and data drift detection[21]. According to IBM, “Model drift refers to the degradation of model performance due to changes in data and relationships between input and output variables”. This can potentially be a big production issue, so it is of crucial importance to use these metrics as indicators to when the model requires attention like re-training or when the training data should be re-weighted[22].
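
As one possible data drift indicator, the sketch below compares the distribution of a single feature in production against the training data using the two-sample Kolmogorov-Smirnov statistic. The choice of statistic, the threshold of 0.3 and the sample values are illustrative assumptions. Detecting concept drift additionally requires ground-truth labels from production and is not covered here.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 = identical)."""
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    largest_gap = 0.0
    for value in ref_sorted + cur_sorted:
        # Fraction of each sample that is less than or equal to this value.
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        largest_gap = max(largest_gap, abs(ref_cdf - cur_cdf))
    return largest_gap


def has_drifted(reference, current, threshold=0.3):
    """Flag drift when the statistic exceeds an illustrative, tunable threshold."""
    return ks_statistic(reference, current) > threshold


# Hypothetical values of one feature (customer age) at different times.
training_ages = [22, 25, 29, 31, 34, 38, 41, 45, 52, 60]
same_period = [23, 26, 28, 33, 35, 37, 42, 44, 50, 61]
later_period = [48, 51, 55, 58, 60, 63, 66, 70, 72, 75]

stable = has_drifted(training_ages, same_period)
shifted = has_drifted(training_ages, later_period)
```

A dashboard could track this statistic per feature over time and raise an alert for re-training once the threshold is crossed.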

In this final phase, we can confidently check all the quality control boxes and pass through all the gates of development, but that doesn’t necessarily mean we are done with the AI lifecycle. When the model needs to be retrained or adjusted, we can always turn back and revisit any of the phases, as the AI project life cycle is an iterative process.

Figure: The practical, validation steps of each gate in the QAIF with their corresponding deliverables


Building a trustworthy AI with Quality as a Service

The QAIF does not serve only as a theoretical guide to ensuring a trustworthy and high-performing solution. Sogeti has also developed practical tools that data scientists can use alongside the QAIF and embed within AI systems. One of these tools is the Data Quality Wrapper (DQW). The DQW serves as a data quality control check to be used during the Data Understanding phase. The tool automates the EDA process by assessing tabular, image and text data, using various statistical methods and NLP techniques to assess quality. Other practical tools include the ExplainableAI Toolkit – an interpretable solution for any use case that combines techniques such as general linear rule models for tabular data, saliency maps for images and hierarchical explanations for neural networks.

With these practical tools, Sogeti can provide Quality as a Service; automatically embedding fairness and transparency into any AI model. For corporations adopting AI solutions, this should be a requirement and not a luxury.

The QAIF, along with the practical toolkits, offers a structured and comprehensive way of working to develop and implement high-performing, ethical and quality-assured solutions, helping AI teams to design, develop and operate AI systems that the public can trust.

Learn more

For more information about detecting and mitigating bias, read up on Programming Fairness into your AI model:

For implementing adversarial attacks read up on building more robust models with adversarial training:

About the authors

Tijana Nikolic

Tijana Nikolic is a Senior Data Scientist in the Sogeti Netherlands AI CoE team. Her vision is to bring innovative solutions to the market with a strong emphasis on privacy, quality, ethics, and sustainability, while enabling growth and curiosity of team members. One of those solutions is the Artificial Data Amplifier, the winner of the Sogeti Innovation of the year award for 2020. She is one of the leads of a project that focuses on validation of AI components and using AI to accelerate testing. Tijana carries the title of Young Sogetist of the Year 2020 and is part of the YS EduNite committee which focuses on educating and uniting colleagues in Sogeti.

Almira Pillay

Almira Pillay is an Artificial Intelligence Specialist in Sogeti Netherlands AI CoE. She is one of the pioneering members of the Artificial Data Amplifier solution and remains as a technical lead. Almira is responsible for the delivery of AI solutions and bringing new product innovations to the market such as CodeAssist and DevAssist. She is one of the leads of the AI for Good community which focuses on sustainable AI solutions, ethical concerns and inclusive AI. Apart from numerous blogs and whitepapers, Almira has contributed to the Capgemini Research Institute ‘AI & the Ethical Conundrum report’ which aims to address AI ethical implications.

About Sogeti

Part of the Capgemini Group, Sogeti operates in more than 100 locations globally. Working closely with clients and partners to take full advantage of the opportunities of technology, Sogeti combines agility and speed of implementation to tailor innovative future-focused solutions in Digital Assurance and Testing, Cloud and Cybersecurity, all fueled by AI and automation. With its hands-on ‘value in the making’ approach and passion for technology, Sogeti helps organizations implement their digital journeys at speed.

Capgemini is a global leader in partnering with companies to transform and manage their business by harnessing the power of technology. The Group is guided every day by its purpose of unleashing human energy through technology for an inclusive and sustainable future. It is a responsible and diverse organization of 325,000 team members in nearly 50 countries. With its strong 55-year heritage and deep industry expertise, Capgemini is trusted by its clients to address the entire breadth of their business needs, from strategy and design to operations, fueled by the fast-evolving and innovative world of cloud, data, AI, connectivity, software, digital engineering and platforms. The Group reported 2021 global revenues of €18 billion.
Get the Future You Want!