State of AI applied to Quality Engineering 2021-22
Section 4.2: Automate & Scale

Chapter 3 by Kobiton

Use AI to assess how beautiful your app is

Business ●○○○○
Technical ●●●●○

Without domain expertise in areas such as user interface (UI) design, members of the QA team struggle to assess some aspects of quality, such as the user experience. In this chapter, we examine how machine learning models can assist us in the evaluation of the user interface design, with a specific focus on mobile device rendering. These models can also be applied to other areas of testing that require domain expertise.

We’ve all installed a mobile app or brought up a website that renders incorrectly on our mobile device. The impact on the user experience varies. At the low-impact end, it is an innocuous user interface issue like overlapping text. More severely, an incorrectly rendered UI can break functionality and prevent the user from performing basic application activities, such as when the keyboard pops up and occludes the login button. At best, the user leaves with a lack of trust because the user interface looks unprofessional. In the age of app store reviews, however, UI issues that break existing functionality result in low app store ratings and negative reviews, leaving permanent stains on what may be an otherwise excellent app store presence.

It is essential to assess the mobile app user interface against this large spectrum of potential consequences, ranging from harmless to functionality-breaking. It is often said that beauty is in the eye of the beholder. In today’s mobile-driven economy, with millions of users beholding your mobile app, each with a different perspective on what is beautiful, it is vital to adhere to certain UI design principles. Unfortunately, most quality engineers do not receive adequate training on these principles, and asking them to perform a quality assurance check on the user interface design is unrealistic.

Users' perspectives vary, and so does the “lens” through which they view your mobile app. In the mobile world, this “lens” includes:

  • The device type,
  • The operating system version,
  • The language, and
  • The font size settings.

Even if you’ve tested your mobile app on a Samsung Galaxy S21 with a 2400x1080 pixel screen resolution, a user with a OnePlus 8 -- also with a 2400x1080 pixel screen resolution -- set to an increased system font size may encounter an unpleasant user experience.

We will now present how machine learning models can assist the quality engineer in performing quality checks on the user interface design and in determining what makes a rendered mobile application beautiful. We've discovered that our machine learning-based method not only improves the user interface across device types, but also begins to train the quality engineer on what to look for when visually reviewing an application's user interface design.

While the model's ultimate goal is to suggest specific UI components that should be changed to improve an app's aesthetics, we are now decomposing that goal by analyzing individual UI features. Examples of UI features are font size, viewport horizontal margins, and value contrast between text and its background color. We will use font size for explanation purposes, although we may take a similar approach for the other user interface features.

The figure below shows an example of an anomaly we are trying to detect with the Font Size Model in the REI Mobile App login screen. On the right side, the REI app renders the screen on a Samsung Galaxy Note 8 in a visually appealing manner. The left side image shows how the screen renders on a Galaxy J5 with a few visual defects: an insufficient margin to the device edge, “No thanks” text occluding the “Log in” and “Create account” buttons, and a font size that is too large. Our Font Size Model will automatically detect that the font size on the left is too large for the given screen size.

Figure: Side-by-side comparison of screen renders on Galaxy J5 (720 x 1280 pixels, 16:9 ratio) and Note 8 (1440 x 2960 pixels, 18.5:9 ratio)


Our Font Size Model is an anomaly detection model: a binomial classifier with two classes, normal and abnormal. To accurately represent the complex feature relationships that influence font size, we chose a supervised model. Supervised models require a significant amount of training data -- the more data, the better. Because manually categorizing hundreds of thousands of screens is impractical, we needed a more scalable strategy based on semi-supervised training.

Obtaining and curating training data

Our final training data set contains both normal and abnormal font sizes across the App Store apps. We start, however, with a training set of only normal font sizes and build a statistical model that accelerates the annotation process. We crawl 50 apps in the App Store across 35 devices (iOS and Android) to capture 15,034 screens and 54,393 text blocks to build the statistical model. To identify mobile apps that render normal font sizes across different screen resolutions, we chose the apps based on popularity, app store rating, and a positive “user interface” sentiment in App Store reviews. While these apps may still contain abnormal text blocks on certain devices, our statistical model and curation process reduce our data contamination risk.

The same side-by-side comparison in the figure below demonstrates how one of the selected apps renders font sizes across the same two devices used in the REI app comparison -- Galaxy J5 and Note 8. On average, the J5 font size is 10% smaller than the equivalent text on the Note 8. In contrast, the REI app’s J5 font size is 34% larger.

Figure: Side-by-side comparison of screen renders on Galaxy J5 (720 x 1280 pixels, 16:9 ratio) and Note 8 (1440 x 2960 pixels, 18.5:9 ratio)


The statistical model we create from the crawled data calculates the percentage font height difference for the same text between two device types. The figure below shows the binned distribution of J5 font height as a percentage of Note 8 font height. The mean and standard deviation when comparing text between the Note 8 and J5 are 91.87% and 4.59%, respectively. The mean and standard deviation for each combination of device types form the definition of normal for our semi-supervised training data.
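The per-pair statistics described above can be sketched as follows. This is a minimal illustration, not Kobiton's implementation: the function name, the record layout, and the sample heights are all hypothetical, and the choice of sample (rather than population) standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, stdev


def pairwise_font_stats(observations, device_a, device_b):
    """Return (mean, stdev) of device_b font height as a % of device_a's.

    observations: iterable of (text_id, device, font_height_mm) tuples.
    Only text blocks captured on both devices contribute.
    """
    heights = defaultdict(dict)
    for text_id, device, height_mm in observations:
        heights[text_id][device] = height_mm
    ratios = [
        100.0 * by_device[device_b] / by_device[device_a]
        for by_device in heights.values()
        if device_a in by_device and device_b in by_device
    ]
    # Assumption: sample standard deviation; the source does not say which.
    return mean(ratios), stdev(ratios)


# Made-up font heights in millimeters, for illustration only.
sample = [
    ("login.title", "Note 8", 4.0), ("login.title", "J5", 3.6),
    ("login.button", "Note 8", 3.0), ("login.button", "J5", 2.85),
    ("login.footer", "Note 8", 2.0), ("login.footer", "J5", 1.8),
    ("signup.title", "Note 8", 4.0),  # seen on one device only, so skipped
]
mu, sigma = pairwise_font_stats(sample, "Note 8", "J5")
print(f"mean={mu:.2f}%  stdev={sigma:.2f}%")
```

Repeating this for every pair of devices in the pool would give one (mean, standard deviation) reference per pair.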

To further accelerate the training data curation process, we create a web page for annotators to validate our statistical model results and determine the optimal ‘cut point’ that delineates normal from abnormal. This annotation page is used throughout model creation to validate our training data and model results.



Figure: J5 font size as a percentage of the Note 8


As shown in the figure above, the mean J5 font size as a percentage of the Note 8 font size is 91.9% for the 691 text blocks captured on both of these devices in our crawling sample set. Furthermore, none of the 691 text blocks render in a way that is considered abnormal upon visual inspection, not even the text block at 76%, which is more than three standard deviations from the mean. This analysis indicates that visually abnormal records require a significant deviation from the mean. When we apply the statistical values we obtain from this sample to the REI app, the difference between the Note 8 and J5 is over seven standard deviations from the mean.
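The distances quoted above can be checked with a short calculation. In this sketch the function name is hypothetical, and reading the REI app's "34% larger" J5 font as 134% of the Note 8 font is an assumption.

```python
NOTE8_J5_MEAN = 91.87   # percent, from the 50 "normal" apps
NOTE8_J5_STDEV = 4.59   # percent


def deviations_from_mean(ratio_pct, mean_pct=NOTE8_J5_MEAN, stdev_pct=NOTE8_J5_STDEV):
    """Signed distance from the reference mean, in standard deviations."""
    return (ratio_pct - mean_pct) / stdev_pct


outlier_z = deviations_from_mean(76.0)   # the most extreme "normal" text block
rei_z = deviations_from_mean(134.0)      # assumption: "34% larger" means 134%
print(f"outlier: {outlier_z:.2f} SD, REI: {rei_z:.2f} SD")
```

Under these assumptions the 76% text block sits about 3.5 standard deviations below the mean, while the REI text sits about 9 standard deviations above it, consistent with "over seven".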

The process of crawling normal text blocks across different device types has helped identify what is confidently normal. However, except for the REI example, we are no closer to determining what is considered abnormal. The analysis of the 50 normal apps provides us with a reference distribution that will reduce our labor-intensive annotation process. We run the app crawler on 200 additional random apps in the App Store, which yields 38,722 screens and 147,902 text blocks.

Figure: Comparison of the distribution between our two reference device types, J5 and Note 8


The figure above compares the distribution of the 50 “normal” apps with the distribution of the 200 random apps between our two reference device types, J5 and Note 8. The purple bins in the figure represent the binned counts of the 200 random apps. The text blocks from the randomly selected apps exhibit much higher variability than the text blocks from the 50 normal apps. All nine records outside five standard deviations rendered visually abnormal, seven on the J5 and two on the Note 8, and all records inside three standard deviations rendered visually normal. The boundary between normal and abnormal therefore lies somewhere between three and five standard deviations. Because the goal of this process was to accelerate the correct labeling of data, we chose the conservative end of that range, a ‘cut point’ of 3: any record more than three standard deviations from the mean is manually labeled. In addition, as a quality check, 5% of records between two and three standard deviations are manually labeled.
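The labeling rule above can be sketched as a small triage function. The names and return values are hypothetical, and how records exactly at two or three standard deviations are handled is an assumption, since the text does not specify it.

```python
import random

MANUAL_CUT = 3.0         # cut point chosen in the text
SPOT_CHECK_FLOOR = 2.0   # lower edge of the quality-check band
SPOT_CHECK_RATE = 0.05   # 5% of records in the band are reviewed


def triage(z_score, rng=random):
    """Decide whether a text block comparison needs a human annotator."""
    distance = abs(z_score)
    if distance > MANUAL_CUT:
        return "manual"          # outside the cut point: always reviewed
    if distance >= SPOT_CHECK_FLOOR and rng.random() < SPOT_CHECK_RATE:
        return "manual"          # random quality check inside the band
    return "auto_normal"         # labeled normal without review


rng = random.Random(0)           # seeded so the illustration is repeatable
print(triage(9.2, rng), triage(0.4, rng))
```

Only the small share of records routed to "manual" consumes annotator time, which is where the reduction in labeling effort comes from.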

The 35 device combination comparisons across the 250 crawled apps yield 206,803,972 text block comparisons. By taking this approach to curating our training data, we cut our estimated labeling time from 2,992 person-days to 30 person-days!

Our curated training data now contains 206,114,631 normal records and 689,341 abnormal records, providing sufficient training data for our model with minimal training data contamination.
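As a quick consistency check on the figures above, the two record counts can be summed and compared with the total number of comparisons. The variable names are illustrative only.

```python
total_comparisons = 206_803_972
normal_records = 206_114_631
abnormal_records = 689_341

# The two classes should account for every text block comparison.
assert normal_records + abnormal_records == total_comparisons

abnormal_share = abnormal_records / total_comparisons
labeling_saving = 1 - 30 / 2_992   # person-days after vs. before curation

print(f"abnormal share: {abnormal_share:.2%}")
print(f"labeling effort saved: {labeling_saving:.1%}")
```

The abnormal class is roughly one record in 300, and the estimated labeling effort drops by about 99%.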

Feature Selection

Now that we have a large corpus of training data, we will choose the features to train our model. While some of the features we choose may not significantly impact the model, we’ll be liberal in our feature selection and let the model do its job.

During the training data curation process, we focused on the font height variance between just two device types to identify anomalous variances. Two observations from that process inform our feature selection for the deep learning model. First, the curation process detected abnormal variances but could not determine which of the two device types was anomalous for a given text block; we relied on human annotators to make that decision. We want our trained model to be robust enough to accurately identify the culprit device type without human review. Second, if the font size rendered incorrectly on just one specific device, the curation process yielded 34 anomalies, one for each comparison with the other devices in our pool of 35. We want our trained model to identify the specific device that renders abnormally to reduce the review effort in the front-end application.
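The second observation can be illustrated with a small sketch: one misrendering device in a pool of 35 is flagged once against each of the other 34. The counting heuristic below is only an illustration of that arithmetic, not the approach the trained model takes, and the device names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def culprit_devices(flagged_pairs, n_devices):
    """Devices that were flagged against every other device in the pool."""
    counts = Counter(device for pair in flagged_pairs for device in pair)
    return [device for device, count in counts.items() if count == n_devices - 1]


devices = [f"device-{i:02d}" for i in range(35)]   # hypothetical device pool
bad = "device-07"                                  # the one that misrenders

# Every pairwise comparison that involves the bad device is anomalous.
flagged = [pair for pair in combinations(devices, 2) if bad in pair]

print(len(flagged), culprit_devices(flagged, len(devices)))
```

A reviewer would otherwise have to look at all 34 flagged comparisons to reach the same conclusion.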

As we are identifying visual anomalies, we choose predominantly visual features. Specifically, the features we choose are:

  • the rendered font size on-screen (in millimeters)
  • the screen width (in millimeters)
  • the screen height (in millimeters)
  • the screen pixels per inch
  • the x pixel coordinate of text block left-side
  • the y pixel coordinate of text block top-side

As we discuss in the next section, our model choice influences how we structure these features as input to the model.
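To make the feature list concrete, here is a minimal sketch of how one text block might be turned into a feature vector. The class names, field names, and feature order are assumptions, and the screen values are hypothetical; only the pixel-to-millimeter conversion (25.4 mm per inch) is a fixed fact.

```python
from dataclasses import dataclass

MM_PER_INCH = 25.4


@dataclass
class Screen:
    width_px: int
    height_px: int
    ppi: float            # pixels per inch


@dataclass
class TextBlock:
    font_height_px: float
    left_px: float        # x pixel coordinate of the left side
    top_px: float         # y pixel coordinate of the top side


def px_to_mm(pixels, ppi):
    """Convert an on-screen length from pixels to millimeters."""
    return pixels / ppi * MM_PER_INCH


def feature_vector(block, screen):
    """Six features, in the order they are listed in the text."""
    return [
        px_to_mm(block.font_height_px, screen.ppi),   # rendered font size (mm)
        px_to_mm(screen.width_px, screen.ppi),        # screen width (mm)
        px_to_mm(screen.height_px, screen.ppi),       # screen height (mm)
        screen.ppi,                                   # pixels per inch
        block.left_px,                                # text block left (px)
        block.top_px,                                 # text block top (px)
    ]


# Hypothetical screen and text block, chosen for round numbers.
screen = Screen(width_px=720, height_px=1280, ppi=254.0)
block = TextBlock(font_height_px=36, left_px=48, top_px=120)
print(feature_vector(block, screen))
```

Expressing sizes in millimeters rather than pixels lets the same text be compared across screens with different pixel densities.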

Model Selection

Next up is choosing the machine learning model that performs best for our data. We have plenty of model options from which to choose, so we first narrow down based on the following:

  • Confidence that the training data is correctly labeled -- specific models, like One-Class SVMs, are more resilient to data contamination;
  • Volume of training data -- deep learning models perform significantly better with a large amount of data;
  • Number of features -- higher dimensionality inputs will perform better with a deep learning model; and
  • Labeling of normal and abnormal training data -- if only the normal training data were labeled, we would be limited to specific models.

Because we have labeled a large corpus of normal and anomalous data, we can leverage a deep learning model that exploits the high dimensionality of our input data. We choose a binomial classifier, where the two classes are normal and abnormal.

As shown in the figure below, our neural network contains an input layer, two hidden layers with 32 nodes each, and an output layer. We build and train it as a Keras Sequential model. The nodes in the hidden layers use Rectified Linear Units (ReLU) as the activation function, and the output layer uses sigmoid as the activation function to decide between the normal and abnormal classes. We train using binary_crossentropy as the loss function and the Adam optimizer for 200 epochs with a batch size of 16. We also ran the training using the same model structure with the loss function set to mean_squared_error; binary_crossentropy yielded more accurate results.
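A network of the shape described above could be declared roughly as follows. This is a sketch under assumptions, not the authors' code: the input width of six (one per listed feature) is assumed, the helper names are hypothetical, and `x_train` and `y_train` are placeholders. TensorFlow is imported inside `build_model` so the parameter count can be computed without it.

```python
N_FEATURES = 6            # assumption: one input per feature listed earlier
HIDDEN_LAYERS = (32, 32)  # two hidden layers with 32 nodes each


def count_parameters(n_features=N_FEATURES, hidden=HIDDEN_LAYERS):
    """Weights plus biases of the fully connected network."""
    sizes = [n_features, *hidden, 1]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def build_model(n_features=N_FEATURES, hidden=HIDDEN_LAYERS):
    """Build and compile the classifier; requires TensorFlow to be installed."""
    from tensorflow import keras

    layers = [keras.Input(shape=(n_features,))]
    layers += [keras.layers.Dense(units, activation="relu") for units in hidden]
    layers.append(keras.layers.Dense(1, activation="sigmoid"))  # P(abnormal)
    model = keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model


print(count_parameters())

# Training as described in the text (x_train and y_train are placeholders):
# model = build_model()
# model.fit(x_train, y_train, epochs=200, batch_size=16)
```

With six inputs the network has only 1,313 trainable parameters, small compared with the volume of training data described earlier.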

Figure: Keras Sequential model



Model Training and Results

The model yields classification accuracy on the training and test data of 92.8% and 94.8%, respectively. When we dissect the accuracy (F-measure) on the test set, recall is 98.9% and precision is 95.8%.

Kobiton presents anomalous results to the user. Our model identifies all but 1.1% of the anomalous text blocks; if any of these false negatives are significant visual defects, missing them would be detrimental to trust in the Kobiton product.

On the false-positive side, 4.2% of the anomalies reported to the user are normal. The Kobiton user interface allows the user to mark these records as normal, and we then feed these corrected labels back into our model.
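The percentages above follow directly from the reported precision and recall. In the sketch below, the variable names are illustrative, and treating "F-measure" as the balanced F1 score is an assumption.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision, recall = 0.958, 0.989        # test-set values reported in the text

missed_anomaly_rate = 1 - recall        # anomalies the model fails to flag
false_alarm_share = 1 - precision       # reported anomalies that are normal
f1 = f_measure(precision, recall)

print(f"missed: {missed_anomaly_rate:.1%}, "
      f"false alarms: {false_alarm_share:.1%}, F1: {f1:.1%}")
```

Under that assumption the F1 score works out to roughly 97.3%.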

Figure: The Kobiton user interface



Our semi-supervised machine learning model for anomaly detection of font sizes across different mobile device screen resolutions performs better than expected. To further increase our model’s accuracy as we add more training data, we can increase the number of hidden layers and the number of nodes within each hidden layer. We can also fine-tune the model’s hyperparameters and experiment with the structure of each node.

We have started applying similar techniques to twenty-eight additional user interface features like text alignment and margin to the screen edge. When we complete implementing all of these models, we will be able to run any app against Kobiton and recommend the specific changes to make that app beautiful.

About the author

Frank Moyer

Frank is a 25-year technology industry veteran with a track record of building value in startups and exiting successfully. As CTO of Kobiton, Frank sets the product and technology direction for the company.

Prior to Kobiton, Frank served as CEO of GeoIQ, a geospatial analytics company founded in 2004 serving commercial and government markets. Under Frank’s leadership, the company tripled its revenue prior to being acquired by its biggest competitor, ESRI. Prior to GeoIQ, Frank was CEO of EzGov, one of the pioneering companies for government online transactions, founded in 1999. Frank grew the company from 2005 to 2009 to $28 million in revenue and $7 million in operating profit. EzGov was sold to CACI in June 2009. Prior to that, Frank was an executive within Accenture’s Technology Practice. Frank holds a B.A. in Economics and a B.S. in Computer Science from Columbia University in New York.

About Kobiton

Kobiton is mobile device testing for quality-obsessed customers. Our fleet of real devices and flexible deployment options help QA and development teams create the perfect mobile experiences for their end-users. At Kobiton, we understand how important customer experience is to brand perception, and want nothing more than to help companies reduce app abandonment and accelerate delivery while increasing the joy users experience with their product. Whether you need to expand device coverage with access to our public cloud, tame device chaos with our private and local cloud, or simply increase your team's bandwidth using our automated health check feature, Kobiton has something for every team striving to create a memorable CX.

Visit us at