Our final training data set contains both normal and abnormal font sizes drawn from App Store apps. We start, however, with a training set of only normal font sizes and build a statistical model that accelerates the annotation process. We crawl 50 App Store apps across 35 devices (iOS and Android), capturing 15,034 screens and 54,393 text blocks to build the statistical model. To identify apps that render normal font sizes across different screen resolutions, we chose them based on popularity, app store rating, and positive “user interface” sentiment in App Store reviews. While these apps may still contain abnormal text blocks on certain devices, our statistical model and curation process reduce the risk of contaminating the training data.
The side-by-side comparison in the figure below shows how one of the selected apps, booking.com, renders font sizes across the same two devices used in the REI app comparison: the Galaxy J5 and the Note 8. On average, J5 font sizes are 10% smaller than the equivalent text on the Note 8. In contrast, the REI app’s J5 font sizes are 34% larger.
Figure: Side-by-side comparison of booking.com on the Galaxy J5 and Note 8
The statistical model we create from the crawled data calculates the percentage font height difference for the same text between two device types. The figure below shows the binned distribution of J5 font heights as a percentage of the Note 8. The mean and standard deviation when comparing text between the Note 8 and J5 are 91.87% and 4.59%, respectively. The mean and standard deviation for each pairwise device comparison form our definition of normal for the semi-supervised training data.
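As a concrete illustration, here is a minimal sketch of the per-pair statistics, assuming the crawled text blocks are collected in a pandas DataFrame; the column names (app, screen, text_id, device, font_height_px) are hypothetical, not our production schema.

```python
import pandas as pd

def pair_ratio_stats(df: pd.DataFrame, device_a: str, device_b: str) -> tuple[float, float]:
    """Mean and std of device_a font height as a percentage of device_b,
    joined on the same text block (app, screen, text_id)."""
    keys = ["app", "screen", "text_id"]
    a = df[df["device"] == device_a].set_index(keys)["font_height_px"]
    b = df[df["device"] == device_b].set_index(keys)["font_height_px"]
    # Align on the shared text blocks; drop blocks seen on only one device.
    ratios = (a / b).dropna() * 100.0
    return ratios.mean(), ratios.std()

# On our crawled sample, pair_ratio_stats(blocks, "Galaxy J5", "Galaxy Note 8")
# returns roughly (91.87, 4.59).
```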
To further accelerate training data curation, we create a web page where annotators validate our statistical model results and determine the optimal ‘cut point’ separating normal from abnormal. This annotation page is used throughout model creation to validate both our training data and our model results.
Figure: J5 font size as a percentage of the Note 8
As shown in the figure above, the mean J5 font size as a percentage of the Note 8 font size is 91.9% for the 691 text blocks captured on both devices in our crawling sample set. Furthermore, none of the 691 text blocks render in a way that is considered abnormal upon visual inspection, not even the text block at 76%, which sits more than three standard deviations from the mean. This analysis indicates that visually abnormal records require a significant deviation from the mean. When we apply the statistical values we obtain from this sample to the REI app, the difference between the Note 8 and J5 is over seven standard deviations from the mean.
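In other words, the check reduces to a z-score against the reference distribution. The helper below is a hypothetical sketch using the Note 8 vs. J5 values quoted above.

```python
def deviations_from_normal(ratio_pct: float, mu: float = 91.87, sigma: float = 4.59) -> float:
    """How many standard deviations a J5/Note 8 font ratio sits from the normal mean."""
    return abs(ratio_pct - mu) / sigma

# The REI app renders the same text ~34% larger on the J5 (ratio of ~134%),
# so deviations_from_normal(134.0) is ~9.2 -- well over the seven noted above.
```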
The process of crawling normal text blocks across different device types has identified what is confidently normal; however, except for the REI example, we are no closer to determining what is abnormal. The 50-app analysis does give us a reference distribution that substantially reduces the labor-intensive annotation process. We run the app crawler on 200 additional randomly selected App Store apps, which yields 38,722 screens and 147,902 text blocks.
Figure: Distribution of the 50 normal apps vs. the 200 random apps, J5 font size as a percentage of the Note 8

The figure above compares the distribution of the 50 “normal” apps with the distribution of the 200 random apps between our two reference device types, the J5 and Note 8. The purple bins represent the binned counts of the 200 random apps, whose text blocks exhibit much higher variability than those of the 50 normal apps. All nine records outside five standard deviations rendered as visually abnormal on the J5 (seven) or the Note 8 (two), and all records inside three standard deviations rendered as visually normal. So we knew our starting ‘cut point’ should lie somewhere above three and below five standard deviations. As the goal of this process is to accelerate the correct labeling of data, we settled on a conservative ‘cut point’ of three: any record more than three standard deviations from the mean is manually labeled. In addition, as a quality check, 5% of records between two and three standard deviations are manually labeled.
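This routing rule can be expressed in a few lines. The sketch below is illustrative, not production code: select_for_manual_labeling is a hypothetical helper, and the thresholds mirror the process above, with everything beyond three standard deviations hand-labeled plus a 5% audit of the two-to-three band.

```python
import pandas as pd

def select_for_manual_labeling(ratios_pct: pd.Series, mu: float, sigma: float,
                               audit_frac: float = 0.05, seed: int = 0) -> pd.Index:
    """Return the text-block ids routed to human annotators."""
    z = (ratios_pct - mu).abs() / sigma
    outside = ratios_pct[z > 3].index                  # always hand-labeled
    band = ratios_pct[(z > 2) & (z <= 3)]              # quality-check band
    audit = band.sample(frac=audit_frac, random_state=seed).index  # 5% spot check
    return outside.union(audit)
```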
The 35 device-combination comparisons across the 250 crawled apps yield 206,803,972 text block comparisons. By taking this approach to curating our training data, we cut the estimated labeling time from 2,992 person-days to 30 person-days!
Our curated training data now yields 206,114,631 normal records and 689,341 abnormal records, providing sufficient training data for our model with minimal contamination.