August 18, 2025
R&D Project Manager
At the heart of this challenge is a simple fact: transformer-based models like BERT and RoBERTa can't handle long web pages without cutting corners. But what if there were a smarter way to help them do just that, without adding computational overhead or reinventing the architecture?
That’s exactly what we set out to solve.
Models like BERT and RoBERTa have revolutionized NLP, but they have a hard cap: they can only process inputs up to 512 tokens. For web pages — which often contain thousands — this means truncating content and losing context. Some recent models (like Longformer or BigBird) were designed to handle longer texts, but they come with high computational cost and longer training times.
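To make the truncation problem concrete, here is a minimal sketch. A naive whitespace tokenizer stands in for a real subword tokenizer, and the constant below simply mirrors the 512-token cap mentioned above; the numbers are illustrative, not taken from any particular model's tokenizer.

```python
MAX_TOKENS = 512  # hard input cap for BERT/RoBERTa

def truncate(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Naive whitespace tokenization as a stand-in for a real subword
    tokenizer; everything past the cap is silently dropped."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

page = "word " * 3000  # a long web page of roughly 3,000 tokens
kept = truncate(page)
print(len(page.split()), "->", len(kept.split()))  # 3000 -> 512
```

In practice a subword tokenizer inflates the token count even further, so the share of a page that survives truncation is often smaller than this sketch suggests.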
For companies that rely on real-time classification or need to scale across thousands of domains, these solutions aren’t always practical.
Rather than modifying the model architecture, we took a data-centric approach. We created a lightweight preprocessing technique that works with existing transformer models and boosts their performance on long web pages.
In short, it lets BERT and RoBERTa handle longer content more effectively while keeping training and inference times reasonable.
We tested this approach on a real-world dataset of over 3,000 websites across 10 categories, where WSSA delivered consistent accuracy gains over traditional data splitting methods.
Even more interesting: our lightweight setup outperformed specialized long-document transformers like Longformer and BigBird in many cases — especially when evaluating both the index page and surrounding web pages.
For enterprise teams working on large-scale web and content classification, this approach offers a high-accuracy, resource-efficient solution that scales.
By leveraging the WSSA strategy, businesses can continue using reliable transformer models (like BERT) without needing heavy infrastructure upgrades or long retraining cycles. It’s a plug-and-play boost for classification performance, ideal for teams looking to do more with less.
We’re actively exploring how this approach extends beyond websites to domains like document classification, legal text analysis, and even long-form chat transcripts.
As models get smarter and data gets longer, it’s clear that smart preprocessing can be just as powerful as smart architectures. And with WSSA, we’re bringing that power to the forefront — helping businesses unlock the full potential of AI, chunk by chunk.