August 18, 2025
R&D Project Manager
At the heart of this challenge is a simple fact: transformer-based models like BERT and RoBERTa can't handle long web pages without cutting corners. But what if there were a smarter way to help them do just that, without adding computational overhead or reinventing the architecture?
That’s exactly what we set out to solve.
Models like BERT and RoBERTa have revolutionized NLP, but they have a hard cap: they can only process inputs up to 512 tokens. For web pages — which often contain thousands — this means truncating content and losing context. Some recent models (like Longformer or BigBird) were designed to handle longer texts, but they come with high computational cost and longer training times.
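To make the truncation problem concrete, here is a minimal sketch. A naive whitespace tokenizer stands in for a real subword tokenizer, and the constant below simply mirrors the 512-token cap mentioned above; the numbers are illustrative, not taken from any particular model's tokenizer.

```python
MAX_TOKENS = 512  # hard input cap for BERT/RoBERTa

def truncate(text: str, max_tokens: int = MAX_TOKENS) -> str:
    """Naive whitespace tokenization as a stand-in for a real subword
    tokenizer; everything past the cap is silently dropped."""
    tokens = text.split()
    return " ".join(tokens[:max_tokens])

page = "word " * 3000  # a long web page of roughly 3,000 tokens
kept = truncate(page)
print(len(page.split()), "->", len(kept.split()))  # 3000 -> 512
```

In practice a subword tokenizer inflates the token count even further, so the share of a page that survives truncation is often smaller than this sketch suggests.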
For companies that rely on real-time classification or need to scale across thousands of domains, these solutions aren’t always practical.
Rather than modifying the model architecture, we took a data-centric approach. We created a lightweight preprocessing technique that works with existing transformer models and boosts their performance on long web pages.
In short, it lets BERT and RoBERTa handle longer content more effectively while keeping training and inference times reasonable.
We tested this approach on a real-world dataset of over 3,000 websites across 10 categories, where WSSA delivered consistent accuracy gains over traditional data splitting methods.
Even more interesting: our lightweight setup outperformed specialized long-document transformers like Longformer and BigBird in many cases — especially when evaluating both the index page and surrounding web pages.
For enterprise teams working on large-scale web and content classification, this approach offers a high-accuracy, resource-efficient solution that scales.
By leveraging the WSSA strategy, businesses can continue using reliable transformer models (like BERT) without needing heavy infrastructure upgrades or long retraining cycles. It’s a plug-and-play boost for classification performance, ideal for teams looking to do more with less.
We’re actively exploring how this approach extends beyond websites to domains like document classification, legal text analysis, and even long-form chat transcripts.
As models get smarter and data gets longer, it’s clear that smart preprocessing can be just as powerful as smart architectures. And with WSSA, we’re bringing that power to the forefront — helping businesses unlock the full potential of AI, chunk by chunk.