Automatic Detection of Mislabeled Language

If you want to create a more accessible internet, you must make sure that screen readers and assistive technologies speak the right “lang.”

At Evinced, we’ve built a new language-attribute validation that helps by detecting mislabeled language across all of a website’s text.

Why “Lang” Matters

In HTML, the lang attribute specifies the language of the content in a given element. Most familiarly, it’s declared for the entire page right at the start:

<html class="main" lang="en">

But it can also be declared for an individual element, like a paragraph:

<p lang="en">The little brown fox jumped and how.</p>

Either way, it’s important for screen readers and other assistive technologies, as it ensures content can be pronounced properly when read aloud to the user. 

To specify that a paragraph is in French, for example, the lang attribute must be set to “fr”, and screen reader software will then presume the text that follows is French:

<p lang="fr">Bonjour, comment ça va?</p>

What happens when the wrong lang attribute is assigned to text? In the example above, there’s a big pronunciation difference between “comment” in English and in French, and an English speaker wouldn’t know a cedilla (the “¸” in “ça”) from a porcupine. In general, a mislabeled lang attribute can drastically degrade the clarity of what the screen reader reads aloud.

Let’s look at some real-world examples. Here, we’ve recorded the output from screen reader software on the same text, both when the lang attribute is misspecified and when it’s specified correctly:

Example 1. English Text Mis-Specified as French

Example 2. The Same English Text Correctly Specified as English

Not a small difference, right? If accessibility is what you’re after, it’s essential that the lang attributes be specified correctly. It’s both common sense and called for directly in the Web Content Accessibility Guidelines.

The Most Common Language Mislabelings

The guidelines are clear, yet two common language-labeling mistakes keep showing up when developers code pages in different languages.

Often a page will be written in one language but labeled as a different one. For example, a company might have an “About Us” page in French, but the language is mistakenly declared as English in the HTML.

Another common error is a page that is mostly in one language but has text in another dotted throughout. Let’s say an “About Us” page is in French, and French is the language (correctly) declared for the page in the HTML, yet scattered words and phrases remain in English, such as “Terms of Use,” “Cookie Policy,” or “Chat with Sales.” These standard terms are often baked into the code early on and overlooked when pages are coded in new languages. It’s a recipe for screen reader confusion.

Our Goal: Detect WCAG Language Violations

At Evinced, we know mislabeled language attributes are common mistakes, so we’ve built a way to detect these WCAG language violations. We call it language-attribute-mismatch.

That isn’t to say it was easy. In fact, it was anything but.

We had to overcome a massive technical challenge: crawling and collecting website text data, scrubbing it of information that’s irrelevant to screen readers, and labeling each text with its inherited lang attribute (its HTML-assigned language). The hardest part was detecting the actual language of the text, since the HTML itself may have specified the wrong one.
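
A minimal sketch of that labeling step, in Python, using only the standard library’s html.parser; the sample markup and class name are our own illustration, not Evinced’s production code:

# Label every piece of page text with the lang it inherits from its ancestors.
# Illustrative sketch only; void elements such as <br> are not handled.
from html.parser import HTMLParser

class LangLabeler(HTMLParser):
    def __init__(self):
        super().__init__()
        self.lang_stack = ["unknown"]   # top of the stack = lang in effect right now
        self.labeled_texts = []         # (inherited_lang, text) pairs

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        # A lang on this element overrides the inherited one; otherwise keep inheriting.
        self.lang_stack.append(lang or self.lang_stack[-1])

    def handle_endtag(self, tag):
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.labeled_texts.append((self.lang_stack[-1], text))

page = """
<html lang="fr">
  <body>
    <p>Bienvenue sur notre page « À propos ».</p>
    <a href="/terms">Terms of Use</a>
  </body>
</html>
"""

labeler = LangLabeler()
labeler.feed(page)
for lang, text in labeler.labeled_texts:
    print(lang, "->", text)
# "Terms of Use" comes out labeled "fr" -- exactly the kind of text whose actual
# language still needs to be detected.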

Luckily, these types of challenges aren’t new to our team; mislabeled HTML is fairly common. We know how to check text both for how it functions and for whether it’s labeled correctly, so that the WCAG guidelines are met.

The How: Natural Language Processing

How do we do it—what algorithm finds the actual language of a text? 

Detecting language is a difficult task that falls into the natural language processing (NLP) domain. NLP is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language. It involves developing algorithms and models that analyze and interpret natural-language data, and it results in systems and software that allow computers to read, understand, and generate human language. The products of NLP are all around us, from bot detection and email spam filtering to Siri on the iPhone and smart home devices.

So you might wonder whether there’s a way to leverage existing technologies for our purposes.

For example, you’re probably familiar with Google Translate and know that if you type “Où est la tour Eiffel?” into that tool, it will usually recognize what you type as French. So, might there be some API from Google that could have been used here?

The answer is yes and no. 

Yes, there are third-party APIs from Google and others that can detect and translate languages, but they have significant shortcomings for a task like ours. 

  • They have problems with names, addresses, and professional vocabulary. 
  • In our analysis, existing APIs seem skewed toward predicting “English” more often than other languages.
  • They can pose privacy and security challenges, since they require sending the text in question to third-party servers. 

The Challenge of False Positives

There have been significant technological advancements over the last few years, and today many sophisticated (often open-source) machine-learning methods can help create natural language processing applications. Still, existing language models make mistakes, and as with, say, Optical Character Recognition (OCR) systems in the past, those mistakes can add up to be costly. Even a small error rate can produce a huge number of problems on a typical website.
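
To see what this looks like in practice, here’s an off-the-shelf, open-source detector at work. We use langdetect purely as an illustration; it isn’t necessarily one of the models behind our product.

# Illustrative only: langdetect is one widely used open-source detector.
from langdetect import detect_langs, DetectorFactory

DetectorFactory.seed = 0  # the detector is probabilistic; fix the seed for repeatable output

for text in ["Bonjour, comment ça va?", "Terms of Use", "Chat with Sales"]:
    print(text, "->", detect_langs(text))  # ranked guesses with probabilities
# Long, natural sentences are detected reliably; short boilerplate strings are
# where off-the-shelf detectors are most likely to slip.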

To understand the full scale of this issue, it’s important to first understand that NLP models scan web pages looking for errors in “texts,” a term developers use to mean a word, a sentence, or even a paragraph.

Now, let’s say we run a model that generates an average of one mistake (one mislabeled text) per webpage, where each page averages 100 “texts.” That’s 99% accuracy. But many websites contain tens of thousands of pages. What would happen then? Even a 1% error rate would yield tens of thousands of false positives. That’s a distraction that might keep a company from solving real accessibility issues.
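
Spelled out as arithmetic (the 20,000-page figure is just an assumption for illustration):

texts_per_page = 100    # average "texts" per page, as above
error_rate = 0.01       # a 99%-accurate model
pages = 20_000          # "tens of thousands of pages" (assumed figure)

print(pages * texts_per_page * error_rate)  # 20000.0 spurious findings to wade through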

Our Solution

We took an unorthodox approach. To reduce false positives, we employed two different language models simultaneously. Our team leveraged models that use both older and newer recognition methods, from neural networks to advanced statistical analysis. Normalizing the output from two different models requires a ton of extra work, but we felt the upside was worth it.
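
In rough outline, the idea looks something like the sketch below. The two libraries shown, langdetect and langid, are stand-ins for two independent models rather than what we actually ship, and real normalization (mapping regional variants to one code set, calibrating confidences, handling texts one model rejects) goes well beyond this.

# Illustrative two-model vote; langdetect and langid are stand-ins here.
from langdetect import detect_langs
from langid.langid import LanguageIdentifier, model

# norm_probs=True makes langid return a 0-1 confidence, comparable to langdetect's.
langid_identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

def agreed_language(text, min_confidence=0.9):
    """Report a language only when both models agree on it with high confidence."""
    best = detect_langs(text)[0]                    # langdetect's top guess
    lang_a, conf_a = best.lang, best.prob
    lang_b, conf_b = langid_identifier.classify(text)
    if lang_a == lang_b and min(conf_a, conf_b) >= min_confidence:
        return lang_a
    return None  # disagreement or low confidence: stay silent rather than guess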

In addition to using two models, we included innovative heuristics tailor-made for our task. Essentially, we developed an entire software layer that checks all of our results, filters out potential false positives, and resolves disagreements between the two models.
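
A drastically simplified sketch of that layer; the specific rules are our own illustration, not the actual heuristics, and it assumes the agreed_language helper from the previous sketch:

def should_report_mismatch(text, declared_lang, detected_lang):
    # detected_lang is the two-model verdict (agreed_language above, or None).
    words = text.split()
    if detected_lang is None:                       # models disagreed or were unsure
        return False
    if not any(ch.isalpha() for ch in text):        # prices, dates, phone numbers
        return False
    if "@" in text or text.startswith(("http://", "https://")):
        return False                                # emails and URLs have no spoken language
    if len(words) == 1 and words[0][:1].isupper():  # a lone capitalized word: likely a name or brand
        return False
    # Compare only the primary subtag, so "en" still matches a declared "en-US".
    return detected_lang != declared_lang.split("-")[0].lower()

# On a page declared lang="fr", English boilerplate should be flagged:
# should_report_mismatch("Terms of Use", "fr", agreed_language("Terms of Use"))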

What’s more, we did this without overspecifying the model. This also required a fair bit of innovation, but it means we’ve created a general-purpose tool that will work for any of our customers, regardless of the type of content or website. 

The Result

The result is some very good news. We’ve tested our approach extensively, across thousands of websites and tens of thousands of scans. It turns out that all of our extra refinements are effective. Accuracy is outstanding, and false positives are dramatically fewer than in traditional, single-model approaches. 

If you’re a customer with content in multiple languages (whether in part or whole), our approach is already helping make your web pages much more accessible. And if you’re a potential customer, feel free to get in touch.