How a Scots Wikipedia scandal highlighted AI’s data problem

A laptop screen displays the Wikipedia homepage.

Most of the English-language technology you use on a daily basis (voice assistants, spell checkers, translation tools, search functions) shares a common origin story: it's built using AI language models, and many of those models are trained on millions of Wikipedia articles.
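To see why Wikipedia matters so much here, consider how easily its text can be pulled into a training corpus. Below is a minimal Python sketch using the public MediaWiki API; the article title is just an illustrative placeholder, and real training pipelines typically work from bulk database dumps rather than one page at a time.

```python
# A minimal sketch of the kind of pipeline that feeds Wikipedia text into
# language-model training data. Real pipelines use bulk dumps; this uses the
# live MediaWiki API for illustration.
import requests

def fetch_article_text(lang: str, title: str) -> str:
    """Fetch the plain text of one article from a given Wikipedia language edition."""
    resp = requests.get(
        f"https://{lang}.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "format": "json",
            "prop": "extracts",   # TextExtracts: returns the article body
            "explaintext": 1,     # as plain text, with markup stripped
            "titles": title,
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    # Results are keyed by internal page ID, so take the first (only) page.
    return next(iter(pages.values())).get("extract", "")

# If the "sco" edition is full of fake Scots, anything trained on this
# corpus inherits the fake Scots. ("Scotland" is an example title.)
corpus = [fetch_article_text("sco", "Scotland")]
```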

But a bizarre discovery made this week by a Scottish Reddit sleuth has highlighted a worrisome problem for that data pipeline: most of the Scots-language edition of Wikipedia was written by an American teenager who doesn't actually speak the language. Instead, the teen wrote tens of thousands of articles in English rendered in a put-on Scottish accent, ignoring actual Scots grammar and vocabulary.

For a low-resource language like Scots, which has few digital archives of written text to draw from, the discovery could mean that some models base their entire understanding of the language on the phony version found on the Scots Wikipedia. That, in turn, limits native speakers' access to tech tools in their own language.

“I don’t think people necessarily realize how important Wikipedia is for training all of our language technologies,” said David Yarowsky, a computer science professor at Johns Hopkins University. “When these problems crop up, it really is impacting our ability to do a high quality job on the technologies that these communities want.”

The Scots Wikipedia is an unusual example: a misguided editor dedicated a decade to writing out articles in what they believed to be (but absolutely was not) genuine Scots. More often, Yarowsky says, the issue is that Wikipedia editors fill out a language edition with machine-translated text that is never corrected by fluent speakers. The second-largest Wikipedia edition after English, the Cebuano edition, which caters to just 16 million speakers primarily in the Philippines, was almost entirely written by a single bot. And unlike Scots speakers, Cebuano speakers aren't likely to have language AI tools available in another language they speak just as well.

A bot-based translation approach can create a vicious cycle when future algorithms use those Wikipedia pages as training data. “You’re basically learning from a bad version of what you already have,” Yarowsky said. “If we train machine translation systems on machine translation output, we are just reinforcing existing failings like an echo chamber.”
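Yarowsky's echo-chamber point can be illustrated with a toy simulation. The sketch below is not any real translation system; it just shows how, once each generation of "model" learns from the previous generation's output, small errors accumulate rather than wash out.

```python
# A toy illustration (not a real MT system) of the echo-chamber effect:
# each round, a new "model" is trained on the previous round's output,
# so corruption compounds instead of being corrected.
import random

random.seed(0)

def corrupt(sentence, error_rate=0.05):
    """Simulate imperfect machine translation by randomly mangling words."""
    return [w if random.random() > error_rate else w.upper() + "?" for w in sentence]

genuine = "the quick brown fox jumps over the lazy dog".split()
corpus = genuine

for generation in range(5):
    corpus = corrupt(corpus)  # "train" the next model on the last model's output
    errors = sum(1 for a, b in zip(corpus, genuine) if a != b)
    print(f"generation {generation + 1}: {errors}/{len(genuine)} words corrupted")
```

Run it and the error count never goes down: once a word has drifted from the genuine text, no later generation has access to the original to repair it, which is the "echo chamber" in miniature.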

Shoddy Wikipedia pages can also make it harder to verify that language algorithms actually work. Jeanna Matthews, a computer science professor at Clarkson University, says that models are often tested against Wikipedia data, meaning that AI tools built for languages with high-quality entries, like English, keep improving, while others don’t.

“The people who develop these tools can work out the bugs and kinks for languages that are well-represented,” she said. “It’s a snowball effect, and the advantages for these languages get further amplified.”

There is, however, a clear way to break the cycle: “One of the best things that the community can do is write a lot in their language and post it online, whether as news or stories or Wikipedia articles or discussion forums,” Yarowsky said. That way language researchers will have good text data from fluent speakers to incorporate into their models.

Scots speakers have begun to organize to do just that. Seventy-four people have joined a new "Scots Wikipedia editors" Facebook group, and they've scheduled their first "editathon," set for Aug. 30, to begin rewriting articles.

 
