Why There Will Never Be a List of Every Word

This afternoon, @everyword will send its final missive. The Twitter account has published what it claims to be every word in the English language, one tweet per word, since 2007. It’s now counting the final Zs, and the rest will be silence.

Twitter is now crowded with bots. There’s one that mashes headlines together and another that posts pictures from the Metropolitan Museum of Art and a third that finds tweets that are anagrams of each other. Last summer, a famous bot was revealed not to be a bot at all, and the New Yorker took note.

But last year, the New York Review of Bots implied @everyword was the very best bot. And now it’s ending.

Ruth Spencer of the Guardian recently talked to its creator, the programmer and poet Adam Parrish. “Where,” she asked, “does the library of words you use come from?”

Parrish’s reply, I think, gets to what makes @everyword so interesting:

I honestly don't remember. It's a list of words that I downloaded from a website somewhere. It's not the OED. One of the purposes of @everyword is to raise the question of whether it's possible to have a canonical list of the English language. To me, the obvious answer is no. We come up with new words all the time. We have rules about what can and cannot be words and linguists don't know where to draw the line any more.

Parrish has also written his own @everyword postmortem. He says he hopes to run a “Season 2” for the bot with a more complete word list.

Of course, giving the bot a larger vocabulary will only intensify its existential question. The German mathematician Georg Cantor proved that there could be larger and smaller infinities that were both, still, infinities; a more prolix Every Word will only make its lacunae more noticeable.

Make that list longer and longer, and you will run into an old problem: No one’s sure what a word is, exactly.

For over a century, some scholars have claimed that Shakespeare used more words than any other writer, that his vocabulary dwarfed those of his fellow English speakers. The number of words he deployed, some insisted, was even double that of modern-day speakers. In 1986, a famed and Emmy-winning PBS documentary, The Story of English, alleged: “Shakespeare had one of the largest vocabularies of any English writer, some 30,000 words. Estimates of an educated person’s vocabulary today vary, but it is probably about half this, 15,000.”

Could that be true? It depends what you mean by vocabulary. As Ward Elliott and Robert Valenza write in their paper, “Shakespeare’s vocabulary: did it dwarf all the others?”, there are three different ways to cut up a text into its words. (They cite Marvin Spevack’s important studies into this issue, which were among the first to use a computer.)

Of the 884,647 tokens in the Riverside Shakespeare corpus, a computer counts 29,066 “types”—that is, different kinds of collections of letters. This machine-counting doesn’t account for the common alternate spellings of Shakespeare’s day, like wreck and wrack, or murder and murther, nor does it separate plurals and conjugated forms from their more common roots. Therefore, horse and horses are two different words, as are run and running.
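To make the token and type distinction concrete, here is a minimal sketch in Python. It is not Spevack's method or the one Elliott and Valenza used; the function name, the regular expression, and the sample sentence are all assumptions chosen for illustration. It counts every running word as a token and every distinct spelling as a type, so horse and horses land in separate buckets, as do run and running.

```python
import re
from collections import Counter


def count_tokens_and_types(text):
    """Return (token_count, type_count) for a piece of text.

    A token is every running word; a type is a distinct spelling.
    Hypothetical helper for illustration only.
    """
    # Lowercase the text, then pull out runs of letters
    # (an apostrophe is allowed inside a word, as in "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Counter keys are the distinct spellings, i.e. the types.
    types = Counter(tokens)
    return len(tokens), len(types)


# Illustrative sample, not drawn from the Riverside corpus.
sample = "A horse! A horse! My kingdom for a horse! The horses run; he is running."
token_count, type_count = count_tokens_and_types(sample)
print(token_count, type_count)
```

Because the counter compares spellings only, it has no way of knowing that two types share a root, which is the limitation described above.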

That’s because computers, at least in the late 1960s, when Spevack was conducting his study, could only distinguish “types” like those. That horse and horses shared a root meant nothing to them. To count root words, which are sometimes called lemmas, the two scholars had to rely either on hand-counts or on the common estimate that a vocabulary not yet lemmatized is two-thirds larger than one that uses only root words.

What’s Shakespeare’s lemmatized vocabulary, then? Both long-respected hand-counted efforts and a mathematical estimation return the same answer: He used between 17,000 and 18,000 root words.
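The mathematical estimate can be checked with a few lines of arithmetic. This is a sketch under one assumption: that "two-thirds larger" means the type count equals the lemma count multiplied by 1 + 2/3. The variable names are illustrative, not taken from the paper.

```python
# Type count reported for the Riverside Shakespeare corpus.
TYPES_COUNTED = 29_066

# Assumed reading of the rule of thumb: types = lemmas * (1 + 2/3).
INFLATION = 2 / 3

estimated_lemmas = TYPES_COUNTED / (1 + INFLATION)
print(round(estimated_lemmas))
```

Under that reading the estimate comes to roughly 17,400, which falls inside the range of 17,000 to 18,000 root words.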

This count may still be incorrect. Spevack’s machine reading can’t account for homographs, words like spring or bear that can function as nouns or verbs and have many more definitions after that. It also doesn’t track two-token words, like grown up, where types combine to create a new definition. Finally, and this is the largest misestimation of all, it doesn’t account for words that Shakespeare knew but never wrote in a play. Such a challenge engrosses Elliott and Valenza for much of their paper. They conclude, in the end, that Shakespeare’s total vocabulary… is just about the same size as or smaller than that of a “run-of-the-mill college-educated modern.”

Look at @everyword, and you can see that its 109,000 tweets aren’t lemmatized. In its Elysian Fields, a single…

gallops among…

and it’s not just doing it to…

but because it likes…

And the point Parrish wants to make about language is a little different, too. At @everyword’s current rate of one tweet every 30 minutes, no starter list can stay up-to-date for the years and years it would take to complete the English language. Language is much too protean for that.
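For a rough sense of the timescale, the sketch below converts a word count and a posting interval into years. It assumes an uninterrupted schedule and a 365.25-day year; the function name is hypothetical, and the bot's real history may have included pauses.

```python
def years_to_tweet(word_count, minutes_per_tweet=30):
    """Years needed to post word_count tweets at a fixed interval.

    Assumes no downtime and a 365.25-day year (illustrative only).
    """
    total_minutes = word_count * minutes_per_tweet
    total_days = total_minutes / (60 * 24)
    return total_days / 365.25


# The roughly 109,000 tweets mentioned above, at one every 30 minutes.
print(round(years_to_tweet(109_000), 1))
```

At that pace, 109,000 tweets take a little over six years of continuous posting, and the time grows in proportion to the length of the list.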




