Is commonsense knowledge already captured by pre-trained language models?

In the last three years, language models have become ubiquitous in NLP. Language models are pre-trained once, in a self-supervised manner that requires only a large text corpus. Traditionally, language models are trained to predict the next word in a sentence (top part of Figure 2, in blue), but they can also predict hidden (masked) words in the middle of the sentence, as in Google’s BERT model (top part of Figure 2, in orange). This pre-training phase yields a function that takes a sequence of words (a sentence or a short paragraph) and returns a vector for each word in the sequence.

Figure 2: Language models pre-training and fine-tuning.
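
As a rough illustration, here is what the left-to-right (“predict the next word”) objective looks like in practice, a minimal sketch using the Hugging Face `transformers` library; GPT-2 and the prompt are illustrative choices, not the setup in the figure. The masked-word variant is used in later examples.

```python
# Minimal sketch of the left-to-right ("predict the next word") objective,
# using GPT-2 as an illustrative pre-trained language model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prefix = "The musician plays a musical"
inputs = tokenizer(prefix, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, vocab_size)

# Distribution over the vocabulary for the *next* word after the prefix.
next_token_probs = logits[0, -1].softmax(dim=-1)
top = torch.topk(next_token_probs, k=5)
print([tokenizer.decode(int(i)) for i in top.indices])  # e.g. " instrument", ...
```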

As opposed to word embeddings which are static, language model-based word vectors are dynamic and re-computed for each context. At the very basic level, they assign different vectors to words when they are used in different senses, as in Figure 3.

Figure 3: Static vs. dynamic word representations.
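
To make the “dynamic vectors” point concrete, here is a minimal sketch (an assumed setup, not from the post) that extracts BERT’s contextual vector for the same word in different contexts; the sentences and the choice of `bert-base-uncased` are illustrative.

```python
# The same surface word ("bank") gets a different vector in each context.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def word_vector(sentence, word):
    """Return the contextual vector of the first occurrence of `word`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, hidden_dim)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (enc["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

v_river = word_vector("He sat on the bank of the river.", "bank")
v_money1 = word_vector("She deposited the cash at the bank.", "bank")
v_money2 = word_vector("The bank approved my loan.", "bank")

cos = torch.nn.functional.cosine_similarity
# The two financial senses should be closer to each other than to the river sense.
print(cos(v_money1, v_money2, dim=0).item(), cos(v_river, v_money1, dim=0).item())
```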

Do off-the-shelf pre-trained language models already capture commonsense knowledge?

:white_check_mark: They are capable, to some extent, of completing partial commonsense facts or ranking candidate facts. For example, the language model score (≈ statement plausibility) of a fact like “a musician plays a musical instrument” is higher than that of “a dancer plays a musical instrument”. This suggests that, in addition to lexical and syntactic knowledge, language models capture some general knowledge about the world.
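
A quick, hedged sketch of what “language model score as plausibility” might look like in code: perplexity under GPT-2 (an illustrative choice; the papers in question use their own models and scoring schemes) should be lower for the musician statement than for the dancer one.

```python
# Use sentence perplexity under a pre-trained LM as a (noisy) plausibility proxy.
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(sentence):
    """Lower perplexity = the model finds the sentence more plausible."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss   # mean token NLL
    return math.exp(loss.item())

print(perplexity("A musician plays a musical instrument."))  # expected: lower
print(perplexity("A dancer plays a musical instrument."))    # expected: higher
```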

:white_check_mark: They can, to some extent, associate concepts with their properties. They distinguish concepts associated with a given set of properties, i.e. complete a statement such as “A ___ has fur, is big, has claws, has teeth, is an animal, …” with bear (much like playing the “20 Questions” game). They perform better when they are given encyclopedic properties (e.g. is an animal) than perceptual properties (e.g. smooth). They can also, fairly successfully, list the properties associated with a given concept, e.g. complete the sentence “Everyone knows that a bear has ___” with fur, claws, teeth, etc.
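
The “20 Questions”-style probe can be sketched with a masked language model: score candidate concepts as fillers for a masked slot. The prompt and candidate list below are made up for illustration; the actual papers use their own templates.

```python
# Score candidate concepts for a property description with BERT's masked LM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

prompt = "A [MASK] has fur, is big, has claws, has teeth, and is an animal."
candidates = ["bear", "dog", "table", "banana"]

enc = tokenizer(prompt, return_tensors="pt")
mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
with torch.no_grad():
    probs = model(**enc).logits[0, mask_pos].softmax(dim=-1)

for word in candidates:
    # "bear" should get the highest probability among the candidates.
    print(word, probs[tokenizer.convert_tokens_to_ids(word)].item())
```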

However, knowledge generated from language models is noisy!

:no_entry_sign: Several papers have shown that language models are not sensitive to negation, i.e. they consider the negated version of a fact (“birds can’t fly”) about as plausible as the original fact.

:no_entry_sign: They are sensitive to phrasing: the same fact, worded slightly differently, can receive a very different score.

:no_entry_sign: In distributional word vectors, the vector representing a (sub-)word is learned from the contexts in which it appears, leading to similar representations for semantically-similar words. In language models, similar contexts get similar representations, so the model learns which type of word should appear next (or in place of a masked token). This is generally a positive thing, but it sometimes over-generalizes, leading to examples such as the one in Figure 4:


Figure 4: BERT guesses that the masked token should be a color, but fails to predict the correct color. Using the AllenNLP demo.

Here, BERT has seen enough sentences of the type “The color of something is [color]” in its training corpus to learn that a color should fill the masked slot, so it suggests various colors as substitutes for the masked word. Unfortunately, not every color is suitable in every context that calls for a color. BERT likely didn’t see enough sentences discussing the color of a dove, so it defaults to predicting any color.
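
The behaviour in Figure 4 is easy to reproduce with the `transformers` fill-mask pipeline (an assumed stand-in for the AllenNLP demo; the model behind the demo may differ):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The color of a dove is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
# The top predictions are typically colors, but not necessarily the right one.
```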

So the knowledge captured in language models is not particularly accurate or reliable. Is it still useful?

Yes, to some extent. One way to show this is through evaluation on tasks that require commonsense knowledge. We will discuss several such tasks, but for now let’s focus on WinoGrande as an example. It is a large-scale version of the Winograd Schema Challenge. Given a sentence with a blank (cloze), the goal is to fill in the blank with a previously mentioned entity or concept, choosing between two answer choices. For example:

Because Brett found an internship while in college but Ian was unable to, _____ found a job less quickly after graduation.

Choices: Brett, Ian

What makes this task especially difficult is that every instance has a twin sentence which is minimally changed so that the correct answer flips to the other choice (for instance, replacing “less quickly” with “more quickly” changes the correct answer from Ian to Brett).

Language model-based models top the leaderboards of WinoGrande and other commonsense tasks, but since they are fine-tuned on task-specific training data, often tens or hundreds of thousands of examples, it’s hard to attribute their success to the knowledge captured in language models during pre-training. A better way to estimate that knowledge is with zero-shot (unsupervised) models. Typically, zero-shot models address a multiple-choice task by phrasing a statement from the instance and each answer choice, and computing the language model score of each statement as a proxy for its plausibility:

P_LM(The answer is answer_1)

P_LM(The answer is answer_2)

…

P_LM(The answer is answer_k)

And then predicting the answer choice with the best language model score (the highest probability, or equivalently, the lowest perplexity).
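
Here is a minimal sketch of that zero-shot recipe, applied to the WinoGrande example from above: plug each answer choice into the blank, score the resulting statement with a pre-trained language model, and predict the choice with the lowest perplexity. GPT-2 and the exact statement template are illustrative assumptions, not the setup of any specific paper.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text):
    """Mean-token perplexity of `text` under the language model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

instance = ("Because Brett found an internship while in college but Ian was "
            "unable to, _ found a job less quickly after graduation.")
choices = ["Brett", "Ian"]

# Lowest perplexity = most plausible completion according to the language model.
scores = {c: perplexity(instance.replace("_", c)) for c in choices}
print(scores, "->", min(scores, key=scores.get))
```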