How to gather and represent machine-readable commonsense knowledge?

Commonsense resources provide machine-readable knowledge about the world. Resources are expected to be large-scale and accurate, to cover diverse knowledge types, and to be usable in downstream tasks. ConceptNet is a large (21 million assertions), commonly used resource of general commonsense knowledge in over 85 languages. ATOMIC consists of 880,000 triplets that reason about the causes and effects of everyday situations. Other resources are listed in Figure 8.

Figure 8: Overview of existing commonsense resources. Image credit: Maarten Sap.

Existing resources differ in several aspects:

Representation: how is knowledge represented in the resource? ConceptNet and ATOMIC represent knowledge in natural language (Figure 9), while NELL and Cyc represent knowledge in symbolic logic, for example:

(#$implies (#$and (#$isa ?OBJ ?SUBSET) (#$genls ?SUBSET ?SUPERSET)) (#$isa ?OBJ ?SUPERSET))
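The rule above reads: if an object is an instance of a collection, and that collection is a subset of a larger collection, then the object is also an instance of the larger collection. As a rough illustration only, here is a minimal Python sketch that applies this one rule to a handful of toy facts. The facts (`Fido`, `Dog`, `Mammal`, `Animal`) and the function name `apply_inheritance_rule` are hypothetical and not taken from Cyc, and this is not how Cyc's inference engine works internally.

```python
# Toy facts, invented for illustration (not taken from Cyc).
isa_facts = {("Fido", "Dog")}                             # (#$isa ?OBJ ?SUBSET)
genls_facts = {("Dog", "Mammal"), ("Mammal", "Animal")}   # (#$genls ?SUBSET ?SUPERSET)


def apply_inheritance_rule(isa_facts, genls_facts):
    """Apply the rule repeatedly until no new isa fact can be derived."""
    derived = set(isa_facts)
    changed = True
    while changed:
        changed = False
        for obj, subset in list(derived):
            for sub, superset in genls_facts:
                # isa(obj, subset) and genls(subset, superset) => isa(obj, superset)
                if sub == subset and (obj, superset) not in derived:
                    derived.add((obj, superset))
                    changed = True
    return derived


facts = apply_inheritance_rule(isa_facts, genls_facts)
print(sorted(facts))
# [('Fido', 'Animal'), ('Fido', 'Dog'), ('Fido', 'Mammal')]
```

A symbolic representation like this supports exact inference, but every fact and rule has to be written in the formal vocabulary first.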

Figure 9: Example knowledge extracted from ConceptNet and ATOMIC. Image credit: Maarten Sap.

Knowledge type: ConceptNet consists of semantic knowledge, i.e., properties of concepts (e.g., reading is a type of activity). ATOMIC, on the other hand, is inferential: given a templated event in which “PersonX” is the subject and “PersonY” is an optional object (e.g., PersonX yells at PersonY), and one of 9 predefined relation dimensions (e.g., PersonX’s motivation), it provides a second event (e.g., PersonX wanted to express anger).
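To make the contrast concrete, both kinds of knowledge can be stored as (head, relation, tail) triplets whose head and tail are natural language phrases. The sketch below is only an illustration: the `Assertion` type, the `TEMPLATES` dictionary, and the `verbalize` helper are hypothetical names, and the relation labels `IsA` and `xIntent` are used here on the assumption that they correspond to the "is a type of" and "PersonX's motivation" relations described above.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    """A (head, relation, tail) triplet with natural language head and tail."""
    head: str
    relation: str
    tail: str


# Semantic knowledge: a property of a concept.
semantic = Assertion("reading", "IsA", "activity")

# Inferential knowledge: a templated event, a relation dimension, a second event.
inferential = Assertion("PersonX yells at PersonY", "xIntent", "to express anger")

# Hypothetical templates for turning a triplet back into a sentence.
TEMPLATES = {
    "IsA": "{head} is a type of {tail}",
    "xIntent": "{head} because PersonX wanted {tail}",
}


def verbalize(assertion):
    """Render a triplet as a sentence using the template for its relation."""
    template = TEMPLATES[assertion.relation]
    return template.format(head=assertion.head, tail=assertion.tail)


print(verbalize(semantic))     # reading is a type of activity
print(verbalize(inferential))  # PersonX yells at PersonY because PersonX wanted to express anger
```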

Collection method: knowledge can be collected from humans, either experts or crowdsourcing workers. Expert-curated resources are more uniform and accurate, and may use complex representations, but expert curation is expensive and very time-consuming. Alternatively, non-experts can write knowledge in natural language, which makes collection faster and more scalable.

Another approach is to extract knowledge automatically from texts, as in NELL. This approach works, but it produces less accurate knowledge. In addition, it suffers from reporting bias: text over-represents the rare at the expense of the trivial. For example, people are reported to murder more often than they are reported to breathe, and default properties of concepts (yellow banana) are mentioned less often than their alternatives (green banana).
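Reporting bias is easy to see once mentions are counted. The sketch below uses a tiny corpus that is entirely made up for illustration (it is not real data, and the counts carry no empirical weight). The `VERB_FORMS` mapping and the `count_verbs` helper are hypothetical names. The point is only that a frequency-based extractor sees what people bother to write down, not what actually happens most often.

```python
import re
from collections import Counter

# Made-up sentences for illustration only; not drawn from any real corpus.
corpus = [
    "The man murdered his neighbor.",
    "She was murdered last night.",
    "Police say the suspect murdered two people.",
    "He breathed a sigh of relief.",
]

# Hypothetical mapping from surface forms to the verbs we want to count.
VERB_FORMS = {"murdered": "murder", "breathed": "breathe"}


def count_verbs(sentences):
    """Count how often each tracked verb is mentioned in the sentences."""
    counts = Counter()
    for sentence in sentences:
        for token in re.findall(r"[a-z]+", sentence.lower()):
            if token in VERB_FORMS:
                counts[VERB_FORMS[token]] += 1
    return counts


counts = count_verbs(corpus)
print(counts["murder"], counts["breathe"])  # 3 1
```

In this toy corpus, murder is mentioned three times as often as breathing, even though everyone breathes constantly. A system that treats frequency of mention as frequency in the world would draw the wrong conclusion.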