Adam Kilgarriff, Longman Dictionaries 

The Myth of Completeness and Some Problems with Consistency 
(The Role of Frequency in Deciding What Goes in the Dictionary)


Abstract
At the core of lexicography there is an ill-acknowledged,
subjective notion of importance.  Important words need fuller
treatment.  When people talk about consistency in dictionaries,
this wrinkle is often overlooked.  There is an analogous situation
in theoretical lexicology and NLP, where the lure of elegant, rule-
based theories has taken precedence, and 'importance' largely
ignored.  If lexicography is to become more objective, the concept
of importance must be closely studied with a view to finding an
objective surrogate for it.  This will come from corpora.  As yet
we have very little idea of which facts from which corpora are
relevant, but if we do not take on the challenge, lexicography is
forever doomed to ineffable subjectivity.


1. Introduction

As every scrabble-player and crossword enthusiast knows, the
dictionary is - or damn well ought to be - complete, listing
everything that is an English word (and nothing that isn't) (1). 
In the literature, one finds rather more considered voices
asserting that a dictionary, or dictionaries in general, are
incomplete in a particular area - or, more frequently, that they
are inconsistent, with words displaying similar behaviour receiving
different treatments, this said with shaking head and expression of
world-weary disillusion.  The argument of this paper is that,
firstly, completeness is a myth.  A dictionary can no more be
complete than incomplete; the concept simply does not apply.  And
secondly, claims of consistency or inconsistency are less
straightforward than they might seem, and are in general unproven
until grounded in corpus evidence.


2. The evidence 

A first indication of the problem is exposed by comparison of the
following two entries from LDOCE2 (Summers 1987): 

whisky 1 a strong alcoholic drink  2 a glass of whisky
bourbon  a type of whiskey

"Why", the advocate of completeness and consistency might say "is
the 'glass-of' sense of bourbon not presented? The dictionary is
incomplete. It is also inconsistent.  The two words can both be
used in both ways, so should receive the same treatment in the
dictionary."  ('Alternations' such as that between the 'drink'
sense of drink-words and the glass-of sense are discussed, under
various names by various authors, inter alia Apresjan (1973), Leech
(1981), Ostler and Atkins (1991), Kilgarriff (1992).) 

A second example: opportunity, privilege and indignity are all
nouns with interesting subcategorisation behaviour.  But whereas
opportunity has over 2000 entries in the Longman Lancaster corpus,
privilege has 300, and indignity just 51.  The dictionary user is
much more likely to encounter - or, for the non-native speaker, to
want to know how to use - the full range of subcategorisations with
opportunity than with privilege or indignity.  There is a strong
case for giving opportunity the fullest treatment, privilege an
intermediate one, and indignity, a summary one. 

Of course every lexicographer recognises the problem and
understands that, in short, more important words require fuller
treatment.  But every lexicographer also knows that there is
nothing black and white about what is an important word, and many
of the most difficult lexicographic judgements lie around questions
of  "is this sense (or multi-word unit or grammatical pattern)
sufficiently important to require its own treatment?"  Because
'importance' is a matter of degree, well-designed dictionaries have
a variety of strategies for indicating various shades of grey.  In
LDOCE2, a sense that is the result of an alternation is rolled into
the same sense as the primary meaning for the word (using a
bracketing convention) where the alternate meaning is not very
important.  Where it is more important, it gets its own sense. In
the Longman Language Activator, a more important multi-word unit is
treated as a distinct headword with a definition, whereas a less
important one is presented in bold, sometimes with a gloss, with an
example but without a definition.  In Kilgarriff (1992) I describe
eight formally distinct strategies used in LDOCE2 for indicating
different shades of grey.


3. Theoretical considerations: rule systems

Completeness is a well-defined term in the mathematics of
rule-systems.  A theory is complete if every true theorem can be
derived, from axioms and rules of inference.  Since Chomsky's
Aspects, the study of syntax has been virtually synonymous with the
study of a particular variety of rule systems, so the mathematics
of rule systems has been central to the field of study. 

The competence/performance distinction allows people working in
this tradition to view the data as evidence for a rule system of
one kind or another.  For the study of syntax, the
competence/performance distinction has been amply defended and the
paradigm of syntax-as-rule-system has been highly successful.  But
the distinction does need a lot of defending; it is an idealisation
which allows us to disregard some of the things people say which
might cause problems for our theory, simply by throwing them in the
rubbish bin of 'performance'. If the distinction is too easily
invoked, it undermines objectivity and makes theories
unfalsifiable.  For syntax, the distinction has proved its worth
(though see Sampson (1987) for arguments against).  For lexis, it
has not.

Rule systems are very attractive to researchers.  They allow us to
describe and encode a range of phenomena elegantly and concisely.
They are particularly enticing for NLP: a small core lexicon can
describe a large number of word senses if it includes rules which
multiply a single lexical entry out as a number of senses.  For
example, rule-based procedures can add in the 'glass-of' senses of
bourbon and whisky when the core lexicon contains only the 'liquid'
sense.  This approach, espoused in, for example, Pustejovsky (1991)
and Levin (1993), offers the prospect of greatly increased
coverage, efficiency savings, and theoretical elegance.  Does it
offer the prospect of a complete lexicon?

It would work like this.  Each lexical entry would have, in
addition to its core meaning, an account of its class membership
(or memberships).  For each class, the alternations (whereby, for
example, the 'liquid' sense gives rise to the 'glass-of' sense for
all 'drinks' words) are listed.  Provided we capture all the
class-memberships for all the words, and all the alternations that
apply to each class, the fully multiplied-out dictionary would be
complete.

The prospect is alluring.  But it is grounded in the
competence/performance distinction and a model of lexis as
rule-system.  When Morticia pours herself a hemlock from the Addams
Family bar, hemlock is operating in a 'glass-of' sense.  But it
would take a large amount of forethought on a lexicographer's part
to classify hemlock as a 'drink' word, in order that it might
participate in the alternation.  And if hemlock is to count as a
drink, what other unusual word-uses must we anticipate?  There is
a risk that we shall want to say any word might be used for
anything.  Anything goes.  Our lexicon is quite useless; in tells
us not only that horse can mean 'horse', but also that it can mean
'cow', 'cheese', 'lexicon' or anything else.  


3.1  Levin's English Verb Classes and Alternations

The rule-system approach will only work if both words and
alternations can be satisfactorily classified, and a satisfactory
method is found for identifying nonce cases, like Morticia's use of
hemlock, so they can be set aside for separate treatment.  One 
substantial piece of work that addresses two of these three issues
is Levin's English Verb Classes and Alternations (1993).  Levin has
classified over 3000 English verbs into 192 classes, and has
identified 80 syntactic alternations.  Associated with each class
is a list of alternations that can be applied, and a list of those
that cannot.  Levin's hypothesis is that a verb's meaning
determines its syntactic behaviour, so the alternations she
considers are those which relate two different syntactic frames for
a class of verbs, as in the relation between the following:

Martha carved a toy out of wood for the baby.
Martha carved the baby a toy out of wood.

Levin's work is a major contribution to our understanding of the
behaviour of English verbs, and is a resource for lexicography and
NLP alike.  We join her in believing it will "pave the way toward
the development of a theory of lexical knowledge" (p 1).  But work
of this kind requires a complement.  She is committed to the notion
of lexis as rule-system, and the competence/performance
distinction, arguing that

"native speakers can make extremely subtle judgments concerning the
occurrence of verbs with a wide range of possible combinations of
arguments and adjuncts in various syntactic expressions." (p 2)

Her discussion makes no mention of nonce cases, or the dubious
status of cases where native speakers are less than unanimous.  She
is addressing those areas of lexical semantic knowledge most
closely linked to syntax, and correspondingly, the methods and
assumptions used for studying syntax serve her well.  The only
alternations she addresses are those manifested in distinct
subcategorisation frames, so she has chosen a domain where the
identification and individuation of alternations is relatively
straightforward.  But a theory of lexical knowledge must cover also
those areas where alternations are not so easily identified, and
must make sense of the kline from standard cases to nonce cases and
one-offs. 


4. The Horns of the Dilemma
 
"There are two kinds of science: physics, and stamp collecting"
(Rutherford).  In this context, rule-systems bear the hallmark of
physics.  They allow for the concise statement of generalisations,
and have predictive power.  If they are inappropriate, must we be
stamp-collectors, merely recording lexical facts, generalisations
forbidden, immune to accusations of inconsistency because, with
generalisations renounced, there are never any grounds for
asserting consistency or inconsistency?  If all claims that two
words fall in the same class are suspect, debates concerning
whether two words ought to be treated similarly are of little
concern. 

The theoretical issues lead back to the lexicographical question.
There are rules and generalisations at play in lexis, and these
have a role in theoretical lexicology, practical lexicography and
NLP.  But their role is not untrammelled: it is constrained by the
elusive property of importance.  Because whisky is more important
than bourbon, it is reasonable for LDOCE2 to describe the glass-of
sense of the one but not the other.  Because 'drink' is a more
important classification for bourbon than for hemlock, it is
appropriate for an NLP lexicon to include the 'glass-of' sense
among the implicit, rule-derived senses of the former but not the
latter.  


5.  Enter the corpus

The advent of the corpus offers a promise of objectivity, and thus
theoretical well-foundedness.  In short, importance is frequency. 
But which frequency?  In what corpus?  The corpus has opened a
hornet's nest. 

The most obvious and pressing question is representativeness.  If
statements about general language are to be based on frequencies in
a particular corpus, then if that corpus is skewed, so are the
claims made on the basis of it.  For general language, it is far
from clear what larger population a sample should be representative
of.  These are urgent questions currently receiving a substantial
amount of attention (Biber 1993, Summers 1993).  But there are many
other questions which would remain even if the corpus were
representative.  

At Longman we have experimented with the idea that the number of
corpus lines a lexicographer should look at should be related to
the frequency of the word.  High frequency words tend to have more
meanings, be more 'important', and are worthy of more corpus study.
This seemed straightforward.  I gave the matter some thought,
determined that a logarithmic relationship was appropriate, and
drew up a table stating, for corpus frequency X, what sample size
Y should be.  But when I tried to apply it, it was rapidly thrown
back at me.  I had not taken the sophistication of how
lexicographers work with the corpus into account.  For most words,
a substantial proportion of their corpus lines can be accounted for
by a small number of collocates.  Thus half the corpus lines for
ranks are accounted for by a preceding close or closed. From the
point of view of not missing interesting corpus lines, it is the
lines not accounted for by close or closed which must be closely
studied: once the set expression is noted, there is no need for the
lexicographer to spend much longer looking at examples of it.  So,
for purposes of focusing the lexicographer's attention to maximum
advantage, X should be the frequency of ranks appearing without   
a preceding close or closed, and Y should be drawn from that
population.  But that population cannot be defined for the general
case, and even where it can be defined it cannot readily be
counted.  The moral of the story is that frequencies rarely have a
simple story to tell, and while lexicography desperately needs
them, they raise innumerable difficult practical and theoretical
questions. 


6. Conclusion

Completeness is not the sort of thing that applies to dictionaries. 
Scepticism is in order in relation to some accusations of
inconsistency.  Is LDOCE2 inconsistent between its treatment of
bourbon and whisky?  Only on a narrow-minded and, in the last
analysis, untenable account of consistency.  We need new models of
what it is for a dictionary to be consistent, which go beyond the
model offered by boolean rule-systems and take corpus evidence into
account.  We have as yet very little understanding of how corpus
frequencies might be integrated with rule-systems.  But at present,
at the foundations of lexicography we find the shifting sands of
'importance'.  Only as we clarify this notion, by finding its
objective correlates in the corpus, will the foundations be secure.

 
Notes

(1) Or any other language: my work has been on monolingual English
dictionaries, so here I apologise for speaking as if English were
the only language.


References

Apresjan, J. 1974. 'Regular Polysemy'. Linguistics 142: 5-32.

Biber, D. 1993. 'Using Register-Diversified Corpora for General
Language Studies.' Computational Linguistics 19(2): 219-242. 

Kilgarriff, A. 1992. Polysemy. Dphil Thesis, University of Sussex.

Leech, G. 1974.  Semantics. Penguin.

Levin, B. 1993. English Verb Classes and Alternations.  University
of Chicago Press.

Ostler, N. and Atkins, B. T. S. 'Predictable Meaning Shift: some
linguistic properties of lexical implication rules.' In J.
Pustejovsky and S. Bergler (eds.)  Lexical Semantics and Knowledge
Representation: ACL SIGLEX Workshop, Berkeley, California.

Pustejovsky, J. 1991. 'The Generative Lexicon.' Computational
Linguistics 17(4): 409-441.

Sampson, G. 1987.  'Evidence Against the Grammatical-Ungrammatical
divide.'

Summers, D. 1987. Longman Dictionary of Contemporary English, New
Edition. (LDOCE2)  Harlow: Longman.

Summers, D. 1993. 'Longman/Lancaster English Language Corpus -
Criteria and Design.' International Journal of Lexicography 6(3):
101-208.