Comparing word frequencies across corpora:
Why chi-square doesn't work, and an improved LOB-Brown comparison.
Adam Kilgarriff
ITRI
University of Brighton
1. Introduction
We are often interested in discovering which words are markedly
different in their distribution between two texts or two corpora. In
this paper I show that one statistic which has sometimes been used for
this purpose, chi-square, is inappropriate. I present an alternative,
the Mann-Whitney ranks test. I apply the test to finding the words
which are most different between the LOB and Brown corpora and show
that it produces output that is well suited to the interests of
lexicographers and humanities scholars.
2. A simple framework
For two texts, which words best characterise their
differences? For word w in texts X and Y, this might be
represented in a contingency table as follows:
------------------------------------------
| | X | Y | |
------------------------------------------
| w | a | b | a+b |
| not w | c | d | c+d |
------------------------------------------
| | a+c | b+d | a+b+c+d=N |
------------------------------------------
There are a occurrences of w in text X (which contains a+c
words) and b in Y (which has b+d words).
3. The chi-square test
We now need to relate our question to a hypothesis we can test. The
obvious candidate is the null hypothesis that both texts comprise
words drawn randomly from some larger population; for a contingency
table of dimensions m x n, if the null hypothesis is true, the
statistic
         (O - E)^2
    SUM( --------- )      (for `^2' read `squared')
             E
(where O is the observed value, E is the expected value calculated on
the basis of the joint corpus, and the sum is over the cells of the
contingency table) will be chi-square-distributed with (m-1)x(n-1)
degrees of freedom (provided all expected values are over a threshold
of 5.) For our 2x2 contingency table the statistic has one degree of
freedom and we apply Yates' correction, subtracting 1/2 from |O-E|
before squaring. Wherever the statistic is greater than the critical
value of 7.88, we conclude with 99.5% confidence that, in terms of the
word we are looking at, X and Y are not random samples of the same
larger population.
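The calculation can be sketched in a few lines of Python (a minimal illustration of the statistic described above; the function name and the example counts are mine, not from the paper):

```python
def yates_chi_square(a, b, c, d):
    """Chi-square statistic, with Yates' continuity correction, for the
    2x2 contingency table [[a, b], [c, d]]: a and b are counts of word
    w in texts X and Y; c and d are counts of all other words."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected counts computed from the row and column marginals.
    expected = [(a + b) * (a + c) / n, (a + b) * (b + d) / n,
                (c + d) * (a + c) / n, (c + d) * (b + d) / n]
    return sum((abs(o - e) - 0.5) ** 2 / e
               for o, e in zip(observed, expected))

# A word seen 100 times in a million-word X and 60 times in a
# million-word Y: the statistic is about 9.51, above the 7.88
# critical value, so the null hypothesis would be rejected.
stat = yates_chi_square(100, 60, 999900, 999940)
```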
This is the strategy adopted by Hofland and Johansson (1982) to
identify where words are more common in British than American English
or vice versa. X was the LOB corpus, Y was the Brown, and, in the
table where they make the comparison, the chi-square value for each
word is given, with values marked where they exceed the critical
value at any of three significance levels, so one might infer that
the LOB-Brown difference for that word is non-random.
Looking at the LOB-Brown comparison, we find that this is true for very
many words, and for almost all very common words. Most of the time,
the null hypothesis is defeated. Does this show that all those words
have systematically different patterns of usage in British and American
English?
To test this, I took two corpora which were indisputably of the same
language type: each was a random subset of the written part of the
British National Corpus (BNC). The sampling was as follows: all texts
shorter than 20,000 words were excluded. This left 820 texts, for
each of which a frequency list for the first 20,000 running words was
generated. Half the lists were then randomly assigned to each of two
subcorpora. Frequency lists for each subcorpus were generated. For each
word occurring in either subcorpus, the (|O-E|-0.5)^2/E term which would
have contributed to a chi-square calculation was determined. If the
two corpora were random samples of words --not texts-- drawn from the
same population, the error term would not vary systematically with the
frequency of the word, and the average error term would be 0.5. In
fact, as the table shows, average values for the error term are far
greater than that, and tend to increase as word frequency increases.
===============================================================
Class First item in class Mean error term
(Words in freq. order) Word POS for items in
class
---------------------------------------------------------------
First 10 items the DET 18.76
Next 10 items for PREP 17.45
Next 20 items not NOT 14.39
Next 40 items have V-BASE 10.71
Next 80 items also ADV 7.03
Next 160 items know V-INF 6.40
Next 320 items six CARD 5.30
Next 640 items finally ADV 6.71
Next 1280 items plants N-PL 6.05
Next 2560 items pocket N-SING 5.82
Next 5120 items represent V-BASE 4.53
Next 10240 items peking PROPER 3.07
Next 20480 items fondly ADV 1.87
Next 40960 items chandelier N-SING 1.15
===============================================================
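The per-word error term used in this experiment is straightforward to compute; a minimal sketch (the function name and the equal-halves example are mine):

```python
def error_term(o, joint, n_sub, n_total):
    """One w-cell's (|O-E|-0.5)^2 / E contribution to the chi-square
    statistic: o is the word's count in one subcorpus of n_sub words,
    joint its count in the joint corpus of n_total words, and E the
    expected count under the null hypothesis of random sampling."""
    e = joint * n_sub / n_total
    return (abs(o - e) - 0.5) ** 2 / e

# Equal-sized halves: a word with 50 joint occurrences is expected
# 25 times in each half; an observed 30/20 split gives 0.81, already
# above the 0.5 that the null hypothesis predicts on average.
term = error_term(30, 50, 1, 2)  # subcorpus is 1/2 of the joint corpus
```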
As the averages indicate, the error term is very often greater than
0.5 x 7.88 = 3.94, the relevant critical value of the chi-square
statistic. As in the LOB-Brown comparison, for very many words,
including most common words, the null hypothesis is defeated.
This reveals a bald, obvious fact about language. Words are not
selected at random. There is no a priori reason to expect them
to behave as if they had been, and indeed they do not. The LOB-Brown
differences cannot in general be interpreted as British-American
differences: it is in the nature of language that any two collections
of texts, covering a wide range of registers (and comprising, say,
less than a thousand samples of over a thousand words each) will show
such differences. While it might seem plausible that oddities would
in some way balance out to give a population that was
indistinguishable from one where the individual words (as opposed to
the texts) had been randomly selected, this turns out not to be the
case.
Let us look more closely at why this occurs. A key word in the last
paragraph is `indistinguishable'. In hypothesis testing, the
objective is generally to see if the population can be distinguished
from one that has been randomly generated --or, in our case, to see if
the two populations are distinguishable from two populations which
have been randomly generated on the basis of the frequencies in the
joint corpus. Since words in a text are not random, we know that our
corpora are not randomly generated. The only question, then, is
whether there is enough evidence to say so with confidence. In
general, where a word is more common, there is more
evidence. This is why a higher proportion of common words than of
rare ones defeat the null hypothesis.
The original question was not about which words are random but about
which words are most distinctive. It might seem that these are
converses, and that the words with the highest values for the
chi-square statistic -- those for which the null hypothesis is most
soundly defeated -- will also be the ones which are most distinctive
to one corpus or the other. Where the overall frequency for a word in
the joint corpus is held constant, this is valid, but as we have seen,
for very common words, high chi-square values are associated with the
sheer quantity of evidence and are not necessarily associated with a
pre-theoretical notion of distinctiveness.
4. Burstiness
As Church and Gale (1995) say, words come in bursts; unlike lightning,
they often strike twice. Where a word occurs once in a text, you are
substantially more likely to see it again than if it had not occurred
once. A single document containing w is relatively likely to contain
a 'burst' of w's, so whichever corpus contains that document will
contain more w's than is compatible with the null hypothesis. We
require a test which does not give undue weight to single documents
with a high count for w.
A test meeting this criterion is the Mann-Whitney (also known as
Wilcoxon) ranks test (1). To perform this test, we use frequency of
occurrence to rank the data, and then use ranks rather than frequency
to compute the statistic. The test proceeds as follows. The corpora
to be compared are each divided into a number of equal-sized parts
(for purposes of illustration, we use five). Suppose the frequencies
for X are
12 24 15 18 88
and for Y are
3 3 13 27 33
As the subcorpora that these frequencies are based on are all of the
same size, the figures are directly comparable. They are now placed
in rank order, a record being kept of the corpus they come from:
Count:    3   3  12  13  15  18  24  27  33  88
Corpus:   Y   Y   X   Y   X   X   X   Y   Y   X
Rank:     1   2   3   4   5   6   7   8   9  10
The ranks associated with the corpus with the smaller number of
samples (or either, where, as here, there are equal numbers for each)
are summed: for Y, 1+2+4+8+9=24. This sum is compared with the value
that would be expected, on the basis of the null hypothesis. These
values are tabulated (at various significance levels) in statistics
textbooks. If the null hypothesis were true, 95% of the time the
statistic would be in the range 18.37--36.63: 24 is within this range,
so there is no evidence against the null hypothesis.
A complication arises where two samples have the same number of hits
so cannot be straightforwardly ranked. Recommended practice is to
run the test twice, first giving the X's the higher ranks within
each tie, then giving the Y's the higher ranks. If the two runs
lead to different conclusions, the test is not applicable.
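The worked example above can be reproduced in a few lines (a sketch; library routines such as scipy.stats.mannwhitneyu compute an equivalent U statistic rather than the raw rank sum used here):

```python
def rank_sum(x, y):
    """Pool the per-sample counts from the two corpora, rank them in
    ascending order (1-based), and sum the ranks belonging to y.
    Assumes no count is tied across the two corpora; ties within one
    corpus, like the two 3's in y below, do not affect its rank sum."""
    pooled = [(v, "x") for v in x] + [(v, "y") for v in y]
    pooled.sort(key=lambda pair: pair[0])
    return sum(rank for rank, (_, src) in enumerate(pooled, start=1)
               if src == "y")

# The five-sample example from the text: Y's ranks are 1+2+4+8+9 = 24,
# inside the 18.37--36.63 range, so no evidence against the null.
w = rank_sum([12, 24, 15, 18, 88], [3, 3, 13, 27, 33])
```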
5. LOB-Brown comparison
The LOB and Brown both contain 2,000-word-long texts, so the numbers
of occurrences of a word are directly comparable across all samples in
both corpora. Had all 500 texts from each of LOB and Brown been used
as distinct samples for the purposes of the ranks test, most counts
would have been zero for all but very common words and the test would
have been inapplicable. To make it applicable, it was necessary to
agglomerate texts into larger samples. Ten samples for each corpus
were used, each sample comprising 50 texts and 100,000 words. Texts
were randomly assigned to one of these samples (and the experiment was
repeated ten times, to give different random assignments, and the
results averaged.) Experimentation showed that
most words with a frequency of 30 or more in the joint LOB and Brown
had few enough zeroes for the test to be applicable, so tests were
carried out for just those words, 5,733 in number.
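The agglomeration step can be sketched as follows (the function name is mine; the paper simply assigns each corpus's 500 texts at random to ten 50-text samples):

```python
import random

def agglomerate(text_ids, n_samples=10, seed=0):
    """Randomly partition a corpus's text identifiers into n_samples
    equal-sized samples. For LOB or Brown: 500 texts -> 10 samples of
    50 texts, i.e. 100,000 words each."""
    ids = list(text_ids)
    random.Random(seed).shuffle(ids)
    size = len(ids) // n_samples
    return [ids[i * size:(i + 1) * size] for i in range(n_samples)]

# Ten samples of 50 texts each; repeating with different seeds gives
# the ten random assignments whose rank-sum results are averaged.
samples = agglomerate(range(500))
```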
The results were as follows. For 3,418 of the words, the null
hypothesis was defeated (at the 97.5% confidence level). In corpus
statistics, this sort of result is not surprising. Few words comply
with the null hypothesis, but then the null hypothesis has little
appeal: there is no a priori reason to expect any word to have exactly
the same frequency of occurrence on both sides of the Atlantic. We are
not in fact concerned with whether the null hypothesis holds: rather,
we are interested in the words that are furthest from it. The minimum
and maximum possible values for the statistic were 55 (the sum of
ranks 1 to 10) and 155 (the sum of ranks 11 to 20), with a mean of
105. We define a threshold for 'significantly British' (sB) of 75,
and for 'significantly American' (sA) of 135.
The distribution curve was 'bell-shaped', one tail being sA and the
other sB. There were 216 sB words and 288 sA words. They showed the
same spread of frequencies as the whole population: the inter-quartile
range for joint frequencies for the whole population was 44--147; for
the sA it was 49--141 and for sB, 58--328. In contrast to the
chi-square test, frequency-related distortion had been avoided.
The sA and sB words were classified as follows:
----------------------------------------------------------------------
Code Mnemonic Example sA sB
----------------------------------------------------------------------
s Spelling color/colour; realise/realize 30 23
e Equivalent toward/towards; flat/apartment 15 17
n Name los, san, united; london, africa, alan 45 24
c Cultural negro, baseball, jazz; royal, chap, tea 38 26
? Unclear e, m, w ... (to be investigated) 6 10
o Other 154 116
----------------------------------------------------------------------
Totals 288 216
----------------------------------------------------------------------
The items with distinct spellings occupied the extreme tails of the
distribution. All other items were well distributed.
The first four categories serve as checks: if we had not identified
the items in these classes as sA and sB, then our method would not
have been working. It is the items in the 'others' category which are
interesting. The three highest-scoring sA 'others' are 'entire', 'several' and
'location'. None of these are identified as particularly American (or
as having any particularly American uses) in any of four 1995
learners' dictionaries of English (LDOCE3, OALDCE5, CIDE, COBUILD2),
all of which claim to cover both varieties of the language. Of
course it does not follow from the frequency difference that there is
a semantic or other difference that a dictionary should mention, but
the `others' list does provide a list of words for which
lexicographers might want to examine whether there is some such
difference.
Notes
(1) A survey of other statistics which have been used for this purpose
is available in Kilgarriff (1996).
Acknowledgements
This work is supported by the EPSRC, Grant K18931, SEAL. The idea of
using the Mann-Whitney test emerged from a discussion with Ted
Dunning and Mark Lauer.
References
Kenneth Church and William Gale. 1995. Poisson Mixtures. Journal of
Natural Language Engineering, 1(2):163--190.
Knut Hofland and Stig Johansson. 1982. Word Frequencies in British and
American English. The Norwegian Computing Centre for the Humanities,
Bergen, Norway.
Adam Kilgarriff. 1996. Which words are particularly characteristic of a text?
A survey of statistical approaches. In: Language
Engineering for Document Analysis and Recognition. Proceedings,
AISB Workshop, Falmer, Sussex.