Douglas Biber. Dimensions of Register Variation: A cross-linguistic comparison. Cambridge University Press, 1995

Review by Adam Kilgarriff

This book carries forward the research program described in Biber (1988), extending the methodology described there into cross-linguistic and diachronic territory. The review will address the research program as a whole: the whole is covered thoroughly in this book, though various parts have already been described in the 1988 book and elsewhere.

As the title implies, Biber's goal is to chart the various ways in which language varies. This is a large goal, and might seem too broad to make for a coherent book. But Biber brings to the question a methodology which both gives coherence to the enterprise and makes a number of major advances on earlier work of this kind.

In this review, I first describe the methodology and present a thumbnail version of the analysis for English, to give a flavour of what it produces; then argue why researchers in language engineering should be interested; then outline the structure of the book; and finally comment on the book's results and conclusions.

1. The Multi-Dimensional (MD) Methodology

* Gather a set of text samples to cover a wide range of language varieties;
* Enter them (``the corpus'') into the computer;
* Identify a set of linguistic features which are likely to serve as discriminators for different varieties;
* Count the number of occurrences of each linguistic feature in each text sample;
* Perform a factor analysis (a statistical procedure) to identify which linguistic features tend to co-occur in texts. The output is a set of ``dimensions'', each of which carries a weighting for each of the linguistic features;
* Interpret each dimension, to identify what linguistic features, and what corresponding communicative functions, high-positive and high-negative values on the dimension correspond to.
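
The counting step, and the way a dimension's weightings turn feature counts into a single score for a text, can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the three feature detectors are crude, hypothetical regular expressions (a real study would need tagged text and the full feature set), the weights are invented for the example and are not Biber's values, and the factor analysis that would actually produce the weights is omitted, as is any standardisation of the counts.

```python
import re

# Hypothetical feature detectors: crude regular expressions standing in for
# the tagger-based feature counts a real MD study would need.
FEATURES = {
    "contractions": re.compile(r"\b\w+'(?:s|t|re|ve|ll|d|m)\b", re.IGNORECASE),
    "second_person_pronouns": re.compile(r"\b(?:you|your|yours)\b", re.IGNORECASE),
    "prepositions": re.compile(r"\b(?:of|in|on|for|with|by|at|from)\b", re.IGNORECASE),
}

# Invented weights, for illustration only (NOT Biber's values): positive for
# two features the review lists as "involved", negative for one it lists as
# "informational". In the real method these come from the factor analysis.
WEIGHTS = {
    "contractions": 1.0,
    "second_person_pronouns": 1.0,
    "prepositions": -1.0,
}

# A word is a run of letters, optionally with one internal apostrophe.
WORD = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def feature_rates(text):
    """Return the occurrences of each feature per 1,000 words of text."""
    n_words = len(WORD.findall(text))
    if n_words == 0:
        return {name: 0.0 for name in FEATURES}
    return {
        name: 1000.0 * len(pattern.findall(text)) / n_words
        for name, pattern in FEATURES.items()
    }


def dimension_score(text, weights=WEIGHTS):
    """Weight each feature rate and sum, giving one score for the text."""
    rates = feature_rates(text)
    return sum(weights[name] * rates[name] for name in weights)


conversation = "You don't know what you're saying, do you? I can't tell you."
official = ("The provisions of the act apply to payments made by the board "
            "in respect of claims from members.")

# Compare the two toy texts on the single illustrative dimension.
print(dimension_score(conversation) > dimension_score(official))
```

Rates are expressed per 1,000 words so that samples of different lengths can be compared; with the invented weights above, the conversational snippet lands on the positive side of the toy dimension and the official-sounding one on the negative side.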

For English, Biber identifies seven dimensions, numbered in decreasing order of significance (so dimension 1 accounts for the largest part of the non-randomness of the data, dimension 2 the next largest, etc.). The first he calls ``Involved versus Informational Production''. Texts getting high positive scores are typically spoken, typically conversations. Texts getting high negative scores are academic prose and official documents. The linguistic features with the highest positive weightings are ``private'' verbs ({\em assume, believe} etc.), {\em that}-deletion, contractions, present tense verbs, and second person pronouns. The linguistic features with the highest negative weightings are nouns, word length, prepositions, and type-token ratio. (There was a total of 67 linguistic features for English.)

Dimension two is ``Narrative versus Non-narrative Discourse'', for which past-tense verbs and third-person pronouns are the high-positive features, and fiction texts get the high positive scores. Dimension three is ``Situation-dependent versus Elaborated Reference'': broadcasts are the highest-scoring texts and official documents the lowest-scoring, with the positive linguistic features being time and place adverbials, and the negatives being relative clauses, phrasal co-ordination, and nominalisations.

Any text can be given a score for any dimension, by counting the numbers of occurrences of the linguistic features in the text, weighting, and summing.

2. Why is it important for language engineering?

Computational linguistics has often proceeded as if there were no language variation. A grammatical phenomenon is described, its implications for parsing discussed, and the question ``in what sort of language does it occur'' does not arise. This is all very well.
But when people start wanting to use computational linguistics for language engineering applications, the engineers will want to know ``do we have to deal with this one - or does it not occur in our type of language?'' So far the point is not an original one and the answer is simple: you need to look at your corpus. But will the engineer need to ask the same question for each of the few hundred grammatical and discourse constructions, and the few thousand anomalies in the lexicon, which a wide-coverage grammar and lexicon might contain? They might well need to do a great deal of weeding to get the system to a manageable size, or to get a parser or interpreter to respond fast enough, or to keep the set of parses being returned to a manageable number. The engineer needs an overview of how the application sublanguage relates to the whole range of the language.

The small amount of sublanguage work by computational linguists (Kittredge & Lehrberger, 1982; Grishman & Kittredge, 1986) is widely cited, and provides useful guidelines and case studies, but does not provide a method for locating a particular sublanguage within the whole range of varieties of a language. Nor, pre-Biber, does the sociolinguistic literature. Biber reviews this literature, and identifies desiderata for an objective study of language variation. It should:

* be corpus-based (for familiar reasons of objectivity and the possibility of analysing large corpora economically and consistently);
* consider the full range of registers for the language;
* look at a wide range of linguistic features;
* address interactions between linguistic features, and between linguistic features and communicative functions, without assuming
  - that the relationships are all-or-nothing, or
  - that there is just one dimension of variation underpinning an observed pattern;
* bring together quantitative methods (for identifying patterns of variation) and qualitative ones (for interpreting them).

No previous work meets more than a couple of these desiderata. The work of Biber and his colleagues resoundingly meets them all. For language engineering and for all other branches of language study concerned with variation, Biber provides an approach to describing a language variety which takes the full picture into account.

3. Structure of the book

The book's novel contribution is, as implied by the subtitle, the use of MD analysis for cross-linguistic comparisons. MD analyses are conducted for four languages, each with a very different history and societal circumstances, and the four analyses are then compared and contrasted. The four languages are English, Somali (which has the particular characteristic of only having been a written language since 1973), Korean, and Nukulaelae Tuvaluan, a language spoken by 300-350 people on a group of Pacific islands. (For the latter three, Biber builds on analyses by other authors.) Several of the chapters have a four-part structure, with one stage of the MD process being described for each language in turn.

Chapters 1 and 2 present the case for the MD approach, including a literature survey. Chapter 3 introduces the sociocultural situation of each language.

Chapter 4 presents an extensive argument concerning the difficulties of cross-linguistic studies of variation. Looking closely at English and Somali, Biber considers how apparently equivalent linguistic features in the two languages may, or may not, play the same role within the whole grammatical system of the language; may, or may not, have equivalent communicative functions; and may, or may not, occur frequently in apparently equivalent registers. Neither linguistic features nor registers can be straightforwardly matched across languages. There are also practical difficulties.
Somali words are more morphologically complex than English words, and Somali versions of texts tend to contain 25% fewer words than their English equivalents, so measures of, say, first-person pronouns per thousand words cannot be directly compared for the two languages. The conclusion of the argument is that, before cross-linguistic comparisons can be undertaken, a more sophisticated description of each language in terms of its dimensions of variation is required.

Chapter 5 describes the methodology, which, but for the cross-linguistic slant, will be familiar territory to those who have read Biber (1988). It also describes the registers used to build the corpus, and the set of linguistic features, for each language. Chapter 6 describes and interprets the results of the MD analysis for each language, in rather more detail than any but the most assiduous scholar of the language could wish for. Chapters 7 and 8 cover the synchronic and diachronic cross-linguistic comparisons and are described below.

Chapter 9 discusses another construct made available by MD analysis, the ``text type''. Whereas registers are defined by the situation in which a text occurs and its communicative function(s), text types are defined exclusively by the clustering of linguistic features. The chapter describes the text types of English and Somali, and discusses the matches and mismatches between text types and registers, within and between the two languages. Chapter 10 presents avenues for further research, including the application of MD analysis to sublanguage analysis for NLP.

4. Findings

The synchronic cross-linguistic comparisons involve, for the most part, a comparison of the dimensions identified for each language. Each dimension has been interpreted in terms of the communicative functions it serves, so the task is now to identify whether, or where, there are dimensions with the same functions in different languages.
Some dimensions, such as the ``Abstract style'' dimension in English (marked by passives, the hallmark of scientific articles -- a register particularly well-developed in English), or the ``Honorification'' dimension in Korean, are language-specific. Others have equivalents (in one-to-one, one-to-many, or many-to-many relationships) in other languages. Foremost among these are the oral/literate dimensions. Although the study, along with Biber (1988), establishes that there are no dichotomous relationships between speech and writing, aspects of the distinction are evident in three or more of the highest-rated dimensions for all four languages. Biber justifies his treatment of conversation and `informational exposition' as the two poles of the ``oral/literate'' axis on the grounds that each ``maximally exploits the resources of its mode'' (p. 288).

Biber goes on to break the oral/literate dimensions down into four components: interactiveness (as marked by, e.g., first and second person pronouns), production circumstances, stance (hedges, downtoners, emphatics) and language-specific aspects. For the first three, he finds corresponding dimensions in each of the four languages.

He finds two other equivalences between dimensions across languages. In all four languages there is a dimension closely associated with narration, and in English and Somali there are dimensions associated with argumentation and persuasion. He also uses the MD analysis to sketch similarities and differences between corresponding genres across languages, finding, for instance, that Somali conversations tend to be overtly argumentative to an extent that English ones do not, and that letters tend not to be structurally elaborated in any of the languages.

The diachronic MD analyses are for English and Somali, and lead to some fascinating observations about how different varieties of language change with the advent of written registers.
The analysis for English (with Finegan) covers 350 years, since the advent of a substantial range of written registers of English in the seventeenth century, whereas that for Somali covers the sixteen years from 1973 to 1989, yet Biber finds a similar pattern in both. In the first period following the instigation of written genres, all written genres become more ``literate'', with (for English) more nouns, prepositions, phrasal and relative clauses, and longer sentences and higher type-token ratios. But then, after a century (for English) or five years (for Somali), the ways diverge: ``professional'' registers such as legal and medical prose continue to become more ``literate'', but popular prose, in fiction, newspapers and letters, becomes more ``oral''. Biber relates the first period to the exploration of the potential of writing and the establishment of distinct written registers, and the bifurcation to the advent of mass literacy and a much broader market for written material. While so many factors are hugely different between the languages (including, for instance, the level of social and economic instability in Somalia in the 1980s forcing many journalists to take up several jobs -- leaving them little time for the careful crafting of their prose!) there is an enticing glimpse of what could be general processes.

5. Final comments

The MD research program has been proceeding for over a decade, but as yet the methodology has only been used by Biber and a small group of collaborators. This could be because other readers have not been impressed by the work, or it could just be that the methodology is technically difficult and time-consuming to implement. I suspect the latter. Each stage of the methodology involves a new set of obstacles and skills, and an MD analysis is not lightly undertaken.
But maybe that just serves to remind us that language variation is rarely if ever a simple matter, and unearthing the multiple interactions that go into defining a sublanguage or language variety will inevitably require sophisticated machinery. At present, this is the only such machinery there is.

References

Biber, D. (1988). Variation Across Speech and Writing. Cambridge: CUP.

Grishman, R. & Kittredge, R. (eds.) (1986). Analyzing Language in Restricted Domains: Sublanguage Description and Processing. Hillsdale, NJ: Lawrence Erlbaum.

Kittredge, R. & Lehrberger, J. (eds.) (1982). Sublanguage: Studies of Language in Restricted Semantic Domains. Berlin: De Gruyter.