Web as corpus
=============

The corpus resource for the 1990s was the BNC. Conceived in the 80s and completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography.

But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, for free. While the BNC and other fixed corpora remain of huge value, it is the web that presents the most provocative questions about the nature of language. It also presents a convenient tool for handling and examining text.

Compared to LOB, the BNC is an anarchic object, containing 'texts' from 25 to 250,000 words long, screeds of painfully formulaic entries from the Dictionary of National Biography, conversations monosyllabic and incoherent, sermons, pornography and the electronic discourse of the Leeds United Football Club Fan Club. Compared to the web, the BNC is an English country garden.

Whatever perversities the BNC has, the web has in spades. First, not all documents contain text, and many of those that do are not only text. Second, it changes all the time. Third, like Borges's Library of Babel, it contains duplicates, near duplicates, documents pointing to duplicates that may not be there, and documents that claim to be duplicates but are not. Next, the language has to be identified (and documents may contain mixes of language). Then comes the question of text type: to gain any perspective on the language we have at our disposal in the web, we must classify some of the millions of web pages, and we shall never do so manually, so corpus linguists, and also web search engines, need ways of telling what sort of text a document contains: chat or hate-mail; learned article or bus timetable.

These may sound like arguments for *not* studying the web: for scientific progress, we need to fix certain parameters so we can isolate the features we want to look at, and the web is not a good environment for that. This is true. For the web to be useful for language study, we must address its anarchy. If the web is a torrent and nothing more, it is not useful; for it to be useful, we must channel off manageable quantities to irrigate the pastures of scientific and technological progress.

The D3CI
========

We propose a "Distributed Data Distributed Collection Initiative" (D3CI), a framework for distributed corpora. This will be a set of corpora contributed by anyone with an on-line corpus to offer, where each corpus comes in the form of a set of URLs. The "virtual multicorpus" website will then be a place to visit for anyone wishing to download a corpus of some known language variety. Corpus measures (Kilgarriff 2001) can be used to identify the homogeneity of each submitted corpus. A spidering program can collect one of the corpora and deliver it to a user (a minimal sketch of such a collector is given at the end of this section).

Our medium-term goal is to set up a suite of web-based corpora that can be used by linguists and language technologists to answer questions of the form: "my theory/algorithm/program works well on the text type I developed it for: I wonder how well it generalises to other text types."

The use of the web addresses the hobgoblin of corpus builders: copyright. If material is on the web, it has been published and can be downloaded without infringing copyright. If I wished to store that material, put it on a CD and distribute that CD, I would be infringing copyright.
If I merely present a list of URLs and announce to the world that this URL set comprises a corpus (of a given text type which I also describe) then I am not infringing copyright.\footnote{This does not necessarily hold, as was established in the Napster case, where Napster was taken to court for providing pointers to sites for downloading music files. However, that case, which concerned the music publishing industry, is unlike ours in two respects. Firstly, there, the copyright-owners did not in general wish the material to be freely web-accessible; for the text documents we are concerned with, this will not generally be the case, and where it becomes apparent that the copyright-owner objects, documents can be deleted from corpora. Secondly, there, a sizeable part of the wealth of the music industry was at stake; in our case, there will not be the financial motivation to produce the kind of legal anomaly that the Napster case represents.} There are also no administrative, CD-burning or postage costs associated with web-based corpora.

To the objection that web pages die, so a corpus defined as a set of URLs would be forever shrinking, we propose the following solution. Our virtual corpora are monitored by an agent, which periodically checks that all URLs are still live. On discovering that one no longer is, the agent, which has gathered a statistical profile of each of the pages, sets out to find a new page or pages to replace the deceased. First, it submits a web search, using the terms in the deceased page as search terms. This gathers in a set of candidates. Then, using text similarity measures, it identifies which of the candidates do in fact have the same linguistic characteristics as the deceased. It then adds them to the corpus. The virtual corpus will evolve. (A sketch of such an agent is also given below.)

Some may object: "but that is not suitable for use as a corpus, because the texts that are there today are not identical to those that were there yesterday, so how can we compare results?" Results can be compared because the text type is the same. To demand more is to demand that tomorrow's experiments on the water flow in the River Lune involve the same water molecules as yesterday's.
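As a concrete illustration of the collection step, here is a minimal sketch, in Python, of a spider that takes a virtual corpus defined as a list of URLs and downloads whatever pages are still live. The one-URL-per-line file format, the output file names and the politeness delay are assumptions made for the example, not part of the proposal.

    import time
    import urllib.request
    from pathlib import Path

    def collect_corpus(url_list_file, out_dir, delay=1.0):
        """Fetch every URL in a virtual-corpus file and save the raw pages locally."""
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        # Assumed format: one URL per line in the corpus description file.
        urls = [line.strip() for line in open(url_list_file) if line.strip()]
        for i, url in enumerate(urls):
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    (out / f"doc_{i:05d}.html").write_bytes(response.read())
            except Exception as err:
                # Dead or unreachable page: exactly the 'shrinking corpus' problem.
                print(f"could not fetch {url}: {err}")
            time.sleep(delay)  # be polite to the servers hosting the pages

    if __name__ == "__main__":
        # Hypothetical file names, for illustration only.
        collect_corpus("my_virtual_corpus.urls", "corpus_download")

A user wanting plain text rather than HTML would add a markup-stripping step; that is omitted here for brevity.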
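The refresh agent can be sketched along the same lines. The sketch below assumes a word-frequency count as the page's "statistical profile", plain cosine similarity as the text similarity measure, and an abstract `search` function standing in for whatever search engine the agent would query; all of these are placeholders rather than commitments of the proposal.

    import math
    import re
    import urllib.request
    from collections import Counter

    def profile(text):
        """A very simple 'statistical profile': word-frequency counts."""
        return Counter(re.findall(r"[a-z']+", text.lower()))

    def cosine(p, q):
        """Cosine similarity between two word-frequency profiles."""
        dot = sum(p[w] * q[w] for w in set(p) & set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * \
               math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0

    def fetch(url):
        with urllib.request.urlopen(url, timeout=10) as r:
            return r.read().decode("utf-8", errors="ignore")

    def refresh(urls, profiles, search, threshold=0.5):
        """Replace dead URLs by candidates with a sufficiently similar profile.

        `profiles` maps each URL to the profile gathered while it was live;
        `search` is any caller-supplied function mapping a query string to a
        list of candidate URLs (a placeholder for a real search engine).
        """
        refreshed = []
        for url in urls:
            try:
                fetch(url)                      # still live: keep it
                refreshed.append(url)
                continue
            except Exception:
                pass                            # deceased: look for a stand-in
            dead = profiles[url]
            query = " ".join(w for w, _ in dead.most_common(10))
            for candidate in search(query):
                try:
                    if cosine(profile(fetch(candidate)), dead) >= threshold:
                        refreshed.append(candidate)   # adopt the first good match
                        break
                except Exception:
                    continue
        return refreshed

A real agent would likely prefer one of the corpus-comparison measures of Kilgarriff (2001) to raw cosine over word frequencies, and might adopt several replacement pages where the deceased page was long.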
Related Work
============

We are not the first to note the web's usefulness for corpus research, despite its short history. Since the mid-nineties, the net has commonly been used by summarisation researchers as a source of documents to summarise. In this context, Radev and McKeown (1997) use internet-accessible newswire as a knowledge source for a language generation system.

More recently, researchers have used collections of pages found on the web for a very wide range of purposes. Grefenstette and Nioche (2000) and Jones and Ghani (2000) explore the potential of the web as a source of language corpora for languages where electronic resources are in short supply, and Resnik (1999) as a source of bilingual parallel corpora. Fujii and Ishikawa (2000) use the web to generate encyclopedia entries. Grefenstette (1999) presents prospects and experiments regarding the web as a source of lexical information; as the web provides thousands of contextualised instances of even fairly rare words, for many languages, it offers vast opportunities for the automatic distillation of lexical entries from empirical evidence. Varantola (2000) pursues a similar theme, showing how translators, when confronted with a rare term, can find ample evidence of the term, its contexts, and associated vocabulary, through the simple use of a search engine.

Specialised `lexicographic' search engines have been produced (see http://www.webcorp.org.uk), though their merits relative to, e.g., Google (which provides some linguistic context for each occurrence of the word, all at breathtaking speed) remain an open question. Mihalcea and Moldovan (1999) and Agirre and Martinez (2000) use the web as a lexical resource, and as a source of test data, for Word Sense Disambiguation. Jacquemin and Bush (2000) use it as a source for harvesting lists of named entities. There has recently been a Web track in the TREC Information Retrieval Competition (see http://pastime.anu.edu.au/WAR/webtrax.html).

A field such as this, new and with no entry costs, is immediately appealing to students and others, and the list above is of course incomplete. It does indicate how fast the use of the web as a corpus is taking off.

Conclusion
==========

To conclude: the BNC was one of the greatest innovations for linguistics in the 1990s. Now the world has moved on. As corpus linguists, we are in the fortunate position of having a particular perspective and channel of attack for examining the web -- perhaps the most extraordinary phenomenon of our time -- which also just happens to provide solutions to many of our practical problems and an endless stream of new data. We have presented a model which uses the web as a source of data, the web as a delivery medium, and in which the web, and its language, are objects to be explored. The corpus of the new millennium is the web.

References
==========

Agirre, Eneko and David Martinez. Exploring automatic word sense disambiguation with decision lists and the web. Proc. COLING Workshop on Semantic Annotation and Intelligent Content, Saarbrücken, Germany. August 2000.

Fujii, Atsushi and Tetsuya Ishikawa. Utilizing the world wide web as an encyclopedia: Extracting term descriptions from semi-structured text. Proc. 38th Meeting of the ACL, Hong Kong. October 2000. Pages 488-495.

Grefenstette, Gregory. The WWW as a Resource for Example-Based MT Tasks. Invited Talk, ASLIB `Translating and the Computer' conference, London. October 1999.

Grefenstette, Gregory and Julien Nioche. Estimation of English and non-English Language Use on the WWW. Proc. RIAO (Recherche d'Informations Assistee par Ordinateur), Paris, 2000.

Jacquemin, Christian and Caroline Bush. Combining Lexical and Formatting Clues for Named Entity Acquisition from the Web. Proc. Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora, Hong Kong. October 2000. Pages 181-189.

Jones, Rosie and Rayid Ghani. Automatically building a corpus for a minority language from the web. Proc. Student Research Workshop, 38th Meeting of the ACL, Hong Kong. October 2000. Pages 29-36.

Kilgarriff, Adam. Comparing Corpora. International Journal of Corpus Linguistics 6 (1), 2001. Pages 1-37.

Mihalcea, Rada and Dan Moldovan. A method for word sense disambiguation of unrestricted text. Proc. 37th Meeting of the ACL, Maryland, USA. June 1999. Pages 152-158.

Radev, Dragomir and Kathleen McKeown. Building a generation knowledge source using internet-accessible newswire. Proc. Fifth Conference on Applied Natural Language Processing, Washington D.C. April 1997. Pages 221-228.

Resnik, Philip. Mining the web for bilingual text. Proc. 37th Meeting of the ACL, Maryland, USA. June 1999. Pages 527-534.

Varantola, Krista. Translators and disposable corpora. Proc. CULT (Corpus Use and Learning to Translate), Bertinoro, Italy. November 2000.