Don't be a dictionary dentist
=============================

Dentists show a marked tendency to judge people in terms of their teeth. This quirk is full of charm in its place, yet it runs the risk of giving only summary attention to other aspects of character. Computer scientists show a similar inclination when looking at dictionaries. Formalism, mark-up, inheritance relations, maybe syntax codes they understand, so it is these aspects they attend to when assessing a dictionary. Aspects further from their expertise tend to be ignored.

Consider the following advertisement for a post on a computational lexicography project:

    Requirements:
    - university degree in computational linguistics or computer science;
    - either Italian or English as a mother language and fluency in the
      other language;
    - programming skills in at least one of the following languages:
      Lisp, Prolog, C, Tcl/Tk, Java, Perl.

    The following will be considered a plus:
    - strong background in lexical semantics;
    - good knowledge of internet tools (HTML, CGI, etc.).

The advertisement was evidently written by someone who knows more about computation than lexicography. Relevant computational skills are spelled out, whereas native-speaker competence is apparently adequate for the lexicographic half. A literature degree involves close examination of the use of language, yet here it is effectively a disqualification, since holders of literature degrees will rarely also hold computing degrees. "Strong background in lexical semantics considered a plus" is an attempted remedy, but lexical semantics is an obscure academic sub-discipline, and no mention is made of lexicography, translating, interpreting or writing, all of which give intimate experience of the conundrums of word meaning.

It is now widely acknowledged that Language Engineering (LE) needs lexicons, and the EU has supported a great deal of lexicon development. This is good, but the focus on producing lexicons has not been matched by attention to their quality.

Semantics and the rest
======================

Lexicographers do not generally find the aspects of their job other than the analysis and description of meaning particularly challenging. In a recent survey, lexicographers classified as "hardest" those aspects of their job concerned with meaning, with syntax and morphology coming well down the list.

The computer scientists might say to all this, "Why should we care about analysis of meaning when our programs cannot use subtle semantic information? We can only use the formalised items -- syntax and domain codes, synonymy and hierarchy relations -- so we don't need to care about what you call 'quality'."

This is firstly wrong -- because flawed analysis will introduce garbage into the lexical relations, or into the mappings between syntactic elements and semantic roles -- and secondly shortsighted. Dictionaries are being explored, and lexical databases developed, now for use for many years to come: over the years, LE systems will come to exploit more and more of the semantic analysis a dictionary contains, where those analyses are present and accurate.

"Semantics?" you may say, "of course we are interested in semantics! Our whole system is based on Montague/Davidson/DRT." Sad to say, the word has two meanings. The one found in lexicography departments bears no relation to the "Journal of Semantics" one.
The lexicographer is interested in distinguishing religious from other enthusiasms (Hanks, EURALEX '98), or in 'virtual' as applied to corpora and corporations (Meyer et al., EURALEX '98), and will find nothing whatever of interest in "Three boys ate four oranges". Of course there are some topics of interest to both communities, but it remains painful to see how far the two communities lack a common language.

Proposals
=========

If you are persuaded, you will probably be wondering, "But how can I know which dictionaries are better?" or "How can we improve the quality of the lexical resource we are developing?" The question is central and not easily answered. Academic reviews are sometimes useful (whereas popular ones are just irritating, attending only to new words and swear words). A recent ELRA report on "lexicon validation" provides a helpful methodology, though, since it reflects the current use of dictionaries in LE, it is focussed on syntax rather than semantics. See http://www.icp.grenet.fr/ELRA/validat.html

Assessing a dictionary's quality involves looking closely at its analyses of meaning. The assessor will not want to look at all entries, so some attention should be paid to sampling. Each word in the sample will need close scrutiny, so a sample of fifty polysemous items is suitable. It should cover all salient word classes, 'difficult' words and 'easy' words, words from A and words from Z. (Most dictionaries are compiled alphabetically, and there is a substantial possibility that personnel, policies and quality vary across the alphabet.) A sketch of one way such a sample might be drawn appears at the end of this piece.

The assessors will need to spot bad analyses. This is not easy (and, of course, cannot be done by computer) and calls for humanities rather than scientific skills. Particularly useful, inevitably, is experience of lexicography. ("Lexicography Masterclass" -- see http://ds.dial.pipex.com/town/lane/ae345/ -- may be of interest here.)

The sample entries will need comparison with entries in other dictionaries, and also with other entries in the same lexical set, to see whether, e.g., 'breakfast', 'lunch' and 'supper' get equivalent definitions -- and, if not, whether the difference is motivated. Another telling exercise is to take a set of corpus instances for the word and to see how readily each can be allocated to a dictionary sense. There are many reasons why this is often hard, but it certainly succeeds in putting the spotlight on lexicographic quality.

In sum, fear not. There are systematic and productive ways to look at lexicographic quality. Much work lies ahead, hard work, but fascinating. Just don't be dictionary dentists.

With thanks to Marie-Helene Correard for conversations that helped forge the ideas (and apologies to dentists). A version of this piece appeared in the electronic newsletter LE Journal in Summer 1998.
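The sampling sketch promised above follows. It is a minimal illustration, not a tool: it assumes the dictionary's headword list can be exported as a comma-separated file with lemma, word-class and sense-count columns, and the file name, column names, polysemy threshold and two-way split of the alphabet are all assumptions made for the example. The scrutiny of the sampled entries, of course, remains human work.

    # A minimal sketch of drawing a balanced assessment sample from a
    # dictionary headword list. The input (headwords.csv with columns
    # lemma, pos, n_senses) is a hypothetical export, not any real
    # dictionary's interface.
    import csv
    import random

    SAMPLE_SIZE = 50   # fifty polysemous items, as suggested above
    MIN_SENSES = 2     # "polysemous": more than one sense

    def load_headwords(path):
        with open(path, newline="", encoding="utf-8") as f:
            return [row for row in csv.DictReader(f)
                    if int(row["n_senses"]) >= MIN_SENSES]

    def draw_sample(headwords, seed=0):
        # Stratify by word class and by position in the alphabet, so the
        # sample covers all salient word classes, words from A and words
        # from Z alike.
        rng = random.Random(seed)
        strata = {}
        for row in headwords:
            half = "A-M" if row["lemma"][:1].upper() <= "M" else "N-Z"
            strata.setdefault((row["pos"], half), []).append(row)
        per_stratum = max(1, SAMPLE_SIZE // len(strata))
        sample = []
        for rows in strata.values():
            sample.extend(rng.sample(rows, min(per_stratum, len(rows))))
        return sample[:SAMPLE_SIZE]

    if __name__ == "__main__":
        for row in draw_sample(load_headwords("headwords.csv")):
            print(row["lemma"], row["pos"], row["n_senses"])

Whether fifty items drawn this way are representative is itself a judgment call; the point is only that the sample be drawn deliberately rather than by opening the dictionary at a familiar page.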