Question Answering in TREC

Ellen M. Voorhees
National Institute of Standards and Technology
Gaithersburg, MD 20899
[email protected]

Traditional text retrieval systems return a ranked list of documents in response to a user's request. While a ranked list of documents can be an appropriate response for the user, frequently it is not. Usually it would be better for the system to provide the answer itself instead of requiring the user to search for the answer in a set of documents. The Text REtrieval Conference (TREC) is sponsoring a question answering "track" to foster research on the problem of retrieving answers rather than document lists.

TREC is a workshop series sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense [7]. The purpose of the conference series is to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing results. The conference has focused primarily on the traditional IR problem of retrieving a ranked list of documents in response to a statement of information need, but has also included other tasks, called tracks, that focus on new areas or particularly difficult aspects of information retrieval. A question answering track was introduced in TREC-8 (1999). The track has generated widespread interest in the QA problem [2, 3, 4], and has documented significant improvements in question answering system effectiveness in its two-year history. This paper provides a brief summary of the findings of the TREC question answering track to date and discusses the future directions of the track. The paper is extracted from a fuller description of the track given in "The TREC Question Answering Track" [8]. Complete details about the TREC question answering track can be found in the TREC proceedings.

1. The TREC Question Answering Task

"Question answering" covers a broad range of activities from simple yes/no responses for true-false questions to the presentation of complex results synthesized from multiple data sources. The specific task in the TREC track was to return text snippets drawn from a large corpus of newspaper articles in response to fact-based, short-answer questions such as How many calories are there in a Big Mac? and Where is the Taj Mahal? The TREC task was restricted in that only closed-class questions were used, yet the subject domain was essentially unconstrained since the document set was newspaper articles.

Participants were given the document collection and a test set of questions. Each question was guaranteed to have at least one document in the collection that explicitly answered it. Participants returned a ranked list of five [document-id, answer-string] pairs per question such that each answer string was believed to contain an answer to the question. Answer strings were limited to either 50 or 250 bytes depending on the run type, and could either be extracted from the corresponding document or automatically generated from information contained in the document. Human assessors read each string and decided whether the string actually did contain an answer to the question in the context provided by the document. Given a set of judgments for the strings, the score computed for a submission was the mean reciprocal rank. An individual question received a score equal to the reciprocal of the rank at which the first correct response was returned, or 0 if none of the five responses contained a correct answer. The score for a submission was then the mean of the individual questions' reciprocal ranks.
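For concreteness, the scoring rule can be written in a few lines. The sketch below, in Python, assumes the assessment of each question has already been reduced to the rank of the first correct response (or None when no response was judged correct); the function name and data layout are illustrative and not part of any official TREC scoring code.

    def mean_reciprocal_rank(first_correct_ranks):
        """MRR = mean over questions of 1/rank of the first correct response,
        counting 0 for questions with no correct response among the five returned."""
        scores = [1.0 / r if r is not None else 0.0 for r in first_correct_ranks]
        return sum(scores) / len(scores) if scores else 0.0

    # Example: correct answers found at ranks 1, 3, and 2 for three questions,
    # and no correct answer returned for the fourth.
    print(mean_reciprocal_rank([1, 3, 2, None]))  # (1 + 1/3 + 1/2 + 0) / 4 = 0.458...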
Table 1 shows the data that was used in the two years of the track. The TREC-9 track used both a larger document set and a larger test set of questions than the TREC-8 track. A more substantive difference was the source of the questions used in the different years. The majority of the questions used in the TREC-8 track were developed specifically for the track. These questions were often back-formulations of statements in the documents, which made the questions somewhat unnatural and also made the task easier since the target document contained most of the question words. For the TREC-9 track, NIST obtained two query logs (one an Encarta log from Microsoft and the other an Excite log) and used those as a source of questions. NIST assessors checked whether each candidate question had an answer in the document collection, and a candidate question was discarded if no answer was found.

                            TREC-8                                  TREC-9
number of documents         528,000                                 979,000
megabytes of document text  1904                                    3033
document sources            TREC disks 4-5                          news from TREC disks 1-5
number of questions         200                                     693
question sources            FAQ Finder log, assessors, participants Encarta log, Excite log

Table 1: Data used in the two TREC question answering tracks.

In many evaluations of natural language processing tasks, application experts create a gold-standard answer key that is assumed to contain all possible correct responses. An absolute score for a system's response is computed by measuring the difference between the response and the answer key. For text retrieval, however, different people are known to have different opinions about whether or not a given document should be retrieved for a query [5], so a single list of relevant documents cannot be created. Instead, the list of relevant documents produced by one person (the assessor) is used as an example of a correct response, and systems are evaluated on the sample. While the absolute scores of systems change when different assessors are used, relative scores generally remain stable [9].

A sub-goal of the TREC-8 QA track was to investigate whether different people have different opinions as to what constitutes an acceptable answer, and, if so, how those differences affect QA evaluation. To accomplish this goal, each question was independently judged by three different assessors. The separate judgments were combined into a single judgment set through adjudication for the official track evaluation, but the individual judgments were used to measure the effect of differences in judgments on systems' scores.

The results of this assessment process demonstrated that assessors do have legitimate differences of opinion as to what constitutes an acceptable answer even for the deliberately constrained questions used in the track. Two prime examples of where such differences arise are the completeness of names and the granularity of dates and locations. Fortunately, as with document retrieval evaluation, the relative scores between QA systems remain stable despite differences in the judgments used to evaluate them [10]. The lack of a definitive answer key does mean that evaluation scores are only meaningful in relation to other scores on the same data set, but this is unavoidable. If assessors' opinions of correctness differ, the eventual end users of the QA systems will have similar differences of opinion, and an evaluation of the technology must accommodate these differences.
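The stability claim can be checked mechanically by scoring every run once per judgment set and comparing the resulting orderings of systems. The sketch below uses invented runs and invented judgment sets to show the shape of such a check; it is an illustration of the idea, not the analysis actually performed for the track.

    def mrr(run, judged_correct):
        """Score one run (question -> ranked answer strings) against one assessor's
        set of (question, answer string) pairs judged correct."""
        total = 0.0
        for question, answers in run.items():
            for rank, answer in enumerate(answers, start=1):
                if (question, answer) in judged_correct:
                    total += 1.0 / rank
                    break
        return total / len(run)

    # Invented runs and two invented assessors who disagree on one string for Q2.
    runs = {
        "sysA": {"Q1": ["answer-a"], "Q2": ["answer-c"]},
        "sysB": {"Q1": ["answer-b", "answer-a"], "Q2": ["answer-d"]},
    }
    lenient = {("Q1", "answer-a"), ("Q2", "answer-c"), ("Q2", "answer-d")}
    strict = {("Q1", "answer-a"), ("Q2", "answer-c")}

    for label, judged in [("lenient", lenient), ("strict", strict)]:
        scores = {name: mrr(run, judged) for name, run in runs.items()}
        ordering = sorted(scores, key=scores.get, reverse=True)
        print(label, scores, ordering)  # absolute scores shift; the ordering here does not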
2. Track Results

Twenty different organizations participated in the TREC-8 question answering track, submitting 20 runs using the 50-byte limit and 25 runs using the 250-byte limit. For TREC-9, 28 organizations participated, submitting 34 runs using the 50-byte limit and 44 runs using the 250-byte limit. Not surprisingly, allowing 250 bytes in a response was an easier task than limiting responses to 50 bytes: for every organization that submitted runs of both lengths, the 250-byte limit run had a higher mean reciprocal rank. The relatively simple bag-of-words approaches that are successfully used in text retrieval are also useful for retrieving answer passages as long as 250 bytes, but are not sufficient for extracting specific, fact-based answers [6]. For the 50-byte limit runs, the best performing systems were able to answer about 70% of the questions in TREC-8 and about 65% of the questions in TREC-9. While the 65% score is a slightly worse result than the TREC-8 score in absolute terms, it represents a very significant improvement in question answering systems since the TREC-9 question set was much more difficult.

Recall that the TREC-9 questions were actual users' questions rather than questions constructed specifically for the track. The motivation for using "real" questions was the belief that constructed questions are easier for QA systems because the question and answer document share the same vocabulary. While this is true, the difference between the TREC-8 and TREC-9 question sets was larger than just vocabulary issues. Real users ask vague questions such as Who is Colin Powell? and Where do lobsters like to live? that are substantially more difficult for the systems to answer.

Most participants used a version of the following general approach to the question answering problem. The system first attempted to classify a question according to its answer type as suggested by its question word. For example, a question beginning with "when" implies a time designation is needed. Next, the system retrieved a small portion of the document collection using standard text retrieval technology and the question as the query. The system performed a shallow parse of the returned documents to detect entities of the same type as the answer. If an entity of the required type was found sufficiently close to the question's words, the system returned that entity as the response. If no appropriate answer type was found, the system fell back to best-matching-passage techniques. Improvements in TREC-9 systems generally resulted from doing a better job of classifying questions as to the expected answer type, and using a wider variety of methods for finding the entailed answer types in retrieved passages.
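The sketch below is a deliberately crude rendering of that general approach: a tiny answer-type table, word-overlap retrieval, and regular expressions stand in for the real retrieval engines and shallow parsers participants used, and every pattern and threshold in it is an illustrative assumption rather than anything prescribed by the track.

    import re

    # Toy components standing in for the real ones each participating system built.
    ANSWER_TYPES = {"when": "DATE", "where": "LOCATION", "who": "PERSON", "how many": "NUMBER"}

    ENTITY_PATTERNS = {
        "NUMBER": r"\b\d[\d,.]*\b",
        "DATE": r"\b1[0-9]{3}\b|\b20[0-9]{2}\b",                 # bare years only
        "PERSON": r"\b[A-Z][a-z]+ [A-Z][a-z]+\b",
        "LOCATION": r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)*\b",         # any capitalized phrase
    }

    def classify(question):
        """Guess the expected answer type from the question word, if any."""
        q = question.lower()
        for cue, answer_type in ANSWER_TYPES.items():
            if q.startswith(cue):
                return answer_type
        return None

    def retrieve(question, documents, k=3):
        """Rank documents by word overlap with the question (stand-in for real IR)."""
        q_words = set(re.findall(r"\w+", question.lower()))
        return sorted(documents,
                      key=lambda d: -len(q_words & set(re.findall(r"\w+", d.lower()))))[:k]

    def answer(question, documents, window=50):
        expected = classify(question)
        q_words = set(re.findall(r"\w+", question.lower()))
        if expected is not None:
            for doc in retrieve(question, documents):
                for match in re.finditer(ENTITY_PATTERNS[expected], doc):
                    entity_words = set(re.findall(r"\w+", match.group().lower()))
                    if entity_words <= q_words:
                        continue  # the entity merely repeats the question
                    # Accept an entity of the expected type that lies near question words.
                    context = doc[max(0, match.start() - window): match.end() + window].lower()
                    if q_words & set(re.findall(r"\w+", context)):
                        return match.group()
        # No appropriate answer type, or no nearby entity: fall back to the
        # best-matching passage, truncated to the 250-byte response limit.
        ranked = retrieve(question, documents, k=1)
        return ranked[0][:250] if ranked else ""

    docs = ["The Taj Mahal stands in Agra, a city on the banks of the Yamuna river."]
    print(answer("Where is the Taj Mahal?", docs))  # -> "Agra"

Real systems replaced each of these stand-ins with far more sophisticated machinery, but the control flow (type the question, retrieve, tag entities, check proximity, fall back to a passage) is the part the paragraph above describes.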
3. Question Answering Test Collections

The primary way TREC has been successful in improving document retrieval performance is by creating appropriate test collections for researchers to use when developing their systems. While creating a large collection can be time-consuming and expensive, once it is created researchers can automatically evaluate the effectiveness of a retrieval run. One of the key goals of the QA track was to build a reusable QA test collection, that is, to devise a means to evaluate a QA run that uses the same document and question sets but was not among the runs judged by the assessors. Unfortunately, the judgment sets produced by the assessors for the TREC QA track do not constitute a reusable test collection because the unit that is judged is the entire answer string. Different QA runs very seldom return exactly the same answer strings, and it is quite difficult to determine automatically whether the difference between a new string and a judged string is significant with respect to the correctness of the answer.

As an approximate solution to this problem, NIST created a set of Perl string-matching patterns from the set of strings that the assessors judged correct [10]. An answer string that matches any pattern for its question is marked correct, and is marked incorrect otherwise. The patterns have been created such that almost all strings that were judged correct would be marked correct, sometimes at the expense of marking as correct strings that were judged incorrect. Patterns are constrained to match at word boundaries and case is ignored. An average of 1.7 patterns per question was created for the TREC-8 test set, with 65% of the questions having a single pattern. The TREC-9 set averaged 3.5 patterns per question with only 45% of the questions having a single pattern. The increase in the number of patterns per question for the TREC-9 set is another indication that the TREC-9 test set was more difficult.

Using the patterns to evaluate the TREC-8 runs produced differences in the relative scores of different systems that were comparable to the differences caused by using different human assessors, though the differences in relative scores for TREC-9 were somewhat larger. Furthermore, unlike the different judgments among human assessors, patterns misjudge broad classes of responses, classes that are precisely the cases that are difficult for the original systems being evaluated. Thus, the patterns are not a true solution to the problem of building a reusable test collection for question answering. Nonetheless, the patterns can be useful for providing quick feedback as to the relative quality of alternative question answering techniques.
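The judging rule itself is easy to reproduce. In the sketch below the patterns are invented examples rather than the pattern files NIST distributed, but the matching behavior follows the description above: an answer string counts as correct if any pattern for its question matches, anchored at word boundaries and ignoring case.

    import re

    # Invented example patterns; NIST's actual pattern files are not reproduced here.
    PATTERNS = {
        "Where is the Taj Mahal?": [r"Agra", r"Uttar\s+Pradesh"],
    }

    def judge(question, answer_string):
        """Return True if the answer string matches any pattern for its question."""
        for pattern in PATTERNS.get(question, []):
            if re.search(r"\b" + pattern + r"\b", answer_string, flags=re.IGNORECASE):
                return True
        return False

    print(judge("Where is the Taj Mahal?", "the mausoleum at AGRA built by Shah Jahan"))  # True
    print(judge("Where is the Taj Mahal?", "a famous monument in New Delhi"))             # False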
4. The Future

Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. As the first large-scale evaluation of domain-independent question answering systems, the TREC question answering track looks to bring the benefits of large-scale evaluation to bear on the question answering task. A roadmap for question answering research was recently developed under the auspices of the DARPA TIDES project [1]. The roadmap describes a highly ambitious program to increase the complexity of the types of questions that can be answered, the diversity of sources from which the answers can be drawn, and the means by which answers are displayed. The roadmap also includes a five-year plan for introducing aspects of these research areas as subtasks of the TREC QA track.

The QA track in TREC 2001 (TREC-10) will include the first steps of the roadmap. The main task in the track will be similar to the task used in TRECs 8 and 9, but there will be no guarantee that an answer is actually contained in the corpus. Recognizing that the answer is not available is challenging, but it is an important ability for operational systems to possess since returning an incorrect answer is usually worse than not returning an answer at all. The track will also contain a list subtask in which each question asks for a specific number of instances to be returned. The questions have been constructed such that retrieving the target number of instances requires collating information from more than one document. For example, a list question such as Name the countries the Pope visited in 1994. requires finding multiple documents that describe the Pope's visits and extracting the country from each. The system must also detect duplicate reports of the same visit so that countries are listed only once per visit.

5. References

[1] Sanda Harabagiu, John Burger, Claire Cardie, Vinay Chaudhri, Robert Gaizauskas, David Israel, Christian Jacquemin, Chin-Yew Lin, Steve Maiorano, George Miller, Dan Moldovan, Bill Ogden, John Prager, Ellen Riloff, Amit Singhal, Rohini Shrihari, Tomek Strzalkowski, Ellen Voorhees, and Ralph Weischedel. Issues, tasks, and program structures to roadmap research in question & answering (Q&A), October 2000. http://www-nlpir.nist.gov/projects/duc/roadmapping.html.

[2] Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Gîrju, Vasile Rus, and Paul Morarescu. The role of lexico-semantic feedback in open-domain textual question-answering. In Proceedings of the Association for Computational Linguistics, pages 274-281, July 2001.

[3] Abraham Ittycheriah, Martin Franz, Wei-Jing Zhu, Adwait Ratnaparkhi, and Richard J. Mammone. Question answering using maximum entropy components. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pages 33-39, June 2001.

[4] John Prager, Eric Brown, Anni Coden, and Dragomir Radev. Question-answering by predictive annotation. In Proceedings of the Twenty-Third Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 184-191, July 2000.

[5] Linda Schamber. Relevance and information behavior. Annual Review of Information Science and Technology, 29:3-48, 1994.

[6] Amit Singhal, Steve Abney, Michiel Bacchiani, Michael Collins, Donald Hindle, and Fernando Pereira. AT&T at TREC-8. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), 2000. NIST Special Publication 500-246. Electronic version available at http://trec.nist.gov/pubs.html.

[7] The Text REtrieval Conference web site. http://trec.nist.gov.

[8] Ellen M. Voorhees. The TREC question answering track. Journal of Natural Language Engineering. To appear.

[9] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36:697-716, 2000.

[10] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In Proceedings of the Twenty-Third Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200-207, July 2000.