Question Answering in TREC

Ellen M. Voorhees
National Institute of Standards and Technology
Gaithersburg, MD 20899
[email protected]
Traditional text retrieval systems return a ranked list of documents in response to a user's request. While a ranked list of documents can be an appropriate response for the user, frequently it is not. Usually it would be better for the system to provide the answer itself instead of requiring the user to search for the answer in a set of documents. The Text REtrieval Conference (TREC) is sponsoring a question answering "track" to foster research on the problem of retrieving answers rather than document lists.
TREC is a workshop series sponsored by the National Institute of Standards and Technology and the U.S. Department of Defense [7]. The purpose of the conference series is to encourage research on text retrieval for realistic applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing results. The conference has focused primarily on the traditional IR problem of retrieving a ranked list of documents in response to a statement of information need, but has also included other tasks, called tracks, that focus on new areas or particularly difficult aspects of information retrieval. A question answering track was introduced in TREC-8 (1999). The track has generated widespread interest in the QA problem [2, 3, 4], and has documented significant improvements in question answering system effectiveness in its two-year history.
This paper provides a brief summary of the findings of the TREC question answering track to date and discusses the future directions of the track. The paper is extracted from a fuller description of the track given in "The TREC Question Answering Track" [8]. Complete details about the TREC question answering track can be found in the TREC proceedings, available from the TREC web site [7].
The TREC Question Answering Task
"Question answering" covers a broad range of activities, from simple yes/no responses for true-false questions to the presentation of complex results synthesized from multiple data sources. The specific task in the TREC track was to return text snippets drawn from a large corpus of newspaper articles in response to fact-based, short-answer questions such as How many calories are there in a Big Mac? and Where is the Taj Mahal?. The TREC task was restricted in that only closed-class questions were used, yet the subject domain was essentially unconstrained since the document set was newspaper articles.
Participants were given the document collection and a test set of questions. Each question was guaranteed to have at least one document in the collection that explicitly answered it. Participants returned a ranked list of five [document-id, answer-string] pairs per question such that each answer string was believed to contain an answer to the question. Answer strings were limited to either 50 or 250 bytes depending on the run type, and could either be extracted from the corresponding document or automatically generated from information contained in the document. Human assessors read each string and decided whether the string actually did contain an answer to the question in the context provided by the document. Given a set of judgments for the strings, the score computed for a submission was the mean reciprocal rank. An individual question received a score equal to the reciprocal of the rank at which the first correct response was returned, or 0 if none of the five responses contained a correct answer. The score for a submission was then the mean of the individual questions' reciprocal ranks.
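The mean reciprocal rank measure described above can be sketched as follows. This is an illustrative reimplementation, not the track's actual scoring code; the function names and sample judgments are invented.

```python
def reciprocal_rank(judgments):
    """Score one question. judgments holds the assessors' correctness
    flags for the five returned [document-id, answer-string] pairs,
    in rank order."""
    for rank, correct in enumerate(judgments, start=1):
        if correct:
            return 1.0 / rank
    return 0.0  # none of the five responses contained a correct answer


def mean_reciprocal_rank(all_judgments):
    """Score a submission: the mean of the per-question reciprocal ranks."""
    return sum(reciprocal_rank(j) for j in all_judgments) / len(all_judgments)


# A correct answer at rank 1, one at rank 3, and one unanswered question:
score = mean_reciprocal_rank([
    [True, False, False, False, False],   # reciprocal rank 1.0
    [False, False, True, False, False],   # reciprocal rank 1/3
    [False, False, False, False, False],  # reciprocal rank 0.0
])
```

Note that the measure rewards only the highest-ranked correct response; additional correct answers lower in the list do not change a question's score.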
Table 1 shows the data that was used in the two years of the track. The TREC-9 track used both a larger document set and a larger test set of questions than the TREC-8 track. A more substantive difference was the source of the questions used in the different years. The majority of the questions used in the TREC-8 track were developed specifically for the track. These questions were often back-formulations of statements in the documents, which made the questions somewhat unnatural and also made the task easier since the target document contained most of the question words. For the TREC-9 track, NIST obtained two query logs (one an Encarta log from Microsoft and the other an Excite log) and used those as a source of questions. NIST assessors checked whether each candidate question had an answer in the document collection, and a candidate question was discarded if no answer was found.
In many evaluations of natural language processing tasks, application experts create a gold-standard answer key that is assumed to contain all possible correct responses. An absolute score for a system's response is computed by measuring the difference between the response and the answer key. For text retrieval, however, different people are known to have different opinions about whether or not a given document should be retrieved for a query [5], so a single list of relevant documents cannot be created. Instead, the list of relevant documents produced by one person (the assessor) is used as an example of a correct response, and systems are evaluated on the sample. While the absolute scores of systems change when different assessors are used, relative scores generally remain stable [9].

[Table 1: Data used in the two TREC question answering tracks. For each year the table gives the number of documents, the megabytes of document text, the document sources (TREC-8: TREC disks 4-5; TREC-9: news from TREC disks 1-5), the number of questions, and the question sources (TREC-8: FAQ Finder log and assessors; TREC-9: Encarta log and Excite log).]
A sub-goal of the TREC-8 QA track was to investigate whether different people have different opinions as to what constitutes an acceptable answer, and, if so, how those differences affect QA evaluation. To accomplish this goal, each question was independently judged by three different assessors. The separate judgments were combined into a single judgment set through adjudication for the official track evaluation, but the individual judgments were used to measure the effect of differences in judgments on systems' scores.
The results of this assessment process demonstrated that assessors do have legitimate differences of opinion as to what constitutes an acceptable answer, even for the deliberately constrained questions used in the track. Two prime examples of where such differences arise are the completeness of names and the granularity of dates and locations. Fortunately, as with document retrieval evaluation, the relative scores between QA systems remain stable despite differences in the judgments used to evaluate them [10]. The lack of a definitive answer key does mean that evaluation scores are only meaningful in relation to other scores on the same data set, but this is unavoidable. If assessors' opinions of correctness differ, the eventual end users of the QA systems will have similar differences of opinion, and an evaluation of the technology must accommodate these differences.
In absolute terms, the best TREC-9 score is slightly worse than the best TREC-8 score, but it represents a very significant improvement in question answering systems since the TREC-9 question set was much more difficult. Recall that the TREC-9 questions were actual users' questions rather than questions constructed specifically for the track. The motivation for using "real" questions was the belief that constructed questions are easier for QA systems because the question and answer document share the same vocabulary. While this is true, the difference between the TREC-8 and TREC-9 question sets was larger than just vocabulary issues. Real users ask vague questions such as Who is Colin Powell? and Where do lobsters like to live? that are substantially more difficult for the systems to answer.
Most participants used a version of the following general approach to the question answering problem. The system first attempted to classify a question according to its answer type as suggested by its question word. For example, a question beginning with "when" implies a time designation is needed. Next, the system retrieved a small portion of the document collection using standard text retrieval technology and the question as the query. The system performed a shallow parse of the returned documents to detect entities of the same type as the answer. If an entity of the required type was found sufficiently close to the question's words, the system returned that entity as the response. If no appropriate answer type was found, the system fell back to best-matching-passage techniques. Improvements in TREC-9 systems generally resulted from doing a better job of classifying questions as to the expected answer type, and using a wider variety of methods for finding the entailed answer types in retrieved passages.
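The general approach can be sketched as follows. This is a minimal caricature of the pipeline, not any participant's system: the answer-type table, the toy regular-expression "entity detectors", and the assumption that retrieval has already produced ranked passages are all simplifications introduced for illustration.

```python
import re

# Illustrative mapping from question cue word to expected answer type.
ANSWER_TYPES = {"when": "TIME", "where": "LOCATION", "who": "PERSON", "how many": "NUMBER"}

# Toy entity detectors standing in for a shallow parse of retrieved documents.
ENTITY_PATTERNS = {
    "TIME": re.compile(r"\b1[0-9]{3}\b|\b20[0-9]{2}\b"),  # a year, as a stand-in for a date
    "NUMBER": re.compile(r"\b\d[\d,]*\b"),
}


def classify_question(question):
    """Guess the expected answer type from the question word."""
    q = question.lower()
    for cue, answer_type in ANSWER_TYPES.items():
        if q.startswith(cue):
            return answer_type
    return None


def answer(question, passages):
    """passages: top-ranked passages from a standard text retrieval run,
    using the question as the query."""
    answer_type = classify_question(question)
    if answer_type in ENTITY_PATTERNS:
        for passage in passages:
            match = ENTITY_PATTERNS[answer_type].search(passage)
            if match:
                # An entity of the required type found in a highly ranked passage.
                return match.group()
    # No usable answer type: fall back to best-matching-passage techniques.
    return passages[0] if passages else None
```

Real systems performed genuine named-entity recognition and measured the proximity of candidate entities to the question's words; the sketch only shows where those components sit in the pipeline.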
Track Results
Twenty different organizations participated in the TREC-8 question answering track, submitting 20 runs using the 50-byte limit and 25 runs using the 250-byte limit. For TREC-9, 28 organizations participated, submitting 34 runs using the 50-byte limit and 44 runs using the 250-byte limit. Not surprisingly, allowing 250 bytes in a response was an easier task than limiting responses to 50 bytes: for every organization that submitted runs of both lengths, the 250-byte limit run had a higher mean reciprocal rank. The relatively simple bag-of-words approaches that are successfully used in text retrieval are also useful for retrieving answer passages as long as 250 bytes, but are not sufficient for extracting specific, fact-based answers [6].
For the 50-byte limit runs, the best performing systems were able to answer about 70% of the questions in TREC-8 and about 65% of the questions in TREC-9.
Question Answering Test Collections
The primary way TREC has been successful in improving document retrieval performance is by creating appropriate test collections for researchers to use when developing their systems. While creating a large collection can be time-consuming and expensive, once it is created researchers can automatically evaluate the effectiveness of a retrieval run. One of the key goals of the QA track was to build a reusable QA test collection; that is, to devise a means to evaluate a QA run that uses the same document and question sets but was not among the runs judged by the assessors. Unfortunately, the judgment sets produced by the assessors for the TREC QA track do not constitute a reusable test collection because the unit that is judged is the entire answer string. Different QA runs very seldom return exactly the same answer strings, and it is quite difficult to determine automatically whether the difference between a new string and a judged string is significant with respect to the correctness of the answer.
As an approximate solution to this problem, NIST created a set of Perl string-matching patterns from the set of strings that the assessors judged correct [10]. An answer string that matches any pattern for its question is marked correct, and is marked incorrect otherwise. The patterns have been created such that almost all strings that were judged correct would be marked correct, sometimes at the expense of marking as correct strings that were judged incorrect. Patterns are constrained to match at word boundaries and case is ignored. An average of 1.7 patterns per question was created for the TREC-8 test set, with 65% of the questions having a single pattern. The TREC-9 set averaged 3.5 patterns per question, with only 45% of the questions having a single pattern. The increase in the number of patterns per question for the TREC-9 set is another indication that the TREC-9 test set was more difficult.
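Pattern-based judging of the kind described can be sketched as follows. The sample patterns for the Taj Mahal question are invented for illustration and are not drawn from NIST's actual pattern files; only the matching rules (any pattern suffices, word boundaries enforced, case ignored) come from the text.

```python
import re


def judge(answer_string, patterns):
    """Mark an answer string correct if it matches ANY of its question's
    patterns. Matching is constrained to word boundaries and ignores case,
    as in the track's pattern evaluation."""
    for pattern in patterns:
        if re.search(r"\b" + pattern + r"\b", answer_string, re.IGNORECASE):
            return True
    return False


# Hypothetical patterns for "Where is the Taj Mahal?":
taj_patterns = ["Agra", "India"]
judge("The Taj Mahal stands in Agra, India.", taj_patterns)  # correct
judge("The Taj Mahal is a famous mausoleum.", taj_patterns)  # incorrect
```

Because the judged unit is the whole answer string, a string can match a pattern without actually supporting the answer in context, which is one way such patterns misjudge responses.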
Using the patterns to evaluate the TREC-8 runs produced differences in the relative scores of different systems that were comparable to the differences caused by using different human assessors, though the differences in relative scores for TREC-9 were somewhat larger. Furthermore, unlike the different judgments among human assessors, patterns misjudge broad classes of responses, classes that are precisely the cases that are difficult for the original systems being evaluated. Thus, the patterns are not a true solution to the problem of building a reusable test collection for question answering. Nonetheless, the patterns can be useful for providing quick feedback as to the relative quality of alternative question answering techniques.
The Future
Evaluating competing technologies on a common problem set is a powerful way to improve the state of the art and hasten technology transfer. As the first large-scale evaluation of domain-independent question answering systems, the TREC question answering track looks to bring the benefits of large-scale evaluation to bear on the question answering problem.
A roadmap for question answering research was recently developed under the auspices of the DARPA TIDES project [1]. The roadmap describes a highly ambitious program to increase the complexity of the types of questions that can be answered, the diversity of sources from which the answers can be drawn, and the means by which answers are displayed. The roadmap also includes a five-year plan for introducing aspects of these research areas as subtasks of the TREC QA track.
The QA track in TREC 2001 (TREC-10) will include the first steps of the roadmap. The main task in the track will be similar to the task used in TRECs 8 and 9, but there will be no guarantee that an answer is actually contained in the corpus. Recognizing that the answer is not available is challenging, but it is an important ability for operational systems to possess since returning an incorrect answer is usually worse than not returning an answer at all. The track will also contain a list subtask in which each question asks for a specific number of instances to be returned. The questions have been constructed such that retrieving the target number of instances requires collating information from more than one document. For example, a list question such as Name the countries the Pope visited in 1994. requires finding multiple documents that describe the Pope's visits and extracting the country from each. The system must also detect duplicate reports of the same visit so that countries are listed only once per visit.
References
[1] Sanda Harabagiu, John Burger, Claire Cardie, Vinay Chaudhri, Robert Gaizauskas, David Israel, Christian Jacquemin, Chin-Yew Lin, Steve Maiorano, George Miller, Dan Moldovan, Bill Ogden, John Prager, Ellen Riloff, Amit Singhal, Rohini Srihari, Tomek Strzalkowski, Ellen Voorhees, and Ralph Weischedel. Issues, tasks, and program structures to roadmap research in question & answering (Q&A), October 2000.
[2] Sanda Harabagiu, Dan Moldovan, Marius Pasca, Rada Mihalcea, Mihai Surdeanu, Razvan Bunescu, Roxana Girju, Vasile Rus, and Paul Morarescu. The role of lexico-semantic feedback in open-domain textual question-answering. In Proceedings of the Association for Computational Linguistics, pages 274-281, July 2001.
[3] Abraham Ittycheriah, Martin Franz, Wei-Jing Zhu, Adwait Ratnaparkhi, and Richard J. Mammone. Question answering using maximum entropy components. In Proceedings of the 2nd Meeting of the North American Chapter of the Association for Computational Linguistics, pages 33-39, June 2001.
[4] John Prager, Eric Brown, Anni Coden, and Dragomir Radev. Question-answering by predictive annotation. In Proceedings of the Twenty-Third Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 184-191, July 2000.
[5] Linda Schamber. Relevance and information behavior. Annual Review of Information Science and Technology, 29:3-48, 1994.
[6] Amit Singhal, Steve Abney, Michiel Bacchiani, Michael Collins, Donald Hindle, and Fernando Pereira. AT&T at TREC-8. In E.M. Voorhees and D.K. Harman, editors, Proceedings of the Eighth Text REtrieval Conference (TREC-8), 2000. NIST Special Publication 500-246.
[7] The Text REtrieval Conference web site, http://trec.nist.gov.
[8] Ellen M. Voorhees. The TREC question answering track. Journal of Natural Language Engineering. To appear.
[9] Ellen M. Voorhees. Variations in relevance judgments and the measurement of retrieval effectiveness. Information Processing and Management, 36:697-716, 2000.
[10] Ellen M. Voorhees and Dawn M. Tice. Building a question answering test collection. In Proceedings of the Twenty-Third Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200-207, July 2000.