
Dialogue 2014, Bekasovo
Anastasia Bonch-Osmolovskaya
NRU HSE


Association of Digital Humanities
Organizations (Europe, America, Australasia, Japan)
 Scholarly dissemination
 Big Data for Humanities
 Distant reading, complex network analysis and visualisation
 Linking cultural data: building standardized resources and interoperability
 a stack of books 2.7 m high
 includes all published works,
variants, unpublished drafts,
diaries, letters, fragments
 13 volumes of diaries
 31 volumes, or 8,500 letters
 about 14.5 million tokens
 commentaries, indexes
Open cultural heritage
A project to digitise the entire works of Leo Tolstoy –
named All of Tolstoy in One Click – making them
available for tablets and smartphones, turned out to be
lighter work than expected for the Tolstoy Museum in
Moscow, when thousands of readers from all over the
world responded to a call for volunteers. (The
Guardian)
Now, thanks largely to the efforts of these volunteers,
nearly all of the great Russian writer’s massive body of
work, including novels, diaries, letters, religious tracts,
philosophical treatises, travelogues, and childhood
memories, will soon be available online, in a form that
can be easily downloaded, free of charge. (The New
Yorker)
 The idea of contemporary standards of cultural heritage
web publishing
 Tagging relevant structural elements of the text and textual
data
 Linking elements inside and outside the text
 project participants
 Tolstoy Museum (Fekla Tolstaya)
 Higher School of Economics, philology department (Boris Orekhov, Anastasia Bonch-Osmolovskaya)
 Tartu University (Roman Leibov)
 ABBYY Compreno (Anatoly Starostin)
 students of the HSE philology department
 What should be tagged?
 What tags should be used?
 Should we do it manually or automatically?
 Do we represent the book or the text? (Do we tag texts that are not by Tolstoy?)
 What should be tagged? Everything that can be
tagged with TEI
 What tags should be used? TEI scheme
 Should we do it manually or automatically? It depends
 Do we represent volumes or texts? Text
 standard XML scheme for encoding books
 http://www.tei-c.org
 wiki, manuals, tutorials, events, discussions, groups of
interest
 ROMA - http://www.tei-c.org/Roma/ - customization generator for the TEI scheme
 critical apparatus
 readings, variants
 names, dates, places
 tables, formulae, graphics, notated music
 language corpora
 dictionaries
 linking, segmentation, alignment
 linguistic annotation
 POS tagging
 certainty, precision, responsibility
 documentary texts
 literary texts
 prose
 verse
 performance texts
 spoken texts
 transcriptions of speech
 manuscripts
 ancient texts
 on papyri, stone
 medieval texts
 illuminated manuscripts
 modern texts
 variorum
 handwritten
 typewritten
<l>Я просыпаюсь. Я
<choice>
<orig>об'ят</orig>
<reg>объят</reg>
</choice>
</l>
<l>Открывшимся. Я на
<choice>
<orig>учете</orig>
<reg>учете.</reg>
</choice>
</l>
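A minimal sketch of how <choice> markup like the above might be produced automatically, in the spirit of an old-to-new orthography transliterator. The letter mapping is a deliberately simplified assumption (it ignores the morphological rules of the 1918 reform), and the names OLD_TO_NEW, regularize, tei_choice and encode_line are hypothetical, not part of the project's code.

```python
import re
from xml.sax.saxutils import escape

# Simplified letter mapping (assumption: not an exhaustive set of reform rules).
OLD_TO_NEW = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

def regularize(word: str) -> str:
    """Return a regularized spelling of one word under the simplified rules."""
    word = word.translate(OLD_TO_NEW)
    # Apostrophe printed in place of the hard sign inside a word: об'ят -> объят.
    word = re.sub(r"(?<=\w)'(?=\w)", "ъ", word)
    # Word-final hard sign is dropped: домъ -> дом.
    if len(word) > 1 and word[-1] in "ъЪ":
        word = word[:-1]
    return word

def tei_choice(token: str) -> str:
    """Wrap a token in <choice> only if regularization changes it."""
    reg = regularize(token)
    if reg == token:
        return escape(token)
    return f"<choice><orig>{escape(token)}</orig><reg>{escape(reg)}</reg></choice>"

def encode_line(line: str) -> str:
    """Encode one verse line as <l>, token by token."""
    # Words (optionally with word-internal apostrophes) or runs of non-word characters.
    tokens = re.findall(r"\w+(?:'\w+)*|\W+", line)
    return "<l>" + "".join(tei_choice(t) for t in tokens) + "</l>"

if __name__ == "__main__":
    print(encode_line("Я просыпаюсь. Я об'ят"))
```

Keeping the original reading in <orig> next to <reg> means nothing is lost: the same file can serve both a diplomatic and a normalized view of the text.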
 create volume/text-type matrix
 select TEI schemes for different text types
 use modified XML from ABBYY FineReader for structural elements
 parse indexes and link them to text
 define intertextual links
 make Semantic Tolstoy cookbook
Улыбка <forename>Аграфены Петровны</forename>
означала, что письмо было от
<roleName>княжны</roleName>
<surname>Корчагиной</surname>, на которой, по
мнению <forename>Аграфены Петровны</forename>,
<surname>Нехлюдов</surname> собирался жениться.
И это предположение, выражаемое улыбкой
<forename>Аграфены Петровны</forename>, было
неприятно <surname>Нехлюдову</surname>.
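Once names are tagged this way, they can be pulled out of the text mechanically, for example to compare them against a printed index. The sketch below uses Python's standard xml.etree.ElementTree; the function collect_names and the shortened SAMPLE string are illustrative assumptions, and tag names are compared case-insensitively so that both rolename and roleName are accepted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# Tags of interest, lowercased for case-insensitive comparison.
NAME_TAGS = {"forename", "surname", "rolename"}

def collect_names(fragment: str) -> Dict[str, List[str]]:
    """Group the text of name-like elements by (lowercased) tag name."""
    # The fragment is mixed content without a single root, so wrap it in a dummy <p>.
    root = ET.fromstring(f"<p>{fragment}</p>")
    found: Dict[str, List[str]] = defaultdict(list)
    for el in root.iter():
        tag = el.tag.lower()
        if tag in NAME_TAGS:
            found[tag].append("".join(el.itertext()).strip())
    return dict(found)

# Shortened sample based on the passage above (hypothetical test input).
SAMPLE = (
    "Улыбка <forename>Аграфены Петровны</forename> означала, что письмо было от "
    "<rolename>княжны</rolename> <surname>Корчагиной</surname>."
)

if __name__ == "__main__":
    for tag, values in collect_names(SAMPLE).items():
        print(tag, values)
```

The extracted strings are inflected forms exactly as they occur in the text (here, genitive), so matching them against index entries would still need lemmatization or a lookup table.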
Direct, complete (Прямой полный)
17 марта 1847 года (17 March 1847)
<date when="1847-03-17">17 марта 1847 года</date>

Direct, incomplete (Прямой неполный)
Числа 22 (the 22nd)
<date when="1847-03-22">Числа 22</date>

Ray, backward (Лучевой задний)
Вот уже шестой день (This is already the sixth day)
Вот уже <date from="1847-04-24" to="1848-01-01">шестой день</date>

Interval, present (Отрезковый наст.)
Эту неделю я сижу дома (This week I am staying at home)
Эту <date from="1847-04-19" to="1847-04-25">неделю</date> я сижу дома

Point, past (Точечный прош.)
Я совершенно доволен собою за вчерашний день (I am completely satisfied with myself for yesterday)
Я совершенно доволен собою за <date when="1847-04-23">вчерашний день</date>
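Direct dates and simple relative expressions like those above lend themselves to rule-based tagging. The following is a rough sketch under stated assumptions: the function names tag_full_dates and tag_yesterday are hypothetical, the entry date is supplied by the caller, and @when is written in the ISO form YYYY-MM-DD that TEI expects. Calendar questions (the diaries use Old Style dates) are ignored here.

```python
import re
from datetime import date, timedelta

# Genitive month names as they appear in full dates.
MONTHS = {
    "января": 1, "февраля": 2, "марта": 3, "апреля": 4,
    "мая": 5, "июня": 6, "июля": 7, "августа": 8,
    "сентября": 9, "октября": 10, "ноября": 11, "декабря": 12,
}

# day, month word, four-digit year, optional trailing "года".
FULL_DATE = re.compile(r"(\d{1,2})\s+([а-яё]+)\s+(\d{4})(?:\s+года)?", re.IGNORECASE)

def tag_full_dates(text: str) -> str:
    """Wrap complete direct dates in <date when="YYYY-MM-DD">."""
    def repl(m):
        month = MONTHS.get(m.group(2).lower())
        if month is None:
            return m.group(0)  # not a month name: leave the text untouched
        try:
            iso = date(int(m.group(3)), month, int(m.group(1))).isoformat()
        except ValueError:
            return m.group(0)  # impossible date such as 31 апреля
        return f'<date when="{iso}">{m.group(0)}</date>'
    return FULL_DATE.sub(repl, text)

def tag_yesterday(text: str, entry_date: date) -> str:
    """Resolve 'вчерашний день' relative to the date of the diary entry."""
    iso = (entry_date - timedelta(days=1)).isoformat()
    return text.replace("вчерашний день", f'<date when="{iso}">вчерашний день</date>')

if __name__ == "__main__":
    print(tag_full_dates("17 марта 1847 года"))
    print(tag_yesterday("Я совершенно доволен собою за вчерашний день", date(1847, 4, 24)))
```

Incomplete dates and ray or interval expressions need more context (the current entry date plus the direction of the reference), which is where manual checking or a richer parser would come in.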
 Student projects
 Old2New orthography transliterator
 Tolstoy corpus for ruscorpora
 Universal index parser
 Together with Compreno
 Named entity extraction
 Evaluation of NE merging (indexes as a Gold Standard)
 Fact extraction
Rufus Pollock,
Co-Founder and Director, Open
Knowledge Foundation