Dialogue 2014, Bekasovo
Anastasia Bonch-Osmolovskaya
Alliance of Digital Humanities
Organizations (Europe, America, Australasia, Japan)
 Scholarly dissemination
 Big Data for Humanities
 Distant reading and complex network analysis and
 Linking cultural data: building standardized resources
and interoperability
 a stack of books 2.7 m high
 includes all published works,
variants, unpublished drafts,
diaries, letters, fragments
 13 volumes of diaries
 31 volumes, or 8,500 letters
 about 14.5 million tokens
 commentaries, indexes
Open cultural heritage
A project to digitise the entire works of Leo Tolstoy –
named All of Tolstoy in One Click – making them
available for tablets and smartphones, turned out to be
lighter work than expected for the Tolstoy Museum in
Moscow, when thousands of readers from all over the
world responded to a call for volunteers. (The
Now, thanks largely to the efforts of these volunteers,
nearly all of the great Russian writer’s massive body of
work, including novels, diaries, letters, religious tracts,
philosophical treatises, travelogues, and childhood
memories, will soon be available online, in a form that
can be easily downloaded, free of charge. (The New
 The idea of contemporary standards of cultural heritage
web publishing
 Tagging relevant structural elements of the text and textual
 Linking elements inside and outside the text
 project participants
 Tolstoy Museum (Fekla Tolstaya)
 Higher School of Economics (HSE), philology department (Boris
Orekhov, Anastasia Bonch-Osmolovskaya)
 Tartu University (Roman Leibov)
 ABBYY Compreno (Anatoly Starostin)
 students of the HSE philology department
 What should be tagged?
 What tags should be used?
 Should we do it manually or automatically?
 Do we represent the book or the text? (Do we tag texts
not written by Tolstoy?)
 What should be tagged? Everything that can be
tagged with TEI
 What tags should be used? TEI scheme
 Should we do it manually or automatically? It depends
 Do we represent volumes or texts? Text
 a standard XML scheme for encoding books
 http://www.tei-c.org
 wiki, manuals, tutorials, events, discussions, groups of
 ROMA - http://www.tei-c.org/Roma/ - customization
generator for TEI scheme
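As a rough illustration of what a TEI-encoded text looks like at its smallest, the sketch below builds a skeleton document with Python's standard library. The TEI namespace and the element names (teiHeader, fileDesc, titleStmt, publicationStmt, sourceDesc, text, body) come from the TEI Guidelines; the title, the placeholder descriptions, and the helper names `tei` and `build_minimal_tei` are hypothetical and are not taken from the project.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize TEI as the default namespace (no prefix on element names).
ET.register_namespace("", TEI_NS)


def tei(tag):
    """Return the namespace-qualified name ElementTree expects."""
    return f"{{{TEI_NS}}}{tag}"


def build_minimal_tei(title, paragraphs):
    """Build a skeleton TEI document: a header plus a body of paragraphs."""
    root = ET.Element(tei("TEI"))

    # The header needs a file description with three parts.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = title
    publication = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(publication, tei("p")).text = "Placeholder publication statement."
    source = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source, tei("p")).text = "Placeholder source description."

    # The text itself: one <p> per paragraph.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph
    return root


# Hypothetical title and content, for illustration only.
doc = build_minimal_tei("Diary, 1847", ["17 March 1847"])
xml_string = ET.tostring(doc, encoding="unicode")
```

A schema generated with Roma would then decide which further elements (names, dates, verse lines and so on) are allowed inside this skeleton.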
 critical apparatus
 readings,
names, dates
tables, formulae,
graphics, notated
language corpora
 linking,
 linguistic
 pos tagging
 certainty,
 documentary texts
 literary texts
 prose
 verse
 performance texts
 spoken texts
 transcriptions of
 manuscripts
 ancient texts
 on papyri, stone
 medieval texts
 illuminated manuscripts
 modern texts
 variorum
 handwritten
 typewritten
<l>Я просыпаюсь. Я
<l>Открывшимся. Я на
 create volume/text-type matrix
 select TEI schemes for different text types
 use modified XML from ABBYY FineReader for
structural elements
 parse indexes and link them to text
 define intertextual links
 make Semantic Tolstoy cookbook
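The first two steps of the plan can be pictured as a lookup from volumes to text types and from text types to TEI modules. The sketch below is only an assumption about how such a matrix might be stored: the volume labels and their contents are hypothetical placeholders, and the pairing of text types with modules is an illustrative guess, although the module names themselves (tei, core, header, textstructure, verse, drama, namesdates) are real TEI modules.

```python
# Hypothetical volume labels: the real volume/text-type matrix is not given here.
VOLUME_TEXT_TYPES = {
    "vol-A": {"prose"},
    "vol-B": {"verse", "performance"},
    "vol-C": {"diary", "letter"},
}

# Assumed pairing of text types with extra TEI modules (illustrative only).
MODULES_BY_TEXT_TYPE = {
    "prose": set(),
    "verse": {"verse"},
    "performance": {"drama"},
    "diary": {"namesdates"},
    "letter": {"namesdates"},
}

# Modules that every TEI schema needs regardless of text type.
BASE_MODULES = {"tei", "core", "header", "textstructure"}


def modules_for_volume(volume):
    """Return the union of the base modules and those needed by the volume's text types."""
    modules = set(BASE_MODULES)
    for text_type in VOLUME_TEXT_TYPES[volume]:
        modules |= MODULES_BY_TEXT_TYPE[text_type]
    return modules


for volume in sorted(VOLUME_TEXT_TYPES):
    print(volume, sorted(modules_for_volume(volume)))
```

Keeping the matrix as plain data makes it easy to regenerate a per-volume schema whenever a text type is reassigned.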
<forename>Agrafena Petrovna</forename>'s smile
meant that the letter was from
<surname>Korchagina</surname>, whom, in
<forename>Agrafena Petrovna</forename>'s opinion,
<surname>Nekhlyudov</surname> intended to marry.
And this supposition, expressed by
<forename>Agrafena Petrovna</forename>'s smile, was
unpleasant to <surname>Nekhlyudov</surname>.
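Once names are tagged this way, they can be pulled out mechanically. The following sketch counts tagged names in a shortened English rendering of the passage above; the wrapping <p> element (added so the fragment parses as XML) and the function name `count_names` are assumptions made for the example, not part of the project's tooling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened English rendering of the tagged passage, wrapped in <p> so it is well-formed.
SAMPLE = (
    "<p><forename>Agrafena Petrovna</forename>'s smile meant that the letter "
    "was from <surname>Korchagina</surname>, whom <surname>Nekhlyudov</surname> "
    "intended to marry. This was unpleasant to "
    "<surname>Nekhlyudov</surname>.</p>"
)


def count_names(xml_fragment, tags=("forename", "surname")):
    """Count occurrences of each tagged name, keyed by (tag, normalized text)."""
    root = ET.fromstring(xml_fragment)
    counts = Counter()
    for element in root.iter():
        if element.tag in tags:
            # Collapse internal whitespace so line breaks do not split a name.
            name = " ".join("".join(element.itertext()).split())
            counts[(element.tag, name)] += 1
    return counts


name_counts = count_names(SAMPLE)
```

In the Russian original the same person appears in different case forms, so a real pipeline would also need to merge inflected variants before counting.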
Direct, full:
17 March 1847
<date when="1847-03-17">17 March 1847</date>
Direct, partial:
On the 22nd
<date when="1847-03-22">On the 22nd</date>
Ray, backward:
It is already the sixth day
It is already <date from="1847-04-24" to="1848-01-01">the sixth day</date>
Segment, present:
This week I am staying at home
This <date from="1847-04-19" to="1847-04-25">week</date> I am staying at home
Point, past:
I am entirely satisfied with myself for yesterday
I am entirely satisfied with myself for <date when="1847-04-23">yesterday</date>
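TEI expects the when, from and to attributes of <date> in year-month-day order (YYYY-MM-DD), so a day-month swap such as 1847-17-03 is invalid. A small check like the one below could flag such values before publication. It is a sketch under simplifying assumptions: it accepts only full dates, whereas TEI also permits partial values such as a bare year, and the function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

ISO_DAY = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def is_iso_day(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    match = ISO_DAY.fullmatch(value)
    if match is None:
        return False
    year, month, day = (int(part) for part in match.groups())
    try:
        date(year, month, day)
    except ValueError:
        # For example month 17, or 30 February.
        return False
    return True


def invalid_date_attributes(xml_fragment):
    """Return (attribute, value) pairs on <date> elements that fail the check."""
    root = ET.fromstring(xml_fragment)
    problems = []
    for element in root.iter("date"):
        for attribute in ("when", "from", "to"):
            value = element.get(attribute)
            if value is not None and not is_iso_day(value):
                problems.append((attribute, value))
    return problems


# One swapped value and one valid range, wrapped in <p> for well-formedness.
SAMPLE = (
    '<p><date when="1847-17-03">17 March 1847</date> '
    '<date from="1847-04-19" to="1847-04-25">week</date></p>'
)
problems = invalid_date_attributes(SAMPLE)
```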
 Student projects
 Old2New orthography transliterator
 Tolstoy corpus for ruscorpora
 Universal index parser
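To give a sense of what an old-to-new orthography transliterator involves, here is a minimal sketch covering only the letter-level changes of the 1918 reform: ѣ, і, ѳ and ѵ are replaced, and a word-final ъ is dropped. It is not the students' implementation; the name `modernize` is hypothetical, and morphological changes such as the old adjective endings are deliberately left out.

```python
import re

# Letters abolished by the 1918 reform and their modern replacements.
LETTER_MAP = str.maketrans({
    "ѣ": "е", "Ѣ": "Е",
    "і": "и", "І": "И",
    "ѳ": "ф", "Ѳ": "Ф",
    "ѵ": "и", "Ѵ": "И",
})

# A hard sign at the end of a word; a word-internal ъ (as in "объявление") is kept.
FINAL_HARD_SIGN = re.compile(r"[ъЪ]\b")


def modernize(text):
    """Convert pre-reform spelling to modern spelling, letter-level rules only."""
    text = FINAL_HARD_SIGN.sub("", text)
    return text.translate(LETTER_MAP)


print(modernize("Левъ Толстой"))
print(modernize("объявленіе"))
```

Because і and и collapse into one letter, distinct pre-reform words can become identical after conversion, which is one reason to keep the original spelling alongside the modernized text.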
 Together with Compreno
 Named entity extraction
 Evaluation of NE merging (indexes as a Gold Standard)
 Fact extraction
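One way to evaluate entity merging against the printed indexes is to compare which mentions end up grouped together. The slides do not say which metric is used, so the sketch below substitutes a common choice, pairwise precision and recall over clusters of mentions; the mention identifiers and function names are hypothetical.

```python
from itertools import combinations


def same_cluster_pairs(clusters):
    """Return every unordered pair of mentions that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for first, second in combinations(sorted(cluster), 2):
            pairs.add((first, second))
    return pairs


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F1 of predicted clusters against gold clusters."""
    predicted_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    correct = len(predicted_pairs & gold_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical mention ids: the index groups m1-m3 as one person and m4-m5 as another.
gold = [{"m1", "m2", "m3"}, {"m4", "m5"}]
# The automatic merging wrongly attaches m3 to the second person.
predicted = [{"m1", "m2"}, {"m3", "m4", "m5"}]
precision, recall, f1 = pairwise_scores(predicted, gold)
```

Here two of the four predicted pairs are confirmed by the index and two of the four index pairs are recovered, so precision and recall are both 0.5.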
Rufus Pollock,
Co-Founder and Director, Open
Knowledge Foundation