How much information does a
language have?
• Shannon, C. "Prediction and Entropy of Printed English," Bell System Technical Journal, 1951
Motivation/Skills
Redundancy
The redundancy of ordinary English, not
considering statistical structure over
greater distances than about eight
letters, is roughly 50%. This means that
when we write
En_ _ _sh ha_f o_ w_ _t w_ w_ _te i_
dete_ _ _ _e_ b_ t_e str_ct_r_ _ f _ _ _
lang_ _ _ _ a_d H_ _f i_ c_os_n fre_ _ _
$$\text{Redundancy} = 1 - \frac{H}{H_{\max}}$$
Entropy
$$H = -\sum_{i=1}^{27} p_i \log p_i$$
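As a minimal sketch of the two definitions above, the following estimates H from the letter counts of a text and then the redundancy 1 - H/Hmax. The 27-symbol alphabet (26 letters plus the space) matches the upper limit of the sum; the base-2 logarithm, the sample sentence, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "  # 26 letters plus the space: 27 symbols


def letter_entropy(text: str) -> float:
    """Return H = -sum_i p_i log2 p_i (bits per letter) from single-letter counts."""
    symbols = [c for c in text.lower() if c in ALPHABET]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def redundancy(h: float, h_max: float = math.log2(len(ALPHABET))) -> float:
    """Return 1 - H / H_max, with H_max = log2(27) by default."""
    return 1.0 - h / h_max


sample = "the quick brown fox jumps over the lazy dog"  # hypothetical sample text
h = letter_entropy(sample)
print(f"H = {h:.3f} bits per letter, redundancy = {redundancy(h):.1%}")
```

A short sample like this only reflects single-letter frequencies, so it says nothing about the longer-range structure that the 50% figure refers to.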
How much information is produced on average
for each letter?
Saisi par l'inspiration, il composa illico un lai, qui, suivant la tradition du Canticum
Canticorum Salomonis, magnifiait l'illuminant corps d'Anastasia : Ton corps, un grand
galion où j'irai au long-cours, un sloop, un brigantin tanguant sous mon roulis, Ton
front, un fort dont j'irai à l'assaut, un bastion, un glacis qui fondra sous l'aquilon du
transport qui m'agit,
‘L’Evêqe en effet est très streect: le clergé, de temps en temps, se permet de révéler ses
préférences envers des ‘événements’ frenchement débreedés, mets l’évêqe hème qe ses
fêtes respectent des règles sévères et les trensgresser, c’est fréqemment reesqer de se fère
relegger’.
$$S_{\text{English}} = 4.13 > S_{\text{Spanish}} = 4.01$$
How much information is
obtained by adding one letter?
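Letter   Probability
E        0.131
T        0.105
A        0.082
...
X        0.002
J        0.001
Q        0.001
Z        0.0008

$$F_N = \sum_{i} p(b_i) \log_2 p(b_i) - \sum_{i,j} p(b_i, j) \log_2 p(b_i, j)$$

where $b_i$ is a block of $N-1$ letters and $j$ is the letter that follows it.

$$H = \lim_{N \to \infty} F_N$$

Fn   Bits per letter
F0   4.75
F1   4.03
F2   3.32
F3   3.1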
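The definition of F_N can be sketched directly from N-gram counts. This is an illustrative estimate, not the procedure used to obtain the table: the function name `f_n` and the sample text are hypothetical, and p(b_i) is taken as the marginal of the empirical N-gram distribution.

```python
import math
from collections import Counter


def f_n(text: str, n: int) -> float:
    """Estimate F_N (bits per letter) from the N-gram counts of `text`, for n >= 1.

    F_N = sum_i p(b_i) log2 p(b_i) - sum_{i,j} p(b_i, j) log2 p(b_i, j),
    where b_i is a block of N-1 letters and j is the following letter.
    """
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(ngrams.values())

    # p(b_i): marginal over the last letter of each N-gram.
    blocks = Counter()
    for gram, count in ngrams.items():
        blocks[gram[:-1]] += count

    h_joint = -sum((c / total) * math.log2(c / total) for c in ngrams.values())
    h_block = -sum((c / total) * math.log2(c / total) for c in blocks.values())
    return h_joint - h_block


sample = "the redundancy of ordinary english is roughly fifty percent " * 20
for n in (1, 2, 3):
    print(f"F{n} ~ {f_n(sample, n):.2f} bits per letter")
```

On a small or repetitive sample the estimates fall quickly with N because most blocks have only one observed continuation; values like those in the table need a large corpus.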
3rd-order approximation
IN NO IST LAT WHEY CRATICT
FROURE BIRS GROCID PONDENOME
OF DEMONSTURES OF THE
REPTAGIN IS REGOACTIONA OF CRE.
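Text like the sample above can be produced by drawing each letter according to how often it follows the previous two letters in a corpus. The sketch below assumes a tiny stand-in corpus (taken from the redundancy quotation) and hypothetical helper names; a realistic run would need a much larger text.

```python
import random
from collections import Counter, defaultdict


def build_model(corpus: str, order: int = 3) -> dict:
    """Map each (order-1)-letter context to a Counter of following letters.

    `order` must be at least 2 (order 3 means trigram statistics).
    """
    model = defaultdict(Counter)
    for i in range(len(corpus) - order + 1):
        context = corpus[i:i + order - 1]
        model[context][corpus[i + order - 1]] += 1
    return model


def generate(model: dict, length: int, seed: int = 0) -> str:
    """Sample `length` letters, each conditioned on the preceding context."""
    rng = random.Random(seed)
    contexts = list(model)
    out = rng.choice(contexts)
    k = len(out)  # context length, i.e. order - 1
    while len(out) < length:
        followers = model.get(out[-k:])
        if not followers:
            # Dead end (context never continued in the corpus): restart.
            out += rng.choice(contexts)
            continue
        letters, weights = zip(*followers.items())
        out += rng.choices(letters, weights=weights)[0]
    return out[:length]


# Hypothetical stand-in corpus; far too small for realistic output.
CORPUS = ("THE REDUNDANCY OF ORDINARY ENGLISH NOT CONSIDERING STATISTICAL "
          "STRUCTURE OVER GREATER DISTANCES THAN ABOUT EIGHT LETTERS IS "
          "ROUGHLY FIFTY PERCENT ")
print(generate(build_model(CORPUS, order=3), 60))
```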
Vocabulary size (no. lemmas)   % of content in OEC   Example lemmas
10                             25%                   the, of, and, to, that, have
100                            50%                   from, because, go, me, our, well, way
1000                           75%                   girl, win, decide, huge, difficult, series
7000                           90%                   tackle, peak, crude, purely, dude, modest
50,000                         95%                   saboteur, autocracy, calyx, conformist
>1,000,000                     99%                   laggardly, endobenthic, pomological
#   Word   Probability
1   The    .071
2   of     .034
3   and    .03
...
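Zipf’s Law
$$P_n = \frac{0.1}{n}, \qquad \sum_{n=1}^{8727} P_n = 1$$
$$F_{\text{word}} = 11.82 \text{ bits per word}, \qquad \frac{F_{\text{word}}}{\text{Length}} = 2.62 \text{ bits per letter}$$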
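The quantities on this slide can be evaluated numerically. The sketch below takes P_n = 0.1/n and the cutoff 8727 from the slide; the mean word length of 4.5 letters is an assumption (it is the ratio implied by 11.82 / 2.62), and the variable names are hypothetical.

```python
import math

N_WORDS = 8727          # cutoff rank quoted on the slide
MEAN_WORD_LENGTH = 4.5  # assumption: letters per word, implied by 11.82 / 2.62

# Zipf's law: probability of the n-th most frequent word.
probs = [0.1 / n for n in range(1, N_WORDS + 1)]

total = sum(probs)                               # partial sum of P_n
f_word = -sum(p * math.log2(p) for p in probs)   # entropy in bits per word

print(f"sum of P_n up to n = {N_WORDS}: {total:.4f}")
print(f"F_word = {f_word:.2f} bits per word")
print(f"F_word / length = {f_word / MEAN_WORD_LENGTH:.2f} bits per letter")
```

Evaluated this way, the partial sum comes out near 0.965 rather than exactly 1, so the printed values are best read as a check on the slide's figures, not a reproduction of them.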
Is English trying to warn us?

#             Word                                               Probability
1             The                                                .071
2             of                                                 .034
3             and                                                .03
...
992-995       America, ensure, oil, opportunity
2629-2634     bush, admit, specifically, agents, smell, denied
16047-16048   arafat, unhealthy
How to continue?
Aoccdrnig to rseearch at an Elingsh uinervtisy, it
deosn't mttaer in waht oredr the ltteers in a wrod are,
the olny iprmoatnt tihng is that the frist and lsat
ltteer is at the rghit pclae. The rset can be a toatl
mses and you can sitll raed it wouthit a porbelm. Tihs
is bcuseae we do not raed ervey lteter by it slef but
the wrod as a wlohe.
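A passage like the one above can be generated by shuffling the interior letters of each word while keeping the first and last letters fixed. This is a small sketch with hypothetical function names; it makes no claim about how readable the result is.

```python
import random
import re


def scramble_word(word: str, rng: random.Random) -> str:
    """Shuffle the interior letters of a word, keeping first and last in place."""
    if len(word) <= 3:
        return word  # nothing to shuffle
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]


def scramble_text(text: str, seed: int = 0) -> str:
    """Scramble every alphabetic word in `text`, leaving punctuation untouched."""
    rng = random.Random(seed)
    return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(), rng), text)


print(scramble_text("The rest can be a total mess and you can still read it."))
```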
Revealing the statistics of the
language
• Q…..
2034 words start with q
• ….q
8 words end with q
q ….
….q0
Iraq0.1
Revealing the statistics of the
language
THERE IS NO REVERSE ON A MOTORCYCLE A
1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1 3 1
FRIEND OF MINE FOUND THIS OUT
8 6 1 3 1 1 1 1 1 1 1 1 1 1 1 6 2 1 1 1 1 1 1 2 1 1 1 1 1 1
RATHER DRAMATICALLY THE OTHER DAY
4 1 1 1 1 1 1 11 5 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1 1 1 1 1 1 1 1 1 1

Each number is the # of guesses needed for the letter (or space) at that position.
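The guessing experiment can be imitated with a mechanical predictor that always guesses letters in order of how often they followed the same context in a training text. The sketch below is an assumption-laden stand-in for a human guesser: the tiny training string, the tie-breaking rule, and all function names are hypothetical. It also shows that the sequence of guess numbers can be turned back into the original text by an identical predictor.

```python
from collections import Counter, defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ "  # 27 symbols


def train(corpus: str, order: int) -> dict:
    """Count which letter follows each context of up to order-1 letters."""
    model = defaultdict(Counter)
    for i in range(len(corpus)):
        context = corpus[max(0, i - order + 1):i]
        model[context][corpus[i]] += 1
    return model


def guess_order(model: dict, context: str) -> list:
    """Letters from most to least frequent after `context` (ties alphabetical)."""
    counts = model.get(context, Counter())
    return sorted(ALPHABET, key=lambda c: (-counts[c], c))


def reduce_text(text: str, model: dict, order: int) -> list:
    """Replace each letter by the number of guesses needed to reach it."""
    ranks = []
    for i, letter in enumerate(text):
        context = text[max(0, i - order + 1):i]
        ranks.append(guess_order(model, context).index(letter) + 1)
    return ranks


def restore_text(ranks: list, model: dict, order: int) -> str:
    """Invert reduce_text using the same model."""
    text = ""
    for r in ranks:
        context = text[max(0, len(text) - order + 1):]
        text += guess_order(model, context)[r - 1]
    return text


training = "THERE IS NO REVERSE ON A MOTORCYCLE "  # hypothetical, tiny training text
model = train(training * 3, order=3)
ranks = reduce_text("THE OTHER DAY", model, order=3)
print(ranks)
print(restore_text(ranks, model, order=3))
```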
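What is the probability of finding the number 1
in the third position?

THE 111   REV 115   ERS 112   MOT 1 12   THA 112

$$q_1^3 = \sum_{i_1 i_2} p(i_1, i_2, j)$$

where the sum runs over all pairs $i_1 i_2$ and $j$ is the most probable letter after $i_1 i_2$.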
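THE 111   ANT 313   ERS 112   MOT 1 12   HER 222
THA 112   HEN 113   ERS 112   TH_ 1 13   AN_ 312   HE_ 221
REV 115   ERS 112   MOT 1 12   AND 311

Probability of finding the number i in place N:

$$q_i^N = \sum_{i_1 \ldots i_{N-1}} p(i_1, \ldots, i_{N-1}, j)$$

where the sum runs over all blocks $i_1 \ldots i_{N-1}$ and $j$ is the $i$-th most probable letter after that block.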
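The sum defining q_i^N can be sketched from N-gram counts: for every block of N-1 letters, rank the following letters by frequency and add the probability of the i-th ranked one to q_i^N. The function name and the toy corpus are hypothetical, and ties are broken by first appearance, which is an arbitrary choice.

```python
from collections import Counter, defaultdict


def rank_distribution(corpus: str, n: int, alphabet_size: int = 27) -> list:
    """Return [q_1^N, q_2^N, ...] estimated from the N-grams of `corpus`.

    Assumes the corpus uses at most `alphabet_size` distinct symbols.
    """
    followers = defaultdict(Counter)
    for i in range(len(corpus) - n + 1):
        block, j = corpus[i:i + n - 1], corpus[i + n - 1]
        followers[block][j] += 1

    total = sum(sum(c.values()) for c in followers.values())
    q = [0.0] * alphabet_size
    for counts in followers.values():
        # most_common() lists the letters after this block by decreasing count.
        for rank, (_, count) in enumerate(counts.most_common()):
            q[rank] += count / total
    return q


q = rank_distribution("ABABAC", n=2)  # toy corpus for illustration
print(q[:3])
```

For the toy corpus, B follows A twice and C once, and A always follows B, so the first guess is right for 4 of the 5 digrams and the second guess for the remaining one.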
Bounds

THERE IS NO REVERSE ON A MOTORCYCLE A
F0 (all the letters have the same probability)
F1 (each letter has its own probability)
F2 (correlation of two letters)
...
FN

1 1 1 5 1 1 2 1 1 2 1 1 15 1 17 1 1 1 2 1 3 2 1 2 2 7 1 1 1 1 4 1 1 1 1 1 3 1
F0 (all the numbers have the same probability)
F1 (each number has its own probability)
$$-\sum_{i=1}^{27} q_i^N \log q_i^N$$
F2 (correlation of two numbers)
Bounds
$$\sum_{i=1}^{27} i \left(q_i^N - q_{i+1}^N\right) \log i \;\le\; F_N \;\le\; -\sum_{i=1}^{27} q_i^N \log q_i^N$$
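Given the guess-number probabilities q_i^N, both sides of the inequality are simple sums. The sketch below assumes base-2 logarithms (so the bounds are in bits per letter) and takes q_{28}^N = 0; the function name is hypothetical.

```python
import math


def entropy_bounds(q: list) -> tuple:
    """Return (lower, upper) bounds on F_N from q = [q_1^N, q_2^N, ...].

    upper = -sum_i q_i log2 q_i
    lower =  sum_i i * (q_i - q_{i+1}) * log2 i, with q beyond the list taken as 0
    """
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    padded = list(q) + [0.0]
    lower = sum(
        i * (padded[i - 1] - padded[i]) * math.log2(i)
        for i in range(1, len(q) + 1)
    )
    return lower, upper


# Two limiting cases: perfectly predictable text and completely unpredictable text.
print(entropy_bounds([1.0] + [0.0] * 26))
print(entropy_bounds([1 / 27] * 27))
```

In the first case both bounds are zero; in the second both equal log2(27), which is the F0 value for a 27-symbol alphabet.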
Entropy

47    47    .47     $q_1^4$
17    17    .17     $q_2^4$
13    13    .13
3     1.3   0.013
1     1.3   0.013
5     1.3   0.013
3     1.3   0.013
...

$$\sum_{i=1}^{27} i \left(q_i^4 - q_{i+1}^4\right) \log i \;\le\; F_4 \;\le\; -\sum_{i=1}^{27} q_i^4 \log q_i^4$$
Bounds
Redundancy ~ 75%

Fn   Bits per letter
F0   4.75
F1   4.03
F2   3.32
F3   3.1
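As a quick arithmetic check relating the table to the redundancy formula 1 - H/Hmax, the sketch below uses F0 as Hmax. The helper name is hypothetical, and treating each F_N as a stand-in for H is an assumption made only to show how far the low-order values are from the quoted 75%.

```python
# Bits per letter, copied from the table above.
F = {0: 4.75, 1: 4.03, 2: 3.32, 3: 3.1}


def redundancy_from_table(f_n: float, f_0: float = F[0]) -> float:
    """Redundancy 1 - F_N / F_0, using F_0 as the maximum entropy."""
    return 1.0 - f_n / f_0


for n, value in F.items():
    print(f"F{n} = {value}: redundancy = {redundancy_from_table(value):.1%}")

# Entropy per letter implied by a redundancy of 75%.
implied_entropy = (1.0 - 0.75) * F[0]
print(f"75% redundancy corresponds to about {implied_entropy:.2f} bits per letter")
```

The values up to F3 account for roughly a third of the maximum, so the remaining redundancy has to come from structure extending beyond three letters.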