Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2012145636
An object of the present invention is to perform speech recognition that is robust against the mixing of non-stationary noise. SOLUTION: The device comprises: a first air conduction sound microphone 11 that is mounted in the body in a sealed manner and picks up sound; a second air conduction sound microphone 21 that is mounted outside the body and picks up sound; a first decoder unit 16 that recognizes a first word string from the sound picked up by the first air conduction sound microphone 11; a first word section extraction unit 19 that extracts the utterance section corresponding to each word constituting the first word string; a second decoder unit 26 that recognizes a second word string from the sound picked up by the second air conduction sound microphone 21 in the utterance sections extracted by the first word section extraction unit 19; and a word string replacement unit 28 that replaces those words of the first word string whose noise level is equal to or less than a predetermined value with the second word string. [Selected figure] Figure 1
Speech recognition apparatus and speech recognition method
[0001]
The present invention relates to a speech recognition apparatus and speech recognition method
that perform speech recognition with high accuracy even in a noisy environment.
[0002]
In recent years, a device has been proposed in which a computer recognizes speech uttered by a
user and inputs recognition data.
As shown in Patent Document 1, speech recognition allows data to be input to a system hands-free and eyes-free, and there is a high need for it as an input device, especially in situations where the hands are occupied by inspection or maintenance work or where the eyes must stay on the task. However, at inspection and maintenance sites there is often loud noise from the target equipment and the surroundings, and this noise is mixed into the input voice, so there is a problem that the recognition accuracy is lowered.
[0003]
In order to solve this problem, speech recognition apparatuses for noisy environments have been proposed. Relatively effective conventional methods include, for example, the following. (a) Extracting only the speaker's voice using a microphone with strong directivity. (b) Using two microphones, one for picking up the speaker's voice and one for picking up the noise, and emphasizing the speaker's voice by subtracting the noise component from the voice component. (c) Acquiring the vibration transmitted to the bone with a bone conduction microphone, a special microphone constituted by an acceleration sensor.
[0004]
However, with method (a), it is difficult to build a small, highly directional microphone that an operator can wear, and under high noise the noise still leaks into the microphone. With method (b), under high noise the speaker's voice is buried in the noise component, so a strong effect cannot be obtained. With method (c), a high signal-to-noise ratio (SNR) can be obtained, but high-frequency components cannot be acquired with a bone conduction microphone, so there is a problem that accuracy cannot be obtained in speech recognition.
[0005]
Patent Document 2, which is intended to solve the problems of methods (a) to (c), will be described with reference to the drawings.
FIG. 16 is a block diagram showing the configuration of a conventional speech recognition apparatus. FIG. 17 shows the recognition result of the conventional speech recognition apparatus: FIG. 17 (a) shows a sound collection spectrum, and FIG. 17 (b) shows a collected sound waveform. As shown in FIG. 16, the conventional speech recognition apparatus comprises a bone conduction microphone 91, an air conduction sound microphone 92, an A/D conversion unit 93, a power calculation unit 94, a voice section detection unit 95, a decoder unit 96 for speech recognition, an acoustic model storage unit 97, a language model storage unit 98, and a display unit 99.
[0006]
First, the bone conduction microphone 91 and the air conduction sound microphone 92 convert the collected voice into electric signals and input them as analog data. The A/D conversion unit 93 A/D-converts and quantizes the analog data acquired from the bone conduction microphone 91 and the air conduction sound microphone 92, and then stores the data in a RAM (not shown) or the like. The power calculation unit 94 extracts a power spectrum from the quantized data stored in the RAM using a known short-time Fourier analysis method or LPC (Linear Predictive Coding) analysis method for speech signals (see the references listed later). The voice section detection unit 95 determines the voice section using the power spectra of the bone conduction microphone 91 and the air conduction sound microphone 92. The decoder unit 96 extracts a series of corresponding acoustic feature quantities from the power spectrum information and collates them with the acoustic model stored in the acoustic model storage unit 97 and the language model stored in the language model storage unit 98 to search for the word string closest to the speech. The display unit 99 displays the word string that is the search result of the decoder unit 96. Speech is recognized by these processes.
[0007]
Patent Document 1: Japanese Patent Application Laid-Open No. 11-228047
Patent Document 2: Japanese Patent Application Laid-Open No. 4-276799
[0008]
However, the technology disclosed in Patent Document 2 described above has the following
problems.
Since the bone conduction microphone 91 is used, high-frequency components cannot be acquired compared to the air conduction sound microphone 92. FIG. 17 (a) shows spectra in which the same sound was recorded at the same time with a bone conduction microphone and an internally sealed microphone (air conduction sound microphone). As shown in FIG. 17 (a), with the bone conduction microphone no frequency components of 1 kHz or more are obtained, so there is a problem that the recognition accuracy is lowered. In addition, when a bone conduction microphone is used, there is a problem that the contact sound between the human body and the device is picked up.
[0009]
FIG. 17 (b) shows voice waveforms in which the same voice and an instrument operation sound (voltage measurement with a tester) were recorded simultaneously by the bone conduction microphone and the internally sealed microphone (air conduction sound microphone). With the bone conduction microphone, the device operation sound was picked up at a level comparable to the voice waveform, whereas with the internally sealed microphone the influence of the device operation sound was small. In inspection work, contact noise with the device occurs when the operator handles it, and this is largely picked up as vibration. Furthermore, the biggest problem is low robustness against non-stationary noise. The conventional speech recognition apparatus has the problem that, even if voice section detection succeeds, a large noise in a section, even for a short time, affects the word recognition results in a chained manner and causes erroneous recognition.
[0010]
The present invention has been made to solve the above-described problems, and its object is to perform speech recognition that is robust against the mixing of non-stationary noise.
[0011]
A speech recognition device according to the present invention includes: a first air conduction sound microphone that is mounted in the body in a sealed manner and collects sound; a second air conduction sound microphone that is mounted outside the body and collects sound; a first word string recognition unit that recognizes a first word string from the sound picked up by the first air conduction sound microphone; a first word section extraction unit that extracts the utterance section corresponding to each word constituting the first word string; a second word string recognition unit that recognizes, for the utterance sections extracted by the first word section extraction unit, a second word string from the sound collected by the second air conduction sound microphone; and a word string replacement unit that replaces those words of the first word string whose noise level is equal to or less than a predetermined value with the second word string.
[0012]
According to the present invention, among the first word string recognized from the voice collected by the first air conduction sound microphone, the words whose noise level is equal to or less than a predetermined value are replaced with the second word string recognized from the voice collected by the second air conduction sound microphone, so the word string can be recognized robustly against the mixing of non-stationary noise.
[0013]
FIG. 1 is a block diagram showing the configuration of a speech recognition device according to a first embodiment.
FIG. 2 is an explanatory view showing the configuration of the first and second air conduction sound microphones according to the first embodiment.
FIG. 3 is a flowchart showing the operation of the speech recognition device according to the first embodiment.
FIG. 4 is a diagram showing speech waveforms of the first and second air conduction sound microphones of the speech recognition device according to the first embodiment.
FIG. 5 is a diagram showing the power of the first air conduction sound of the speech recognition device according to the first embodiment.
FIG. 6 is a diagram showing detection of a voice section by the voice section detection unit of the speech recognition device according to the first embodiment.
FIG. 7 is an explanatory view showing word information corresponding to the start and end frames of the speech recognition device according to the first embodiment.
FIG. 8 is a diagram showing the coherence of the first air conduction sound microphone with respect to the second air conduction sound microphone of the speech recognition device according to the first embodiment.
FIG. 9 is a diagram showing the first and second air conduction sound powers of the speech recognition device according to the first embodiment.
FIG. 10 is a diagram showing the differential power of the speech recognition device according to the first embodiment.
FIG. 11 is a diagram showing the maximum differential power and the determination result of the speech recognition device according to the first embodiment.
FIG. 12 is a diagram showing a search result of the second decoder unit of the speech recognition device according to the first embodiment.
FIG. 13 is a diagram showing an example of a language model of the speech recognition device according to the first embodiment.
FIG. 14 is a block diagram showing the configuration of a speech recognition device according to a second embodiment.
FIG. 15 is a flowchart showing the operation of the speech recognition device according to the second embodiment.
FIG. 16 is a block diagram showing the configuration of the conventional speech recognition device.
FIG. 17 is a diagram showing the sound collection spectrum and collected voice waveform of the conventional speech recognition apparatus.
[0014]
Hereinafter, the technical terms used in the description follow the terms shown in References 1 to 3 below, and References 1 to 3 should be consulted for details of the known analysis methods.
[Reference 1] Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, Mikio Yamamoto, "Speech Recognition System", Ohmsha, May 15, 2001.
[Reference 2] Kenji Kita, "Probabilistic Language Model", University of Tokyo Press, November 25, 1999.
[Reference 3] Seiichi Nakagawa, "Speech Recognition by Probability Models", The Institute of Electronics, Information and Communication Engineers, July 1, 1988.
[0015]
Embodiment 1. FIG. 1 is a block diagram showing the configuration of a speech recognition device according to a first embodiment of the present invention. The speech recognition device includes first and second air conduction sound microphones 11 and 21, A/D conversion units 12 and 22, first and second speech data storage units 13 and 23, first and second power calculation units 14 and 24, a voice section detection unit 15, first and second decoder units (first and second word string recognition units) 16 and 26, a first acoustic model storage unit (acoustic model storage unit) 17, a language model storage unit 18, a first word section extraction unit 19, a word section determination unit 25, a second acoustic model storage unit 27, a word string replacement unit 28, and a display unit 29.
[0016]
The first air conduction sound microphone 11 is a microphone that picks up the voice of the speaker and is an air conduction sound microphone inserted into the body. The second air conduction sound microphone 21 is a close-talking microphone that picks up sound at the mouth of the speaker. FIG. 2 is a diagram showing the configuration and structure of the first and second air conduction sound microphones of the speech recognition apparatus according to the first embodiment. FIG. 2 (a) shows the configuration and a mounting example of the first and second air conduction sound microphones, and FIG. 2 (b) shows the structure of the first air conduction sound microphone. As shown in FIG. 2 (a), the first air conduction sound microphone 11 and the second air conduction sound microphone 21 are connected by a boom 21′; the first air conduction sound microphone 11 is inserted into the user's ear canal, and the second air conduction sound microphone 21 is located at the user's mouth. Furthermore, as shown in FIG. 2 (b), the first air conduction sound microphone 11 is shaped so that its small microphone portion 11a can be inserted into the ear hole, and the insertion opening is covered with a soundproof member 11b. By inserting the microphone portion 11a, the ear canal is sealed to block external sound, and the air conduction sound transmitted from the tympanic membrane is collected.
[0017]
Next, the A/D conversion units 12 and 22 A/D-convert and quantize the analog data input from the first and second air conduction sound microphones 11 and 21. The first and second speech data storage units 13 and 23 store the quantized data converted by the A/D conversion units 12 and 22, respectively. The first power calculation unit 14 acquires the quantized data of the first air conduction sound microphone 11 from the first speech data storage unit 13 and extracts a power spectrum from the quantized data using a short-time Fourier analysis method or an LPC analysis method for speech signals (see the references).
[0018]
The voice section detection unit 15 detects a voice section using the power spectrum of the first air conduction sound microphone 11. Since the method of voice section detection is known, its description is omitted. The first decoder unit 16 extracts a series of corresponding acoustic feature quantities from the power spectrum information of the voice section detected by the voice section detection unit 15 and, by collating them with the acoustic model stored in the first acoustic model storage unit 17 and the language model stored in the language model storage unit 18, searches for the word string closest to the voice picked up by the first air conduction sound microphone 11 and outputs it together with the utterance section corresponding to each word.
[0019]
The first acoustic model storage unit 17 stores an acoustic model suitable for recognizing a voice
collected by the first air conduction sound microphone 11. The language model storage unit 18
stores language models. The first word section extraction unit 19 extracts an utterance section
corresponding to each word constituting the word string searched by the first decoder unit 16.
[0020]
The second power calculation unit 24 acquires, for the sections corresponding to the utterance sections extracted by the first word section extraction unit 19, the quantized data of the second air conduction sound microphone 21 from the second speech data storage unit 23 and extracts a power spectrum from the quantized data. The word section determination unit 25 compares the power spectrum of the first air conduction sound microphone 11 input from the first word section extraction unit 19 with the power spectrum of the second air conduction sound microphone 21 input from the second power calculation unit 24, and determines, for each word, whether to use the utterance section of the first air conduction sound microphone 11 or the utterance section of the second air conduction sound microphone 21.
[0021]
For the partial utterances determined by the word section determination unit 25 to use the utterance section of the second air conduction sound microphone 21, the second decoder unit 26 extracts a series of corresponding acoustic feature quantities from the power spectrum information of the second air conduction sound microphone 21 and, by collating them with the acoustic model stored in the second acoustic model storage unit 27 and the language model stored in the language model storage unit 18, searches for the word string closest to the voice of the second air conduction sound microphone 21. The word string replacement unit 28 replaces the corresponding words in the word string that is the recognition result of the first decoder unit 16 with the word string found by the second decoder unit 26. The display unit 29 displays the word string resulting from the replacement by the word string replacement unit 28.
[0022]
Next, details of the process by which the speech recognition apparatus according to the first embodiment recognizes speech and outputs a word string will be described with reference to the flowchart of FIG. 3. The operation is explained along a specific example in which machine operation noise is mixed into the utterance "tei-anzen-kyori-kakuho-suicchi" (bottom safety distance securing switch).
[0023]
The first and second air conduction sound microphones 11 and 21 pick up voice, convert it into electric signals, and input the signals as analog data (step ST1). The A/D conversion unit 12 A/D-converts and quantizes the analog data input from the first air conduction sound microphone 11 in step ST1, and then stores the data as digital data in the first speech data storage unit 13. Similarly, the A/D conversion unit 22 A/D-converts and quantizes the analog data input from the second air conduction sound microphone 21 in step ST1, and then stores it as digital data in the second speech data storage unit 23 (step ST2). FIG. 4 shows sound waveforms in which the time axes of the sound of the first air conduction sound microphone 11 and the sound of the second air conduction sound microphone 21 are aligned. In FIG. 4, mechanical operation noise is superimposed between 1.43 seconds and 2.02 seconds. It can also be seen from the sound waveforms in FIG. 4 that, in the portion where the mechanical operation sound is superimposed, there is a large difference between the sound collected by the first air conduction sound microphone 11 and the sound collected by the second air conduction sound microphone 21.
[0024]
The first power calculation unit 14 performs spectrum analysis of the quantized data stored in the first speech data storage unit 13 using a short-time Fourier analysis method or an LPC analysis method for speech signals (for details of each analysis method, see the references). The framing process in the first embodiment is performed with a frame length of 20 ms and a frame interval of 10 ms. The first power calculation unit 14 stores the average of the obtained power for each frame (step ST3). The power of the first air conduction sound stored for each frame is shown in FIG. 5.
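As an illustrative sketch of this framing, the per-frame average power of step ST3 could be computed as follows. The sampling rate and window are assumptions here (the patent does not specify them); 16 kHz and a rectangular window are used for concreteness.

```python
import numpy as np

def frame_average_power(samples, sample_rate=16000, frame_ms=20, shift_ms=10):
    """Average power per frame, as in step ST3: 20 ms frames, 10 ms shift.
    `samples` is the quantized speech data from a speech data storage unit."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 320 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    x = np.asarray(samples, dtype=np.float64)
    return np.array([np.mean(x[t * shift : t * shift + frame_len] ** 2)
                     for t in range(n_frames)])
```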
[0025]
The voice section detection unit 15 detects a voice section using only the voice collected by the first air conduction sound microphone 11, referring to the power spectrum calculated by the first power calculation unit 14 (step ST4). The voice section detected from the voice of the specific example "tei-anzen-kyori-kakuho-suicchi" is shown in FIG. 6.
[0026]
The first decoder unit 16 extracts a series of corresponding acoustic feature quantities from the power spectrum information input from the voice section detection unit 15 and, by collating them with the first acoustic model stored in the first acoustic model storage unit 17 and the language model stored in the language model storage unit 18, searches for the word string closest to the voice picked up by the first air conduction sound microphone 11 (step ST5). In detail, the frame-synchronous word string search process described in the references is performed, and the recognized words and the frame numbers of the start and end of each word are stored. It is assumed that the search process of step ST5 yields a word string that partially misrecognizes the voice of the specific example, for instance recognizing "seven" in place of "bottom". The correspondence between the search result and the speech waveform is shown in FIG. 6, and the obtained start frame numbers, end frame numbers, and corresponding recognition words are shown in FIG. 7.
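The stored search result can be pictured as a list of (recognition word, start frame, end frame) records. The sketch below mirrors the frame numbers of FIG. 11; the word labels in the noisy sections and the boundary between words 3 and 4 are hypothetical placeholders, since they are not recoverable from the translation.

```python
from dataclasses import dataclass

@dataclass
class RecognizedWord:
    word: str         # recognition word from the first decoder unit 16
    start_frame: int  # start frame number (ws)
    end_frame: int    # end frame number (we)

# Word numbers 1-5 with the frame ranges of FIG. 11.
first_decoder_result = [
    RecognizedWord("seven", 567, 922),       # word 1 (misrecognition)
    RecognizedWord("<word 2>", 922, 1434),   # word 2 (misrecognition)
    RecognizedWord("distance", 1434, 1800),  # word 3 (boundary assumed)
    RecognizedWord("securing", 1800, 2164),  # word 4 (boundary assumed)
    RecognizedWord("<word 5>", 2164, 2722),  # word 5 (misrecognition)
]
```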
[0027]
Here, the first acoustic model used in the search in step ST5 is an acoustic model obtained by learning, in advance, the collected voice of the first air conduction sound microphone 11 using an HMM (Hidden Markov Model). The sound collected by the first air conduction sound microphone 11 is robust against external noise, but its characteristics differ greatly from those of a conventional microphone such as the second air conduction sound microphone 21. The coherence of the first air conduction sound microphone 11 with respect to the second air conduction sound microphone 21 is shown in FIG. 8. The graph of FIG. 8 was obtained by recording, for each of five speakers, the same phonetically balanced sentences with both the first air conduction sound microphone 11 and the second air conduction sound microphone 21 and calculating the coherence.
[0028]
As shown in FIG. 8, the sound of the first air conduction sound microphone 11 has very low correlation with that of the second air conduction sound microphone 21, and the variation between speakers is large. For this reason, with an acoustic model (second acoustic model) learned from the sound collected by a conventional microphone such as the second air conduction sound microphone 21, the voice of the first air conduction sound microphone 11 cannot be recognized; apart from the model for the second air conduction sound microphone 21, an acoustic model (first acoustic model) learned from the collected voice of the first air conduction sound is required. The language model stored in the language model storage unit 18 is a word N-gram model similar to conventional ones (see the references).
[0029]
The first word section extraction unit 19 extracts the utterance section of the word corresponding to each word searched by the first decoder unit 16 in step ST5 (step ST6). The utterance section of a word is given by its start frame number and end frame number. The second power calculation unit 24 extracts, from the second speech data stored in the second speech data storage unit 23, the speech of the sections corresponding to the utterance sections of the words extracted by the first word section extraction unit 19 in step ST6, and extracts the power spectrum (step ST7). FIG. 9 shows the power of the first and second air conduction sound microphones 11 and 21 corresponding to the extracted utterance sections; the power of the first air conduction sound microphone 11 is indicated by a solid line, and that of the second air conduction sound microphone 21 by a broken line.
[0030]
The word section determination unit 25 calculates the differential power between the second air conduction sound power X2 and the first air conduction sound power X1 and obtains the maximum differential power Nw in the word section by the following equation (1). The maximum differential power Nw is the maximum value of the noise level in the corresponding word section.

N_w = \max_{w_s \le t \le w_e} \left( X_2(t) - X_1(t) \right)   (1)

In equation (1), w denotes a word number, ws the start frame number of the word, and we the end frame number of the word.
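A direct transcription of equation (1), assuming the per-frame powers X1 and X2 are already aligned on the same frame axis:

```python
def max_differential_power(x1, x2, ws, we):
    """Equation (1): Nw = max over frames ws..we of (X2(t) - X1(t)).
    x1, x2: per-frame powers of the first and second air conduction
    sound microphones; ws, we: start and end frame numbers of word w."""
    return max(x2[t] - x1[t] for t in range(ws, we + 1))
```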
[0031]
A determination process is performed as follows (step ST8): if the maximum differential power Nw calculated by equation (1) exceeds a predetermined threshold (determination "0"), the utterance section of the first air conduction sound microphone 11 is used as the speech recognition target; if it is within the predetermined threshold (determination "1"), the utterance section of the second air conduction sound microphone 21 is used as the speech recognition target. FIG. 10 shows the value of the differential power for each frame number. Further, FIG. 11 shows, for each word, the maximum differential power Nw in its utterance section (start frame number and end frame number), the determination result by the word section determination unit 25, the word number, and the recognition word that is the search result of the first decoder unit 16. In the example shown in FIG. 11, the predetermined threshold is set to 12, and the determination for the range of word numbers 1 and 2 (frames 567 to 1434) and the range of word number 5 (frames 2164 to 2722) is "1", so the utterance sections of the second air conduction sound microphone 21 are used as speech recognition targets. On the other hand, the determination for the range of word numbers 3 to 4 (frames 1434 to 2164) is "0", and the utterance section of the first air conduction sound microphone 11 is used as the speech recognition target.
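The determination of step ST8 then reduces to comparing Nw with the threshold for each word. A minimal sketch, with words given as (word, ws, we) tuples and the threshold of 12 taken from the FIG. 11 example:

```python
def judge_word_sections(words, x1, x2, threshold=12.0):
    """Step ST8: determination "0" (Nw exceeds the threshold) -> use the
    first microphone's utterance section; determination "1" (Nw within
    the threshold) -> use the second microphone's utterance section."""
    decisions = []
    for _word, ws, we in words:
        nw = max(x2[t] - x1[t] for t in range(ws, we + 1))  # equation (1)
        decisions.append(0 if nw > threshold else 1)
    return decisions
```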
[0032]
Subsequently, referring to the determination results of step ST8, the second decoder unit 26 extracts, for the ranges of word numbers that use the utterance sections of the second air conduction sound microphone 21 (word numbers 1 to 2 and 5 described above), a series of corresponding acoustic feature quantities from the power spectrum information calculated by the second power calculation unit 24 and, by collating them with the second acoustic model stored in the second acoustic model storage unit 27 and the language model stored in the language model storage unit 18, searches for the word string closest to the voice of the second air conduction sound microphone 21 (step ST9). Details of the word string search process in the second decoder unit 26 will be described later.
[0033]
As a result of the word string search process of the second decoder unit 26 in step ST9, "bottom safety" is obtained for the range of word numbers 1 and 2 (frames 567 to 1434), and "switch" for the range of word number 5 (frames 2164 to 2722). The correspondence with the speech waveform is shown in FIG. 12. The word string replacement unit 28 replaces the corresponding word strings of the recognition result of the first decoder unit 16 with the recognition results of the second decoder unit 26. That is, the words of frames 567 to 1434 shown in FIG. 11 are replaced with "bottom safety", and those of frames 2164 to 2722 with "switch" (step ST10).
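The replacement of step ST10 can be sketched as replacing each maximal run of words judged "1" as a whole with the second decoder's word string for that frame range. The dictionary layout is an assumption for illustration; the patent does not specify the data structures.

```python
def replace_word_strings(first_words, decisions, second_strings):
    """Step ST10: first_words is a list of (word, ws, we) tuples from the
    first decoder, decisions the 0/1 judgments of step ST8, and
    second_strings maps a (ws, we) frame span to the second decoder's
    word string for that span."""
    output, i = [], 0
    while i < len(first_words):
        if decisions[i] == 1:
            j = i
            while j + 1 < len(first_words) and decisions[j + 1] == 1:
                j += 1                      # extend the run of "1" words
            span = (first_words[i][1], first_words[j][2])
            output.append(second_strings[span])
            i = j + 1
        else:
            output.append(first_words[i][0])
            i += 1
    return output

# With the FIG. 11 example, decisions [1, 1, 0, 0, 1] and
# second_strings = {(567, 1434): "bottom safety", (2164, 2722): "switch"}
# yield ["bottom safety", "distance", "securing", "switch"].
```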
[0034]
The display unit 29 displays the word string "bottom safety distance securing switch" on which the replacement process of step ST10 has been performed (step ST11), and the process ends.
[0035]
Next, details of the word string search process in the second decoder unit 26 will be described. The second decoder unit 26 uses, for example, a second acoustic model of phoneme HMMs learned in advance with the Baum-Welch algorithm (see the references) and the language model stored in the language model storage unit 18 (see FIG. 13 for its data), and models words with a tree-structured dictionary (see the references).
[0036]
Also, using the N-gram grammar likewise recorded in the language model, the inter-word transition probability is approximated by the following equation (2) to calculate the output probability P(W) of the language model:

P(W) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1})   (2)

where W is a word string w1, w2, ..., wn, N is the order of the N-gram, and n is the number of words. Here, W is decomposed into the word string W1 of the portions where the determination by the word section determination unit 25 is "0" and the word string W2 of the portions where the determination is "1", and the probability of each part is calculated separately. That is, in this embodiment the calculation is performed using the 1-gram log probabilities of FIG. 13:

\log P(W) = \log P(W_1) + \log P(W_2) = \sum_{w_i \in W_1} \log P(w_i) + \sum_{w_i \in W_2} \log P(w_i)

In the present embodiment, 1-grams are used for simplicity of explanation, but word concatenation probabilities of 2-grams or more may be used so that the connection between the word string W1 and the word string W2 is taken into account; in that case, "#" is a symbol marking the beginning and end of a sentence.
[0037]
In addition, continuous speech recognition is performed by a search algorithm (see the references) using the series of acoustic feature quantities described above and the inter-word transition probabilities. The collation of a portion y of the input speech is performed with the HMMs representing the acoustic feature quantities in phoneme units, computing the likelihood in equation (3):

P(y \mid Y)   (3)

where Y denotes the phoneme string m1, m2, ..., mj.
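The patent defers the HMM collation itself to the references. As a generic illustration (not the patent's specific implementation) of how a likelihood such as P(y | Y) in equation (3) is computed against an HMM, here is a standard log-space forward algorithm:

```python
import numpy as np
from scipy.special import logsumexp

def hmm_log_likelihood(log_b, log_a, log_pi):
    """Log-space forward algorithm: returns log P(y | HMM).
    log_b:  (T, S) log output probabilities of each frame per state,
    log_a:  (S, S) log transition matrix,
    log_pi: (S,)   log initial state distribution."""
    alpha = log_pi + log_b[0]
    for t in range(1, log_b.shape[0]):
        # Sum over previous states, then emit frame t.
        alpha = logsumexp(alpha[:, None] + log_a, axis=0) + log_b[t]
    return float(logsumexp(alpha))
```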
[0038]
As a result, the series Y1 of acoustic feature quantities and the word string W1 for the portions extracted by the first decoder unit 16 and determined as "0" by the word section determination unit 25, and the series Y2 of acoustic feature quantities and the word string W2 for the portions determined as "1", are obtained by the following equation (4):

(\hat{W}_1, \hat{W}_2) = \operatorname*{arg\,max}_{W_1, W_2} P(Y_1 \mid W_1)\,P(W_1)\,P(Y_2 \mid W_2)\,P(W_2)   (4)

where W1 and W2 denote partial word strings of the word string w1, w2, ..., wn.
[0039]
As described above, according to the first embodiment, the first air conduction sound microphone 11 hermetically attached to the body is used as a noise-resistant input microphone, and the second air conduction sound microphone 21 is used as a normal microphone. Within the voice section, utterance sections with a low noise level give priority to the second air conduction sound microphone 21, and in utterance sections with a high noise level the first air conduction sound microphone 11 is used. Further, since the determination of the utterance sections is configured to use the word information of the language model, the word string can be recognized robustly against non-stationary noise.
[0040]
Further, according to the first embodiment, the speech of the sections corresponding to the utterance sections of the words extracted by the first word section extraction unit 19 is extracted from the second speech data to obtain the power spectrum, so the sections for which the second power calculation unit 24 must calculate power can be limited.
[0041]
Further, according to the first embodiment, the second decoder unit 26 performs the word string search process only for the ranges corresponding to the determination results of the word section determination unit 25, and the word string replacement unit 28 replaces the recognition result of the first decoder unit 16 with the corresponding word strings of the recognition result of the second decoder unit 26. Therefore, the second decoder unit 26 only needs to recognize the minimum necessary utterance sections, and the parts with a high noise level can be removed efficiently.
In addition, even when local non-stationary noise is superimposed on part of the voice section, the voice can be recognized with high accuracy.
[0042]
Second Embodiment
In the second embodiment, a configuration is shown in which the first acoustic model is learned automatically using sections in which the noise in the second air conduction sound microphone 21 is small. FIG. 14 is a block diagram showing the configuration of the speech recognition apparatus according to the second embodiment. An operation input unit 31 and a first acoustic model learning unit 32 are added to the speech recognition apparatus according to the first embodiment. In the following, parts identical or corresponding to the constituent elements of the speech recognition apparatus according to the first embodiment are given the same reference numerals as those used in the first embodiment, and their description is omitted or simplified.
[0043]
The operation input unit 31 is input means, constituted by operation buttons or the like, that allows the user to confirm the speech recognition result displayed on the display unit 29 and then input "accept" or "reject" for the speech recognition result. The first acoustic model learning unit 32 stores, as learning sections, the sections in which the recognition results of the first decoder unit 16 and the second decoder unit 26 differ among the word sections replaced by the word string replacement unit 28. Furthermore, it performs connection learning of the words using the voice of the first air conduction sound microphone 11 in the obtained learning sections and the accepted recognition result. The learning result is stored in the first acoustic model storage unit 17 as an acoustic model for the collected voice of the first air conduction sound microphone 11.
[0044]
Next, the operation of the speech recognition apparatus according to the second embodiment will be described. FIG. 15 is a flowchart showing the operation of the speech recognition apparatus according to the second embodiment. Since the processing up to step ST11 is the same as the operation of the speech recognition apparatus shown in the first embodiment, its description is omitted. When the speech recognition result is displayed on the display unit 29 in step ST11, the user inputs acceptance or rejection of the speech recognition result via the operation input unit 31. The operation input unit 31 determines whether acceptance of the speech recognition result has been input (step ST21).
[0045]
When acceptance is input in step ST21, the first acoustic model learning unit 32 acquires information on the word replacement from the word string replacement unit 28, performs a process of extracting, as learning sections, those word sections among the replaced word sections in which the recognition result of the first decoder unit 16 differs from that of the second decoder unit 26 (step ST22), and determines whether a learning section exists (step ST23). If it is determined in step ST23 that a learning section exists, word connection learning is performed using the voice collected by the first air conduction sound microphone 11 in the extracted learning sections and the recognition result after replacement (see the references) (step ST24). The acoustic model learned in step ST24 is stored in the first acoustic model storage unit 17 as the first acoustic model (step ST25), and the process ends. On the other hand, when rejection is input in step ST21 or it is determined in step ST23 that no learning section exists, the processing ends without learning the acoustic model.
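A sketch of the extraction in step ST22, under an assumed data layout in which each decoder's result is a dict from (ws, we) frame spans to words:

```python
def extract_learning_sections(replaced_spans, first_words, second_words):
    """Step ST22: among the replaced word sections, keep those where the
    first and second decoder units produced different words; these become
    the learning sections of the first acoustic model learning unit 32."""
    return [span for span in replaced_spans
            if first_words.get(span) != second_words.get(span)]

# In the example of the next paragraph, frames 567-922 are extracted
# because the first decoder recognized "seven" there while the second
# decoder recognized "bottom".
```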
[0046]
The description continues with the example used in the first embodiment (see FIG. 7). In frames 567 to 922, the recognition result of the first decoder unit 16 ("seven") differs from the recognition result "bottom" of the second decoder unit 26 in the same section. Therefore, frames 567 to 922 are extracted as a learning section in step ST22, and it is determined in step ST23 that a learning section exists. Next, in step ST24, word connection learning is performed between the acoustic feature series of the first speech data corresponding to frames 567 to 922 and the recognition result word "bottom" of the second decoder unit 26. Thereafter, in step ST25, the connection learning result is stored in the first acoustic model storage unit 17 as an acoustic model for the voice "bottom" input to the first air conduction sound microphone 11.
[0047]
As described above, according to the second embodiment, when the speech recognition result displayed on the display unit 29 is accepted, the word sections replaced by the word string replacement unit 28 in which the recognition results of the first decoder unit 16 and the second decoder unit 26 differ are used as learning sections for connection learning, and the result of the connection learning is stored as the first acoustic model. The first acoustic model of the first air conduction sound microphone 11 can thereby be learned so as to improve the speech recognition accuracy. Further, since the acoustic model can be learned simply by using the speech recognition apparatus, it becomes possible to improve speech recognition accuracy under high noise.
[0048]
Within the scope of the invention, the present invention allows free combination of the embodiments, modification of any component of the embodiments, or omission of any component in the embodiments.
[0049]
11 first air conduction sound microphone, 11a microphone portion, 11b soundproof member, 12, 22 A/D conversion unit, 13 first speech data storage unit, 14 first power calculation unit, 15 voice section detection unit, 16 first decoder unit, 17 first acoustic model storage unit, 18 language model storage unit, 19 first word section extraction unit, 21 second air conduction sound microphone, 21′ boom, 23 second speech data storage unit, 24 second power calculation unit, 25 word section determination unit, 26 second decoder unit, 27 second acoustic model storage unit, 28 word string replacement unit, 29 display unit, 31 operation input unit, 32 first acoustic model learning unit.