Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2002091469
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
speech recognition apparatus provided with a microphone array.
[0002]
2. Description of the Related Art In controlling a video conference system or a device by voice, it
is extremely important to capture the speaker's voice with high sound quality using a microphone
located at a distance from the speaker. Microphone arrays have therefore attracted attention as a
means of capturing the speaker's voice with high sound quality even with a microphone located
at a distance from the speaker. However, in order to capture the speaker's voice with high sound
quality using a microphone array, it is necessary to estimate the direction or position of the
speaker.
[0003]
[Problems to be Solved by the Invention] In research to date on estimating direction or position,
many attempts have been made to estimate the sound source position (see, for example, prior art
reference 1: Masato Abe, "Sound source estimation by multiple sensors", Journal of the Acoustical
Society of Japan, Vol. 51, No. 5, pp. 384-389, 1995), but it has remained difficult to estimate the
direction and position of a speaker. As a consequence, there is also the problem that the speech
recognition rate cannot be improved, because noise arriving from other directions is collected as
well.
04-05-2019
1
[0004]
An object of the present invention is to solve the above problems and to provide a speech
recognition apparatus capable of estimating the direction or position of a speaker to improve the
speech recognition rate.
[0005]
SUMMARY OF THE INVENTION A speech recognition apparatus according to the present
invention comprises: a microphone array formed by juxtaposing a plurality of microphones at
predetermined intervals; estimation means for estimating, based on the electric signals output
from the respective microphones, the azimuth angle or position of at least one sound source
received by the microphone array; beamforming means for generating, based on the electric
signals output from the microphones, at least one beam signal corresponding to the direction of
the azimuth angle or position of the at least one sound source estimated by the estimation
means; determination means for determining, based on the at least one beam signal generated
by the beamforming means, whether each beam signal is speech or non-speech using a hidden
Markov model of speech and a hidden Markov model of noise; and speech recognition means for
performing, when the determination means determines that a beam signal is speech, speech
recognition on that beam signal and outputting a speech recognition result.
[0006]
In the speech recognition apparatus, the hidden Markov model of noise is preferably a hidden
Markov model having a mixed Gaussian distribution generated based on a plurality of
environmental sounds.
[0007]
DETAILED DESCRIPTION OF THE INVENTION Embodiments of the present invention will be
described below with reference to the drawings.
[0008]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according
to an embodiment of the present invention.
The speech recognition apparatus according to this embodiment particularly includes the
microphone array 10, and is characterized by including the direction estimation unit 13, the
beam forming unit 14, and the sound source determination unit 16.
That is, in this embodiment, direction estimation is performed on the sound source signals
received by the microphone array 10 and a beam signal is generated; based on the generated
beam signal, a speech model and a noise model of environmental sounds, both using HMMs,
discriminate whether the signal is speech or non-speech (sound source determination) and
whether the sound source is a specific speaker; and, if the sound source is a speaker, speech
recognition is performed on the speech.
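The processing flow just described can be summarized as a short control-flow sketch. Every function below is a hypothetical placeholder standing in for the corresponding unit of the block diagram of FIG. 1, not the patent's actual implementation; the point is only the decision structure (recognition runs solely when the beam signal is judged to be speech):

```python
import numpy as np

def estimate_direction(channels):        # direction estimation unit 13 (placeholder)
    return 90.0                          # hypothetical azimuth in degrees

def form_beam(channels, azimuth):        # beam forming unit 14 (placeholder)
    return channels.mean(axis=0)         # simple average stands in for beamforming

def extract_features(beam):              # feature extraction unit 15 (placeholder)
    return beam.reshape(-1, 16)          # hypothetical 16-dimensional frames

def is_speech(features):                 # sound source determination unit 16 (placeholder)
    return True                          # stands in for the HMM likelihood decision

def recognize(features):                 # speech recognition unit 17 (placeholder)
    return "<recognized text>"           # stands in for word-HMM decoding

def process(channels):
    azimuth = estimate_direction(channels)
    beam = form_beam(channels, azimuth)
    features = extract_features(beam)
    if is_speech(features):              # recognize only when judged to be speech
        return recognize(features)
    return None                          # non-speech: no recognition output

print(process(np.zeros((4, 256))))       # -> <recognized text>
```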
[0009]
For example, as shown in FIG. 2, consider a case where target speech arrives at the microphone
array 10 from the front direction and non-speech white noise arrives from the right direction.
Here, in order to receive the speech with high sound quality, it is necessary to estimate the sound
source direction and then perform beamforming toward the estimated direction. However, even if
the sound source directions can be estimated in this situation, it is still unknown in which
direction the speaker is. Therefore, in the present embodiment, in order to receive only the
speech with high sound quality, the sound source determination unit 16 identifies the sound
source. In the speech recognition apparatus shown in FIG. 1, for the beam signal received with
high sound quality and formed by the beam forming unit 14, the sound source determination
unit 16 calculates likelihoods using a noise model of environmental sounds created from various
environmental sounds and a speech model, and identifies speech versus non-speech to determine
whether the source is a speaker. If the sound source is a speaker, the speech recognition unit 17
recognizes the speech.
[0010]
In the present embodiment, in order to perform sound source identification using an HMM-based
speech model and a noise model of environmental sounds, various environmental sounds are
required in advance for model creation. Therefore, in the present embodiment, the noise model
of environmental sounds was created using the RWCP real environment speech/sound database
(hereinafter referred to as "RWCP-DB"; see, for example, prior art reference 2: S. Nakamura et
al., "Data Collection in Real Acoustical Environments for Sound Scene Understanding and
Hands-Free Speech Recognition", Proc. Eurospeech 99, pp. 2255-2258, 1999). Table 1 shows the
contents of the environmental sound database. Although nine categories are shown in Table 1,
about 100 types of environmental sounds totaling 10,000 samples exist in the database as a
whole.
[0011]
[Table 1] Environmental sound database
---------------------------------------------------------------------------
Category              Example
---------------------------------------------------------------------------
Impact                Tapping a wooden plate, metal plate, plastic case,
                      ceramic, glass, etc. with a wooden stick
Particle fall         Beans, etc. poured into a box
Rupture/destruction   Breaking disposable chopsticks, etc.
Friction              Sawing with a saw, etc.
Gas injection         Sprays, etc.
Elastic sound         Applause (clapping), etc.
Electronic sound      Power-on sounds, electronic musical instruments, etc.
Paper                 Tearing paper, crumpling wrappers, etc.
Metal accessories     Bells, ring tones, etc.
Mechanical            Wind-up springs, etc.
---------------------------------------------------------------------------
[0012]
In the present embodiment, the noise waveform database memory 22 stores the waveform
signals of the above-mentioned 92 types of environmental sounds, and the noise HMM
generation unit 32, based on the noise waveform signals of environmental sounds in the noise
waveform database memory 22, generates a noise hidden Markov model (hereinafter, a hidden
Markov model is referred to as an HMM) having three states and mixed Gaussian output
distributions, using the known EM (Expectation-Maximization) algorithm so as to maximize the
output likelihood, and outputs it to the noise HMM memory 42 for storage. On the other hand,
the phoneme-labeled speech waveform database memory 21 stores, for example, phoneme-labeled
speech waveform signals recorded when a specific speaker reads predetermined sentences (text).
Based on the speech waveform signals in the speech waveform database memory 21, the
phoneme HMM generation unit 31 generates, for each of a total of 54 phonemes, a three-state
phoneme HMM having mixed Gaussian output distributions, using a plurality of signal data for
each phoneme and the EM algorithm so as to maximize the output likelihood, and outputs the
phoneme HMMs to the phoneme HMM memories 41-1 to 41-54 for storage. Here, these 54
phoneme HMMs constitute the speech HMM of a specific speaker, and are used together with the
above-described noise HMM to discriminate between speech and non-speech in the sound source
determination unit 16. Note that, for the speech recognition unit 17, phoneme-based word HMMs
of non-specific speakers are generated in advance by a known method and stored in the
phoneme-based word HMM memory 51.
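The sound source determination sketched above selects whichever model assigns the observed feature frames the highest likelihood. The patent uses three-state mixture-Gaussian HMMs trained with the EM algorithm; the following numpy sketch deliberately simplifies each class model to a single diagonal Gaussian, just to illustrate the maximum-likelihood class decision. All names and the toy data are hypothetical, not from the patent:

```python
import numpy as np

class DiagGaussian:
    """One diagonal-Gaussian class model -- a simplified stand-in for the
    per-class mixture-Gaussian HMM likelihoods of the determination unit."""
    def fit(self, frames):
        self.mean = frames.mean(axis=0)
        self.var = frames.var(axis=0) + 1e-6   # variance floor
        return self

    def log_likelihood(self, frames):
        z = (frames - self.mean) ** 2 / self.var
        per_frame = -0.5 * (np.log(2 * np.pi * self.var) + z).sum(axis=1)
        return float(per_frame.sum())

def determine_source(frames, models):
    """Select the class model with the highest log-likelihood."""
    return max(models, key=lambda name: models[name].log_likelihood(frames))

# Toy training data: 'speech' frames centered at +2, 'noise' frames at -2.
rng = np.random.default_rng(0)
speech_train = rng.normal(+2.0, 1.0, size=(500, 16))
noise_train = rng.normal(-2.0, 1.0, size=(500, 16))
models = {"speech": DiagGaussian().fit(speech_train),
          "noise": DiagGaussian().fit(noise_train)}

test_frames = rng.normal(+2.0, 1.0, size=(50, 16))
print(determine_source(test_frames, models))   # speech-like frames -> speech
```

A real system would replace `DiagGaussian` with 54 phoneme HMMs plus one noise HMM and compare their Viterbi or forward likelihoods, but the decision rule is the same argmax.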
[0013]
In FIG. 1, the microphone array 10 is constructed by juxtaposing a plurality of microphones 11
on a straight line at predetermined intervals. Each microphone 11 receives input speech or
non-speech, converts it into an electric signal (a speech or non-speech signal), and outputs it to
the A/D converter 12. Next, the A/D converter 12 A/D-converts the electric signal output from
each microphone 11 into a digital data signal at a predetermined sampling frequency, and
outputs the digital data signals to the direction estimation unit 13 and the beam forming unit 14.
[0014]
The direction estimation unit 13 applies, to those input digital data signals at or above a
predetermined threshold level, the known whitened cross-correlation method (hereinafter
referred to as the CSP method; see, for example, prior art reference 3: T. Nishiura et al.,
"Localization of Multiple Sound Sources Based on a CSP Analysis with a Microphone Array",
Proceedings of ICASSP 2000, pp. 1053-1056, 2000) to estimate the CSP coefficients and the
arrival time differences (DOA: Delay of Arrival) of the plurality of digital data signals.
Specifically, as shown in Equation 1 below, the CSP coefficients are calculated by
Fourier-transforming a plurality of digital data signals, normalizing by the amplitude to obtain
the phase difference, and applying the inverse Fourier transform. Next, the arrival time
difference can be estimated by finding the time difference τ at which the CSP coefficient is
largest (the time difference with the strongest correlation). Here, when there is only one sound
source, its direction is estimated by calculating the time difference τ with Equations 1 and 2 and
then estimating the azimuth angle θ with Equation 3. The estimated azimuth angle θ is output
from the direction estimation unit 13 to the beam forming unit 14. In the following equations,
si(n) and sj(n) are the signals received by microphones i and j, c is the speed of sound, d is the
microphone spacing, and Fs is the sampling frequency.
[0015]
(1) CSP_ij(k) = DFT⁻¹[(DFT[si(n)] · DFT[sj(n)]*) / (|DFT[si(n)]| · |DFT[sj(n)]|)]
(2) τ = argmax_k CSP_ij(k)
(3) θ = cos⁻¹(c · τ / (Fs · d))
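Equations 1-3 map directly onto FFT operations. The following numpy sketch implements the CSP (whitened cross-correlation) delay estimate and the azimuth formula; the function names and the white-noise test signal are illustrative, not from the patent, and the delay τ is kept in samples to match Equation 3:

```python
import numpy as np

def csp_delay(si, sj):
    """Equations (1)-(2): whitened cross-correlation (CSP) of two microphone
    signals; returns the delay tau in samples of si relative to sj."""
    n = len(si)
    cross = np.fft.rfft(si) * np.conj(np.fft.rfft(sj))
    cross /= np.maximum(np.abs(cross), 1e-12)   # amplitude normalization (phase only)
    csp = np.fft.irfft(cross, n)                # inverse transform -> CSP coefficients
    k = int(np.argmax(csp))                     # lag with the strongest correlation
    return k - n if k > n // 2 else k           # map wrap-around bins to negative lags

def azimuth(tau, fs, d, c=343.0):
    """Equation (3): azimuth angle in degrees from delay tau (samples),
    sampling frequency fs, microphone spacing d, and speed of sound c."""
    return float(np.degrees(np.arccos(np.clip(c * tau / (fs * d), -1.0, 1.0))))

# Example: the same white noise arrives 3 samples later at microphone i.
rng = np.random.default_rng(1)
s = rng.standard_normal(1024)
tau = csp_delay(np.roll(s, 3), s)
print(tau)                                      # 3
print(azimuth(0, fs=12000, d=0.0283))           # zero delay -> broadside (90 degrees)
```

Whitening makes the cross-spectrum a pure phase term, so the inverse transform peaks sharply at the true delay even for broadband signals; this is the property the CSP method exploits.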
[0016]
Even when there are a plurality of sound sources, the azimuth angles θ can be calculated
similarly by a known method; in this case, information on the plurality of azimuth angles θ is
output to the beam forming unit 14, and the beam forming unit 14 generates a plurality of beam
signals. Although the CSP method is used for direction estimation, the present invention is not
limited to this; known methods may be used instead, such as an improved method combining the
addition of CSP coefficients with clustering of multiple sound source directions using
beamforming, sound source estimation by beamforming, sound source direction estimation by
the MUSIC method, or sound source direction estimation by the minimum variance method.
Further, although only direction estimation is performed in the present embodiment, the
apparatus may instead be configured with two sets of microphone arrays 10 juxtaposed to each
other, so that the intersection of the two beams is identified as the position of the sound source.
[0017]
The beam forming unit 14 is configured as a so-called transversal filter circuit, or
delay-and-sum array circuit, comprising a plurality of delay lines cascade-connected to each
other, multipliers that multiply the signals from the taps of the respective delay lines by
weighting factors calculated from the azimuth angle information supplied by the direction
estimation unit 13, and an adder that adds the output signals of the respective multipliers.
Based on the digital data signals output from the A/D converter 12 and the azimuth angle
information, the beam forming unit 14 generates at least one beam signal for at least one
azimuth estimated by the direction estimation unit 13, and outputs it to the feature extraction
unit 15. Next, the feature extraction unit 15 extracts from each input beam signal a feature
vector including, for example, 16th-order mel-cepstrum coefficients, 16th-order Δ mel-cepstrum
coefficients, and Δ power, and outputs it to the sound source determination unit 16 and the
speech recognition unit 17. Then, based on the feature vector of each input beam signal, the
sound source determination unit 16 calculates likelihoods using the speech HMMs in the
phoneme HMM memories 41-1 to 41-54 and the non-speech noise HMM in the noise HMM
memory 42, and, by selecting the HMM with the highest likelihood, determines whether the
signal is speech or non-speech (noise or environmental sound) and, in the case of speech, which
phoneme it is; the determination information is output to the speech recognition unit 17.
Furthermore, when the sound source determination unit 16 determines that the input sound
source signal is speech, the speech recognition unit 17 calculates likelihoods using the word
HMMs in the phoneme-based word HMM memory 51 based on the feature vectors sequentially
output from the feature extraction unit 15, performs speech recognition by the maximum
likelihood criterion, and outputs a character string as the speech recognition result.
[0018]
In the above embodiment, the direction estimation unit 13, the beam forming unit 14, the
feature extraction unit 15, the sound source determination unit 16, the speech recognition unit
17, the phoneme HMM generation unit 31, and the noise HMM generation unit 32 are
implemented by, for example, a computer such as a digital computer, while the speech waveform
database memory 21, the noise waveform database memory 22, the phoneme HMM memories
41-1 to 41-54, the noise HMM memory 42, and the phoneme-based word HMM memory 51 are
implemented by, for example, a storage device such as a hard disk.
[0019]
EXAMPLE The aforementioned RWCP-DB also contains acoustic transfer characteristics
measured with the microphone array 10 in various environments. A virtual real-environment
experiment was therefore conducted using the acoustic transfer characteristics in RWCP-DB
together with environmental sounds and speech. In this example, the sound source
discrimination performance was evaluated experimentally for the case where the sound source
position is known. The experimental environment is shown in FIG. 2: the sound sources are a
target sound source in the front direction of the microphone array 10 and a noise source in the
right direction. Next, Table 2 shows the experimental conditions. Under these conditions, the
sound source discrimination performance at each SNR was evaluated using one microphone 11
and using the microphone array 10.
[0020]
[Table 2] Experimental conditions
---------------------------------------------------------------------------
Microphone array:               element spacing 2.83 cm
Beam forming unit:              delay-and-sum array
Sampling frequency:             12 kHz
Frame length:                   32 msec (Hamming window)
Frame period:                   8 msec
Feature vector:                 MFCC, ΔMFCC, Δ power
HMM acoustic models:            mixture-Gaussian HMMs; 54 speech models
                                (54 phonemes), 1 non-speech (noise) model
Non-speech database:            RWCP-DB
Non-speech (noise) model
  training:                     92 types of environmental sound × 20
Test (open):                    speech: 216 phoneme-balanced words by
                                specific speaker MHT; non-speech: 92 types
                                of environmental sound
Acoustic transfer
  characteristics:              RWCP-DB
Reverberation time:             0.0, 0.3, 1.3 sec
---------------------------------------------------------------------------
[0021]
In this experiment, performance was evaluated by the sound source discrimination rate between
speech and non-speech over 308 sounds: 216 words of speech and 92 kinds of environmental
sounds. Furthermore, when the discrimination result was speech, speech recognition was
performed to evaluate speech recognition performance.
[0022]
FIG. 3 shows the experimental results of the comparative speech recognition apparatus using
one microphone, in the anechoic chamber and in the variable reverberation chamber; the
horizontal axis represents the signal-to-noise power ratio (SNR), and the vertical axis indicates
the sound source identification rate (bar graph) and the speech recognition rate (line graph).
FIG. 4 shows the corresponding experimental results of the speech recognition apparatus
provided with the microphone array 10 of the present embodiment, in the anechoic chamber
and the variable reverberation chamber, with the same axes: SNR on the horizontal axis, and
the sound source identification rate (bar graph) and speech recognition rate (line graph) on the
vertical axis.
[0023]
As is apparent from FIGS. 3 and 4, both the sound source discrimination rate and the speech
recognition rate are clearly improved by using the microphone array rather than one
microphone, which confirms the effectiveness of the microphone array.
[0024]
Further, according to FIG. 4, when the SNR is 0 dB the sound source discrimination rate is 90%
or more even in the variable reverberation chamber (reverberation time T60 = 1.3 sec.), and
98% or more in the anechoic chamber and in the variable reverberation chamber with T60 = 0.3
sec. Since these sound source identification rates hardly differ from the results at an SNR of 20
dB, it can be seen that highly accurate sound source identification is possible even in a low-SNR
environment. However, the speech recognition rate at an SNR of 0 dB is about 88% in the
anechoic chamber and about 68% in the variable reverberation chamber (T60 = 0.3 sec.), which
is worse than at an SNR of 20 dB. In the future, it will be necessary to study a beamformer
capable of receiving speech with higher sound quality. Nevertheless, evaluated from the
viewpoint of HMM-based sound source identification using a microphone array, the
discrimination performance remains high even under high reverberation (the variable
reverberation chamber with T60 = 1.3 sec.) and low SNR, so it was found that, if the position of
the sound source is known in advance, it can be determined with sufficient accuracy whether
the sound source is a speaker.
[0025]
As is apparent from this experiment, according to the speech recognition apparatus of this
embodiment, the direction estimation unit 13 estimates the position of the sound source, the
rate of discriminating whether the sound source is speech or non-speech can be improved over
the prior art, and the speech recognition rate in the case of speech can be significantly improved
over the prior art. Also, since sound source identification is performed using a noise HMM
generated from a plurality of environmental sounds, non-speech discrimination can be performed
at a high identification rate compared with the prior art for many types of environmental sounds
and noises, regardless of their type.
[0026]
As described above in detail, according to the present invention, a speech recognition apparatus
provided with a microphone array in which a plurality of microphones are juxtaposed at
predetermined intervals comprises: estimation means for estimating the azimuth angle or
position of at least one sound source received by the microphone array; beamforming means for
generating, based on the electric signals output from the respective microphones, at least one
beam signal corresponding to the direction of the estimated azimuth angle or position of the at
least one sound source; determination means for determining, using a hidden Markov model of
speech and a hidden Markov model of noise based on the generated at least one beam signal,
whether each beam signal is speech or non-speech; and speech recognition means for
performing, when a beam signal is determined to be speech, speech recognition on that beam
signal and outputting the result. Therefore, the rate of identifying whether a sound source is
speech or non-speech can be improved compared with the prior art, and the speech recognition
rate in the case of speech can be greatly improved compared with the prior art.
[0027]
Also, since the noise HMM generated based on a plurality of environmental sounds is used to
identify the sound source, non-speech discrimination for many types of environmental sounds
and noises can be performed at a high identification rate compared with the prior art, regardless
of the types of environmental sounds and noises.
[0028]
Brief description of the drawings
[0029]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according
to an embodiment of the present invention.
[0030]
FIG. 2 is a diagram showing the direction of target voice and white noise for the microphone
array 10 of the experimental example.
[0031]
FIG. 3 is a graph showing experimental results of the speech recognition device of the
comparative example.
[0032]
FIG. 4 is a graph showing experimental results of the speech recognition apparatus of the present
embodiment.
[0033]
Explanation of sign
[0034]
DESCRIPTION OF SYMBOLS 10: microphone array, 11: microphone, 12: A/D converter, 13:
direction estimation unit, 14: beam forming unit, 15: feature extraction unit, 16: sound source
determination unit, 17: speech recognition unit, 21: phoneme-labeled speech waveform database
memory, 22: noise waveform database memory, 31: phoneme HMM generation unit, 32: noise
HMM generation unit, 41-1 to 41-54: phoneme HMM memory, 42: noise HMM memory, 51:
phoneme-based word HMM memory.