Patent Translate
Powered by EPO and Google

Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2002091469

[0001]
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a speech recognition apparatus provided with a microphone array.

[0002]
2. Description of the Related Art
When controlling a video conference system or a device by voice, it is extremely important to pick up the speaker's voice with high sound quality using a microphone located at a distance from the speaker. Microphone arrays have therefore attracted attention as a means of receiving the speaker's voice with high sound quality even from a distant microphone. To pick up the speaker's voice with high sound quality using a microphone array, however, it is necessary to estimate the direction or position of the speaker.

[0003]
[Problems to be Solved by the Invention]
Many attempts have been made to estimate the direction or position of a sound source (see, for example, prior art reference 1: Masato Abe, "Sound source estimation by multiple sensors," Journal of the Acoustical Society of Japan, Vol. 51, No. 5, pp. 384-389, 1995), but it has remained difficult to determine which of the estimated sources is the speaker. As a consequence, noise arriving from other directions is also picked up, and the speech recognition rate cannot be improved.

[0004]
An object of the present invention is to solve the above problems and to provide a speech recognition apparatus capable of estimating the direction or position of a speaker and thereby improving the speech recognition rate.

[0005]
SUMMARY OF THE INVENTION
A speech recognition apparatus according to the present invention comprises: a microphone array formed by juxtaposing a plurality of microphones at predetermined intervals; estimation means for estimating, based on the electric signals output from the respective microphones, the azimuth angle or position of at least one sound source received by the microphone array; beamforming means for generating, based on the electric signals output from the microphones, at least one beam signal corresponding to the direction of the azimuth angle or position of the at least one sound source estimated by the estimation means; determination means for determining, based on the at least one beam signal generated by the beamforming means, whether each beam signal is speech or non-speech using a hidden Markov model of speech and a hidden Markov model of noise; and speech recognition means for performing, when the determination means determines that a beam signal is speech, speech recognition on that beam signal and outputting a speech recognition result.

[0006]
In the speech recognition apparatus, the hidden Markov model of noise is preferably a hidden Markov model having mixed Gaussian distributions generated based on a plurality of environmental sounds.

[0007]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the present invention will be described below with reference to the drawings.
[0008]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus according to this embodiment includes a microphone array 10 and is characterized by including a direction estimation unit 13, a beam forming unit 14, and a sound source determination unit 16. That is, in this embodiment, direction estimation is performed on the sound source signals received by the microphone array 10 and a beam signal is generated; based on the generated beam signal, an HMM-based speech model and an HMM-based noise model of environmental sounds are used to discriminate whether the signal is speech or non-speech (sound source determination) and whether the sound source is a specific speaker; and, if the sound source is a speaker, speech recognition is performed on the speech.

[0009]
For example, as shown in FIG. 2, consider the case where the target speech arrives at the microphone array 10 from the front direction and non-speech white noise arrives from the right direction. Here, in order to receive the speech with high sound quality, it is necessary to estimate the sound source directions and then perform beamforming toward an estimated direction. However, even if the sound source directions can be estimated in this situation, it is not known in which direction the speaker is. Therefore, in the present embodiment, in order to receive only the speech with high sound quality, the sound source determination unit 16 identifies the sound source. In the speech recognition apparatus shown in FIG. 1, for the beam signal formed by the beam forming unit 14 and received with high sound quality, the sound source determination unit 16 calculates likelihoods using a speech model and a noise model created from various environmental sounds, discriminates speech from non-speech, and thereby identifies whether the source is a speaker. If the sound source is a speaker, the speech recognition unit 17 recognizes the speech.
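The benefit of pointing a beam at the talker in the FIG. 2 situation can be made concrete with a minimal NumPy simulation. This sketch is not part of the patent and simplifies the geometry: the target arrives from the front, so it is identical in every channel, while the interference is modeled as noise that is uncorrelated across microphones rather than as a point source on the right. Averaging the M channels, which is exactly a delay-and-sum beam steered at the front, improves the SNR by roughly 10 log10(M) dB.

```python
# Simplified FIG. 2 simulation: a front-arriving target (zero
# inter-microphone delay) plus per-channel noise; the channel average is
# a delay-and-sum beam steered at the front. All parameters illustrative.
import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 12000                        # microphones; 1 s of samples at 12 kHz

target = rng.standard_normal(N)        # stand-in for the talker's waveform
noise = rng.standard_normal((M, N))    # simplification: uncorrelated noise
channels = target[None, :] + noise     # signals after the A/D converter 12

beam = channels.mean(axis=0)           # delay-and-sum, zero steering delays

def snr_db(signal, residual):
    return 10 * np.log10(np.mean(signal ** 2) / np.mean(residual ** 2))

print(f"single microphone SNR: {snr_db(target, noise[0]):5.1f} dB")      # ~0 dB
print(f"beam signal SNR:       {snr_db(target, beam - target):5.1f} dB") # ~+9 dB
```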
[0010]
In the present embodiment, in order to perform sound source identification using an HMM-based speech model and an HMM-based noise model of environmental sounds, various environmental sounds are required in advance for model creation. Therefore, the noise model of environmental sounds was created using the RWCP real-environment speech and sound database (hereinafter referred to as "RWCP-DB"; see, for example, prior art reference 2: S. Nakamura et al., "Data Collection in Real Acoustical Environments for Sound Scene Understanding and Hands-Free Speech Recognition," Proc. Eurospeech 99, pp. 2255-2258, 1999). Table 1 shows the contents of the environmental sound database. Although nine categories are shown in Table 1, the database as a whole contains about 100 types of environmental sounds, some 10,000 samples in all.

[0011]
[Table 1] Environmental sound database
Category: Example
Impact sounds: tapping a wooden plate, a metal plate, or a plastic case with a wooden stick; tapping ceramics or glass
Particle-fall sounds: beans poured into a box
Elastic sounds: applause (clapping)
Rupture/destruction sounds: breaking disposable chopsticks
Friction sounds: sawing with a saw
Gas-jet sounds: sprays
Electronic sounds: electronic ring tones and instrument sounds
Paper sounds: tearing wrapping paper
Metal-accessory and mechanical sounds: bells; wind-up springs

[0012]
In the present embodiment, the noise waveform database memory 22 stores the waveform signals of the above-mentioned 92 types of environmental sounds. Based on these noise waveform signals, the noise HMM generation unit 32 uses the known EM (Expectation-Maximization) algorithm to generate a noise hidden Markov model (hereinafter, a hidden Markov model is referred to as an HMM) having three states and mixed Gaussian distributions with a plurality of mixtures, so as to maximize the output likelihood, and outputs it to the noise HMM memory 42 for storage. On the other hand, the phoneme-labeled speech waveform database memory 21 stores phoneme-labeled speech waveform signals obtained, for example, when a specific speaker reads predetermined sentences (texts). Based on the speech waveform signals in the memory 21, the phoneme HMM generation unit 31 uses a plurality of signal data for each phoneme and the EM algorithm to generate, for each of a total of 54 phonemes, a three-state phoneme HMM having mixed Gaussian distributions so as to maximize the output likelihood, and outputs each phoneme HMM to the phoneme HMM memories 41-1 to 41-54 for storage. These 54 phoneme HMMs constitute the speech HMM of the specific speaker and are used together with the above-described noise HMM to discriminate between speech and non-speech in the sound source determination unit 16. For the speech recognition unit 17, phoneme-based word HMMs of non-specific speakers are generated in advance by a known method and stored in the phoneme-based word HMM memory 51.
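The model-building step of [0012] can be sketched with the hmmlearn library, which implements EM (Baum-Welch) training of mixture-of-Gaussian HMMs. The patent names no implementation, so hmmlearn is an assumed stand-in; the mixture count, the 33-dimensional feature frames (16 mel-cepstrum + 16 delta + delta power, per the description of unit 15), and the random stand-in training data are all illustrative assumptions.

```python
# Hedged sketch of the HMM generation units 31/32 of [0012], using
# hmmlearn as an assumed stand-in for the patent's EM training.
import numpy as np
from hmmlearn.hmm import GMMHMM

rng = np.random.default_rng(0)

def train_hmm(feature_seqs):
    """EM-train a 3-state HMM with mixture-of-Gaussian emissions."""
    X = np.concatenate(feature_seqs)           # stack all feature frames
    lengths = [len(s) for s in feature_seqs]   # one entry per recording
    model = GMMHMM(n_components=3,             # 3 states, as in [0012]
                   n_mix=4,                    # mixtures per state (assumed)
                   covariance_type="diag",
                   n_iter=20, random_state=0)
    model.fit(X, lengths)                      # Baum-Welch (EM) re-estimation
    return model

# Random stand-ins for the waveform database memories 21 and 22, already
# converted to 33-dim feature frames (16 MFCC + 16 dMFCC + d log power).
noise_seqs = [rng.standard_normal((80, 33)) for _ in range(20)]
noise_hmm = train_hmm(noise_seqs)              # stored in noise HMM memory 42

phonemes = ["a", "i", "u"]                     # 54 phonemes in the patent
phoneme_hmms = {p: train_hmm([rng.standard_normal((30, 33)) + k
                              for _ in range(10)])
                for k, p in enumerate(phonemes)}   # memories 41-1 to 41-54
```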
[0013]
In FIG. 1, the microphone array 10 is constructed by juxtaposing a plurality of microphones 11 on a straight line at predetermined intervals. Each microphone 11 receives input speech or non-speech, converts it into an electric signal (a speech or non-speech signal), and outputs it to the A/D converter 12. The A/D converter 12 then A/D-converts the electric signal output from each microphone 11 into a digital data signal at a predetermined sampling frequency and outputs it to the direction estimation unit 13 and the beam forming unit 14.

[0014]
The direction estimation unit 13 takes, among the input digital data signals, those at or above a predetermined threshold level, and applies the known whitened cross-correlation method (hereinafter referred to as the CSP method; see, for example, prior art reference 3: T. Nishiura et al., "Localization of Multiple Sound Sources Based on a CSP Analysis with a Microphone Array," Proc. ICASSP 2000, pp. 1053-1056, 2000) to estimate CSP coefficients and arrival time differences (DOA: delay of arrival) between the digital data signals. Specifically, as shown in Equation (1) below, the digital data signals are Fourier-transformed and normalized by their amplitudes, so that only the phase difference remains, and the result is inverse-Fourier-transformed to obtain the CSP coefficients. The arrival time difference is then estimated as the lag τ at which the CSP coefficient is largest (the lag of strongest correlation), as in Equation (2). When there is a single sound source, the sound source direction is estimated by computing the time difference τ with Equations (1) and (2) and then the azimuth angle θ with Equation (3); since a lag of τ samples corresponds to a path-length difference of c·τ/Fs between the two microphones, the angle follows directly from the array geometry. The estimated azimuth angle θ is output from the direction estimation unit 13 to the beam forming unit 14. In the following equations, si(n) and sj(n) are the signals received by microphones i and j, c is the speed of sound, d is the microphone spacing, and Fs is the sampling frequency.

[0015]
(1) CSP_ij(k) = DFT⁻¹[ DFT[si(n)] · DFT[sj(n)]* / ( |DFT[si(n)]| · |DFT[sj(n)]| ) ]
(2) τ = argmax_k CSP_ij(k)
(3) θ = cos⁻¹( c·τ / (Fs·d) )

[0016]
Even when there are a plurality of sound sources, the azimuth angles θ can be calculated similarly by known methods; in this case, information on the plurality of azimuth angles θ is output to the beam forming unit 14, which generates a plurality of beam signals. Although the CSP method is used here for direction estimation, the present invention is not limited to this; known methods may be used instead, such as an improved method combining the addition of CSP coefficients with clustering of multiple sound source directions using beamforming, sound source estimation by beamforming, sound source direction estimation by the MUSIC method, or sound source direction estimation by the minimum variance method. Further, although only direction estimation is performed in the present embodiment, the apparatus may be configured with two microphone arrays 10 juxtaposed to each other so that the intersection of the two beams specifies the position of the sound source.
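Equations (1) to (3) map directly onto a few lines of NumPy. The sketch below is illustrative rather than the patent's implementation: it estimates the azimuth from a single microphone pair, and the pair spacing d is taken as 20 cm instead of the 2.83 cm element spacing of Table 2 because, at a 12 kHz sampling rate, a 2.83 cm pair exhibits at most about one sample of inter-channel delay and therefore gives very coarse angular resolution.

```python
# Hedged sketch of the CSP direction estimate, equations (1)-(3).
import numpy as np

def csp_azimuth(si, sj, fs=12000.0, d=0.20, c=340.0):
    """Azimuth (degrees) of a source seen by one microphone pair."""
    Si, Sj = np.fft.rfft(si), np.fft.rfft(sj)
    cross = Si * np.conj(Sj)
    cross /= np.abs(cross) + 1e-12          # eq. (1): whiten, keep phase only
    csp = np.fft.irfft(cross)               # CSP coefficients over all lags
    lags = np.arange(len(csp))
    lags[lags > len(csp) // 2] -= len(csp)  # map circular lags to signed lags
    tau = lags[np.argmax(csp)]              # eq. (2): lag of strongest correlation
    cos_theta = np.clip(c * tau / (fs * d), -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta)) # eq. (3)

# Toy check: microphone i hears the source two samples after microphone j.
rng = np.random.default_rng(1)
s = rng.standard_normal(4096)
print(csp_azimuth(np.roll(s, 2), s))        # about 73.5 degrees
```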
[0017]
The beam forming unit 14 is configured as a so-called transversal filter circuit or delay-and-sum array circuit, comprising a plurality of delay lines cascade-connected to each other, multipliers that multiply the signals from the taps of the respective delay lines by weighting factors calculated from the azimuth angle information supplied by the direction estimation unit 13, and an adder that sums the output signals of the multipliers. Based on the digital data signals output from the A/D converter 12 and the azimuth angle information, the beam forming unit 14 generates at least one beam signal for at least one azimuth estimated by the direction estimation unit 13 and outputs it to the feature extraction unit 15. Next, based on each beam signal input to it, the feature extraction unit 15 extracts a feature vector comprising, for example, 16th-order mel-cepstrum coefficients, 16th-order Δmel-cepstrum coefficients, and Δpower, and outputs it to the sound source determination unit 16 and the speech recognition unit 17. Based on the feature vectors of each input beam signal, the sound source determination unit 16 calculates likelihoods using the speech HMMs in the phoneme HMM memories 41-1 to 41-54 and the non-speech noise HMM in the noise HMM memory 42, and, by selecting the HMM with the highest likelihood, determines whether the signal is speech or non-speech (noise or environmental sound) and, in the case of speech, which phoneme it is; this determination information is output to the speech recognition unit 17. Furthermore, when the sound source determination unit 16 has determined that the input sound source signal is speech, the speech recognition unit 17 calculates likelihoods using the word HMMs in the phoneme-based word HMM memory 51 based on the feature vectors sequentially output from the feature extraction unit 15, performs speech recognition by the maximum-likelihood criterion, and outputs a character string as the speech recognition result.
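The feature extraction and maximum-likelihood determination just described can be sketched as follows. librosa's MFCC front end is an assumed stand-in for the 16th-order mel-cepstrum analysis (using the 32 ms frame length and 8 ms frame period of Table 2), and phoneme_hmms and noise_hmm are assumed to be the models from the earlier training sketch; none of this is the patent's own implementation.

```python
# Hedged sketch of the feature extraction unit 15 and sound source
# determination unit 16 (librosa and hmmlearn are assumed stand-ins).
import numpy as np
import librosa

def beam_features(beam, sr=12000):
    """(frames, 33) feature matrix: 16 MFCC + 16 dMFCC + d(log power)."""
    mfcc = librosa.feature.mfcc(y=beam, sr=sr, n_mfcc=16,
                                n_fft=384, hop_length=96,   # 32 ms / 8 ms
                                window="hamming")
    d_mfcc = librosa.feature.delta(mfcc)
    power = librosa.feature.rms(y=beam, frame_length=384, hop_length=96)
    d_power = librosa.feature.delta(np.log(power + 1e-10))
    return np.vstack([mfcc, d_mfcc, d_power]).T

def determine_source(feats, phoneme_hmms, noise_hmm):
    """Select the maximum-likelihood model: a phoneme (speech) or noise."""
    best_p, best_score = max(((p, m.score(feats))
                              for p, m in phoneme_hmms.items()),
                             key=lambda t: t[1])
    if best_score > noise_hmm.score(feats):
        return "speech", best_p      # beam is handed to recognition unit 17
    return "non-speech", None
```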
[0018]
In the above embodiment, the direction estimation unit 13, the beam forming unit 14, the feature extraction unit 15, the sound source determination unit 16, the speech recognition unit 17, the phoneme HMM generation unit 31, and the noise HMM generation unit 32 are implemented by a computer such as a digital computer, and the speech waveform database memory 21, the noise waveform database memory 22, the phoneme HMM memories 41-1 to 41-54, the noise HMM memory 42, and the phoneme-based word HMM memory 51 are implemented by a storage device such as a hard disk.

[0019]
EXAMPLE
The aforementioned RWCP-DB also contains acoustic transfer characteristics measured with the microphone array 10 in various environments. A simulated real-environment experiment was conducted using these acoustic transfer characteristics together with the environmental sounds and speech. In this example, the sound source discrimination performance was evaluated experimentally for the case where the sound source positions are known. The experimental environment is shown in FIG. 2: the target sound source is in the front direction of the microphone array 10 and the noise source is in the right direction. Table 2 shows the experimental conditions. Under these conditions, the sound source discrimination performance at each SNR was evaluated using a single microphone 11 and using the microphone array 10.

[0020]
[Table 2] Experimental conditions
Microphone array: linear array, element spacing 2.83 cm
Beam forming unit: delay-and-sum array
Sampling frequency: 12 kHz
Frame length: 32 msec (Hamming window)
Frame period: 8 msec
Feature vector: MFCC, ΔMFCC, Δpower
Acoustic models (mixed-Gaussian HMMs): 54 speech models (54 phonemes), 1 non-speech (noise) model
Non-speech database: RWCP-DB
Non-speech (noise) model training: 92 types of environmental sounds × 20 samples
Test (open): speech: 216 phonetically balanced words by the specific speaker MHT; non-speech: 92 types of environmental sounds
Acoustic transfer characteristics: RWCP-DB
Reverberation time: 0.0, 0.3, 1.3 sec

[0021]
In this experiment, performance was evaluated by the sound source discrimination rate for speech and non-speech over 308 sounds: 216 spoken words and 92 types of environmental sounds. Furthermore, when a sound was identified as speech, speech recognition was performed to evaluate the speech recognition performance.

[0022]
FIG. 3 shows the experimental results for the comparative speech recognition apparatus using a single microphone, and FIG. 4 shows the experimental results for the speech recognition apparatus of the present embodiment provided with the microphone array 10. Both figures show results measured in an anechoic chamber and in a variable-reverberation chamber; the horizontal axis represents the signal-to-noise power ratio (SNR), the bar graphs show the sound source identification rate, and the line graphs show the speech recognition rate.

[0023]
As is apparent from FIGS. 3 and 4, both the sound source discrimination rate and the speech recognition rate are clearly better with the microphone array than with a single microphone, confirming the effectiveness of the microphone array.

[0024]
Further, according to FIG. 4, at an SNR of 0 dB the sound source discrimination rate is 90% or more in the variable-reverberation chamber with a reverberation time of T60 = 1.3 sec, and 98% or more in the anechoic chamber and in the variable-reverberation chamber with T60 = 0.3 sec. Since these sound source identification rates hardly differ from the results at an SNR of 20 dB, it can be seen that highly accurate sound source identification is possible even in a low-SNR environment.
However, the speech recognition rate at an SNR of 0 dB is about 88% in the anechoic chamber and about 68% in the variable-reverberation chamber with T60 = 0.3 sec, which is worse than at an SNR of 20 dB. In the future, it will be necessary to study beamformers capable of receiving speech with still higher sound quality. Nevertheless, evaluated from the viewpoint of HMM-based sound source identification using a microphone array, the discrimination performance remains high even under strong reverberation (the variable-reverberation chamber with T60 = 1.3 sec) and at low SNR, showing that, if the position of the sound source is known in advance, it can be determined with sufficient reliability whether the sound source is a speaker.

[0025]
As is apparent from this experiment, according to the speech recognition apparatus of this embodiment, the direction estimation unit 13 estimates the position of the sound source and the apparatus then discriminates whether the sound source is speech or non-speech, so that the identification rate can be improved over the prior art, and the speech recognition rate in the case of speech can be significantly improved over the prior art. Also, since sound source identification is performed using a noise HMM generated from a plurality of environmental sounds, non-speech discrimination can be performed at a high identification rate compared with the prior art for many types of environmental sounds and noises, regardless of their type.

[0026]
As described above in detail, according to the present invention, a speech recognition apparatus provided with a microphone array in which a plurality of microphones are juxtaposed at predetermined intervals comprises: estimation means for estimating the azimuth angle or position of at least one sound source received by the microphone array; beamforming means for generating, based on the electric signals output from the respective microphones, at least one beam signal corresponding to the direction of the estimated azimuth angle or position of the at least one sound source; determination means for determining, based on the generated at least one beam signal, whether each beam signal is speech or non-speech using a hidden Markov model of speech and a hidden Markov model of noise; and speech recognition means for performing, when a beam signal is determined to be speech, speech recognition on that beam signal and outputting the result. The identification rate of whether a sound source is speech or non-speech can therefore be improved over the prior art, and the speech recognition rate for speech can be greatly improved over the prior art.

[0027]
Also, since a noise HMM generated based on a plurality of environmental sounds is used to identify the sound source, non-speech discrimination can be performed at a high identification rate compared with the prior art for many types of environmental sounds and noises, regardless of their type.

[0028]
Brief description of the drawings

[0029]
FIG. 1 is a block diagram showing the configuration of a speech recognition apparatus according to an embodiment of the present invention.

[0030]
FIG. 2 is a diagram showing the directions of the target voice and the white noise with respect to the microphone array 10 in the experimental example.

[0031]
FIG. 3 is a graph showing the experimental results of the speech recognition apparatus of the comparative example.
[0032]
FIG. 4 is a graph showing the experimental results of the speech recognition apparatus of the present embodiment.

[0033]
Explanation of signs

[0034]
10: microphone array, 11: microphone, 12: A/D converter, 13: direction estimation unit, 14: beam forming unit, 15: feature extraction unit, 16: sound source determination unit, 17: speech recognition unit, 21: phoneme-labeled speech waveform database memory, 22: noise waveform database memory, 31: phoneme HMM generation unit, 32: noise HMM generation unit, 41-1 to 41-54: phoneme HMM memories, 42: noise HMM memory, 51: phoneme-based word HMM memory.