Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2005303574
PROBLEM TO BE SOLVED: To provide a voice recognition headset that requires neither spherical wave approximation, even when the sound source is close to the microphone array, nor correction of the arrival time differences between microphones when the speaker moves. SOLUTION: The left and right ear pads 2 and 3 are connected by a flexible support frame 1 and mounted on the user's head; an arm 6 extends from one ear pad 2, and microphones 7A and 7B are disposed on the arm 6. When the user wears the headset, the microphones 7A and 7B lie on the same spherical surface 9B centered on the user's mouth 8, and are fixed to the arm 6 so that their distances from the mouth are equal. A signal processing module 10 performs microphone array processing on the voice signals input from the plurality of microphones, suppresses noise, outputs a voice signal in which the speech is emphasized, recognizes this output, and outputs a recognition result. [Selected figure] Figure 1
Speech recognition headset
[0001]
The present invention relates to headset microphone technology used in hands-free communication, voice recognition, and the like, in which a plurality of microphones is used to emphasize a target voice signal within an input voice signal or to detect the direction of a target voice signal.
[0002]
03-05-2019
1
When speech recognition technology is used in a real environment, ambient noise has a
significant effect on the recognition rate.
For example, when used in a car, there are many noise sources such as engine noise, wind noise, the sounds of oncoming and overtaking vehicles, and the car stereo. Even in relatively quiet places such as offices, there are many noises that interfere with voice recognition, such as footsteps and the sound of closing doors. The present invention applies not only to voice input for voice recognition, but also to voice calls by telephone and the like under such noise environments. These noises mix with the speaker's voice before it is input to the voice recognition device, greatly reducing the recognition rate. One way to solve such noise problems is to use a microphone array. A microphone array performs signal processing on the voices input from a plurality of microphones and outputs a signal in which the target voice is emphasized. Specifically, the target voice is emphasized by forming sharp directivity toward the target voice direction and reducing sensitivity in the other directions.
[0003]
For example, in the case of a delay-and-sum microphone array (see, for example, Non-Patent Document 1), the output signal Se(t) is obtained from the signals Sn(t) (n = 1, ..., N) of the N microphones by shifting each signal by a time difference τ according to the arrival direction of the target voice and then adding. That is, the emphasized speech signal Se(t) is expressed as
[0004]
Se(t) = Σ Sn(t + nτ), summed over n = 1, ..., N   (1)
Here it is assumed that the microphones are arranged at regular intervals in the order of the subscript n. The delay-and-sum array forms directivity in the direction of the target voice by exploiting the phase differences of the incoming signals: the target signal components are added in phase and reinforce one another, while noise arriving from directions other than the target is out of phase across the microphones and cancels.
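As an illustration of equation (1), a delay-and-sum beamformer can be sketched in a few lines. This is not code from the patent: the function name, the use of integer sample delays, and the test tone are assumptions made for illustration only.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Shift each channel by its (integer) sample delay and sum, so the
    target signal adds in phase while off-axis noise partially cancels.
    channels: equal-length 1-D arrays; delays: one integer per channel."""
    out = np.zeros(len(channels[0]))
    for sig, d in zip(channels, delays):
        out += np.roll(sig, -d)  # advance each channel by its delay (circular)
    return out

# Two copies of the same tone arriving one sample apart:
t = np.arange(64)
tone = np.sin(2 * np.pi * t / 16)
mic1 = tone
mic2 = np.roll(tone, 1)                      # delayed by one sample
aligned = delay_and_sum([mic1, mic2], [0, 1])
# because np.roll is circular, aligned equals exactly 2 * tone
```

With the correct delays the two channels align sample for sample, which is the in-phase reinforcement the paragraph above describes.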
[0005]
There is also a method of detecting the sound source direction using an adaptive beamformer (see, for example, Patent Document 1). Now, when the speaker speaks at a relatively short distance from the microphone array, the voice reaches the microphones as a spherical wave. Therefore, even if the speaker speaks directly in front of the microphone array, the sound waves arrive at the microphones at the ends of the array later than at its central microphones. The method shown in equation (1) assumes that the sound source is at an infinite distance and that the sound wave can be approximated as a plane wave; when this assumption does not hold, that is, when the distance to the sound source is comparable to the size of the microphone array, the sound wave must be treated as a spherical wave. Treating it as a spherical wave has the drawback that the calculation is more complicated than for a plane wave; in addition, whenever the speaker moves in the depth direction, the arrival time differences of the sound waves between the microphones change, so these differences must be corrected to keep them constant. For this reason, there is the problem that the speaking position of the speaker is restricted.
[0006]
Non-Patent Document 1: "Acoustic Systems and Digital Processing", Chapter 7, The Institute of Electronics, Information and Communication Engineers, 1995. Patent Document 1: JP-A-11-052977.
[0007]
As described above, when the sound source is close to the microphone array, the plane wave approximation does not hold and a spherical wave approximation is required, the processing becomes complicated, and the speaking position is restricted.
[0008]
The present invention was made to solve the above problems, and its object is to provide a speech recognition headset that requires neither a spherical wave approximation even when the sound source is close to the microphone array, nor correction of the arrival time differences between the microphones when the speaker moves.
[0009]
In order to solve the above problems, the present invention is characterized by comprising: a plurality of microphones for detecting voice and generating voice signals; a microphone support portion for arranging the plurality of microphones; emphasized speech signal generating means for synthesizing the voice signals of the plurality of microphones to generate an emphasized speech signal; and speech recognition means for recognizing the emphasized speech signal.
[0010]
The present invention is also characterized by comprising: a plurality of microphones for detecting voice and generating voice signals; a microphone support portion for arranging the plurality of microphones on the same circumference centered on the mouth; emphasized speech signal generation means for synthesizing the voice signals of the plurality of microphones to generate an emphasized speech signal; and speech recognition means for recognizing the emphasized speech signal.
[0011]
In addition, the present invention is characterized by comprising: a plurality of microphones for detecting voice and generating voice signals; a microphone support portion for arranging the plurality of microphones on the same spherical surface centered on the mouth; emphasized speech signal generation means for synthesizing the voice signals of the plurality of microphones to generate an emphasized speech signal; and speech recognition means for recognizing the emphasized speech signal.
[0012]
Furthermore, it is characterized in that the plurality of microphones are supported so that the distances between them are kept constant.
[0013]
According to the present invention, signal processing can be performed easily even when the sound source is close to the microphone array, and the recognition rate can be improved. In addition, the speech recognition rate can be maintained without restricting the speaker's position.
[0014]
Hereinafter, embodiments of the present invention will be described with reference to the
drawings.
[0015]
1 and 2 show the appearance of a voice recognition headset according to a first embodiment of
the present invention and a schematic system configuration thereof.
FIG. 1 (a) is a front view of the voice recognition headset, and FIG. 1 (b) is a side view seen from
the arrow direction of FIG. 1 (a).
The voice recognition headset comprises the support frame 1, the ear pads 2 and 3, the speakers 4 and 5, the arm 6, microphones 7A and 7B that detect the sound emitted by the wearer (user) of the headset and generate electrical voice signals, and a signal processing module 10 that digitizes the voice signals and performs voice recognition. Although two microphones are used in FIG. 1 for the sake of simplicity, three or more microphones may also be arranged. The directivity of each microphone may also be oriented toward the mouth.
[0016]
This voice recognition headset (hereinafter sometimes referred to simply as the "headset") has a shape in which the left and right ear pads 2 and 3 are connected by the flexible support frame 1, and is used mounted on the head of the user. An arm 6 extends from one ear pad 2, and the microphones 7A and 7B are disposed on the arm 6. When the user wears the headset, the microphones 7A and 7B lie on the same circumference 9A centered on the user's mouth 8, and are fixed to the arm 6 so that their distances from the mouth are equal.
[0017]
The ear pad 2 incorporates the speakers 5 (left and right) and a signal processing module 16; the signal processing module 16 may instead be built into the ear pad 3. Although not shown, the elements are connected by cables as needed.
[0018]
As shown in FIG. 2, the voice input from the first microphone 7A is analog-to-digital converted by the first AD converter 11A and input to the microphone array signal processing unit 12. Similarly, the voice input from the second microphone 7B is analog-to-digital converted by the second AD converter 11B and input to the microphone array signal processing unit 12. The microphone array signal processing unit 12 performs microphone array processing on the voice signals input from the plurality of microphones, suppresses noise, and outputs a voice signal in which the speech is emphasized. The speech recognition unit 13 recognizes this output and outputs a recognition result.
[0019]
FIG. 3 shows an example of the internal structure of the microphone array signal processing unit 12. The microphone voice signals converted into digital signals are summed by the adder 121, and a voice signal in which the signals of the microphones 7A and 7B are emphasized is output. Because the microphones 7A and 7B are disposed on the same circumference 9A centered on the mouth 8, the voice travels an equal distance from the sound source at the mouth 8 to each microphone, without the relative delays a spherical wave would otherwise cause. Therefore, the voice coming from the mouth 8 is emphasized by adding in-phase signals, without requiring a separate delay device.
[0020]
FIG. 4 shows another example of the internal structure of the microphone array signal processing unit 12. This configuration suppresses noise and emphasizes speech using an adaptive beamformer; here the microphone array signal processing unit 12 is composed of an adder 121, a beamformer processing unit 122, and a speech enhancement unit 123.
[0021]
Here, the internal structure of the beamformer processing unit 122 will be described with reference to FIG. 5. The beamformer processing unit 122 applies a filter operation called adaptive beamformer processing, which suppresses the voice signal from the target sound source, to the voice signals obtained by analog-to-digital conversion of the signals of the microphones 7A and 7B. Various methods are known for the processing inside the beamformer processing unit 122, such as the generalized sidelobe canceller (GSC), the Frost beamformer, and the reference signal method. Any adaptive beamformer can be applied in this embodiment; here, a two-channel GSC will be described as an example.
[0022]
FIG. 5 shows, as an example of the beamformer processing unit 122, a configuration example of a general Griffiths-Jim type GSC among two-channel GSCs. The beamformer processing unit 122 is a GSC composed of a subtractor 1221, an adder 1222, a delay unit 1223, an adaptive filter 1224, and a subtractor 1225. Various filters such as LMS, RLS, and projection LMS can be used as the adaptive filter 1224; the filter length La is, for example, La = 50. The delay amount of the delay unit 1223 is, for example, La/2.
[0023]
When an LMS adaptive filter is used as the adaptive filter 1224 of the two-channel Griffiths-Jim GSC shown in FIG. 5 that constitutes the beamformer processing unit 122, the filter is updated as follows. Let n be the time index, W(n) the coefficient vector of the adaptive filter, xi(n) the input signal of the i-th channel, and Xi(n) = (xi(n), xi(n-1), ..., xi(n-La+1)) the input signal vector of the i-th channel. The update is then expressed by the following equations.
[0024]
y(n) = x0(n) + x1(n)   (2)
X'(n) = X1(n) - X0(n)   (3)
e(n) = y(n) - W(n)X'(n)   (4)
W(n+1) = W(n) + μX'(n)e(n)   (5)
When a signal arrives from the direction of the target sound source, the filter in the beamformer processing unit 122 lowers its sensitivity in that direction, so the direction of the target sound source can be estimated by examining the directivity, i.e., the direction dependency of the sensitivity, from the coefficients of this filter.
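Equations (2)-(5) can be sketched as follows. This is an illustrative sketch, not the patent's implementation: the function name and parameters are assumptions, and the La/2 delay of the fixed branch (delay unit 1223) is omitted for brevity.

```python
import numpy as np

def gsc_lms(x0, x1, La=50, mu=0.01):
    """Two-channel Griffiths-Jim GSC with an LMS adaptive filter.

    Fixed branch y = x0 + x1 passes the in-phase target (eq. 2); the
    blocking branch X' = x1 - x0 cancels the target, leaving a noise
    reference (eq. 3). The adaptive filter then removes from y whatever
    correlates with that reference (eqs. 4 and 5).
    """
    W = np.zeros(La)                 # adaptive filter coefficients
    summ = x0 + x1                   # eq. (2)
    diff = x1 - x0                   # eq. (3)
    e = np.zeros(len(x0))
    for n in range(La - 1, len(x0)):
        X = diff[n - La + 1:n + 1][::-1]   # newest sample first
        e[n] = summ[n] - W @ X             # eq. (4)
        W = W + mu * X * e[n]              # eq. (5), LMS update
    return e

# When both channels carry an identical (in-phase) target, the blocking
# branch is zero and the output is simply the fixed-branch sum:
t = np.arange(200)
target = np.sin(2 * np.pi * t / 20)
out = gsc_lms(target, target, La=8, mu=0.05)
```

The identical-channel case shows the blocking structure at work: with X' = 0 the adaptive filter never adapts and the target passes through untouched.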
[0025]
Incidentally, in an environment where there are very many noise sources and no single noise direction can be identified, the noise suppression performance of a beamformer degrades. However, since the input speech itself is directional, a beamformer whose target direction is set to the speech direction can suppress the signal from the target sound source and extract an output consisting of noise only. Therefore, the output of the beamformer processing unit 122 is a noise-only signal, and the speech enhancement unit 123 emphasizes the speech by spectral subtraction (SS) using this signal.
[0026]
Spectral subtraction comes in two forms: 2ch SS, which uses two channels, a reference noise signal and the voice signal, and 1ch SS, which uses only the one-channel voice signal. In this embodiment, the voice is enhanced by 2ch SS using the output of the beamformer processing unit 122 as the reference noise. Normally, the noise signal for 2ch SS is taken from a microphone placed away from the target sound collection microphone so that the target sound is not picked up; however, the nature of that noise signal then differs from the noise actually mixed into the target sound collection microphone, and the accuracy of SS suffers.
[0027]
In the present embodiment, by contrast, the noise signal is extracted by the microphone array system using the plurality of microphones, without a microphone dedicated to noise collection, so there is no such mismatch in the nature of the noise and accurate SS can be performed.
[0028]
The 2ch SS has, for example, the configuration shown in FIG. 6, and the processing of FIG. 6 is performed block by block on the input data. The 2ch SS shown in FIG. 6 includes: a first FFT 1231 that Fourier-transforms the noise signal; a first band power converter 1232 that converts the frequency components obtained by the first FFT into band powers; a noise power calculator 1233 that averages the obtained band powers in the time direction; a second FFT 1234 that Fourier-transforms the voice signal; a second band power converter 1235 that converts the frequency components obtained by the second FFT into band powers; a voice power calculator 1236 that averages those band powers in the time direction; a band weight calculator 1237 that calculates a weight for each band from the obtained noise power and voice power; a weighting unit 1238 that weights the frequency spectrum obtained by the second FFT of the voice signal with the per-band weights; and an inverse FFT 1239 that inverse-transforms the weighted frequency spectrum to output sound.
[0029]
The block length is, for example, 256 points and matches the number of FFT points. At FFT time, windowing is performed with, for example, a Hanning window, and the same processing is repeated while shifting by 128 points, half the block length. Finally, the waveforms obtained by the inverse FFT are added while overlapping by 128 points, undoing the windowing and producing the output.
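The block processing just described — Hanning windowing, FFT, a per-block spectral operation, inverse FFT, and 128-point overlap-add — can be sketched as follows. This is an illustrative sketch, not the patent's code: it uses the periodic form of the Hanning window (so that windows shifted by half the block length sum exactly to one) and a placeholder identity transform in place of the spectral weighting.

```python
import numpy as np

def block_process(x, n_fft=256, hop=128, transform=lambda spec: spec):
    """Window each block with a periodic Hanning window, FFT, apply a
    per-block spectral transform, inverse FFT, and overlap-add by `hop`."""
    # periodic Hann: w[n] + w[n + n_fft/2] == 1, so 50%-overlapped
    # windows reconstruct interior samples exactly
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_fft) / n_fft)
    out = np.zeros(len(x))
    for start in range(0, len(x) - n_fft + 1, hop):
        block = x[start:start + n_fft] * w
        spec = np.fft.rfft(block)
        y = np.fft.irfft(transform(spec), n_fft)
        out[start:start + n_fft] += y     # overlap-add by 128 points
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
y = block_process(x)   # identity transform: interior samples come back exactly
```

With the identity transform, every interior sample is covered by two windows that sum to one, so overlap-add restores the signal; only the first and last 128 samples, covered by a single window, are attenuated.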
[0030]
For conversion to band power, the frequency components are divided into, for example, 16 bands as shown in Table 1, and the sum of squares of the frequency components in each band is taken as that band's power. The noise power and the voice power are calculated for each band by, for example, a first-order recursive filter as follows.
[0031]
pk,n = a·ppk + (1-a)·pk,n-1   (6)
vk,n = a·vvk + (1-a)·vk,n-1   (7)
where k is the band number, n is the block number, p is the averaged band power of the noise channel, pp is the band power of the current block of the noise channel, v is the averaged band power of the voice channel, vv is the band power of the current block of the voice channel, and a is a constant, for example a = 0.5.
[0032]
[0033]
Next, the band weight calculator uses the obtained noise and voice band powers to compute, for example, the weight wk,n for each band by the following equation.
wk,n = |vk,n - pk,n| / vk,n   (8)
Next, using the per-band weights, the frequency components of the voice channel are weighted, for example, by the following equation.
Yi,n = Xi,n · wk,n   (9)
where Yi,n is the weighted frequency component, Xi,n is the frequency component obtained by the second FFT of the voice channel, and i is the frequency component number; the weight wk,n of the band k corresponding to frequency component number i in Table 1 is used.
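Equations (6)-(9) can be sketched in a few small functions. This is an illustrative sketch, not the patent's code; since Table 1 is not reproduced in this translation, an arbitrary `band_of_bin` array stands in for its bin-to-band mapping, and the function names are assumptions.

```python
import numpy as np

def update_average(prev, current, a=0.5):
    """Eqs. (6)/(7): first-order recursive average of band power."""
    return a * current + (1 - a) * prev

def band_weights(voice_power, noise_power):
    """Eq. (8): w_k = |v_k - p_k| / v_k, computed per band."""
    return np.abs(voice_power - noise_power) / voice_power

def apply_weights(spectrum, weights, band_of_bin):
    """Eq. (9): Y_i = X_i * w_k, where band_of_bin[i] gives the band k
    of frequency component i (the role Table 1 plays in the text)."""
    return spectrum * weights[band_of_bin]

# Two bands: one where voice dominates noise, one where they are equal.
v = np.array([4.0, 2.0])          # averaged voice band power
p = np.array([1.0, 2.0])          # averaged noise band power
w = band_weights(v, p)            # -> [0.75, 0.0]
Y = apply_weights(np.ones(4), w, np.array([0, 0, 1, 1]))
```

The example shows the intended behavior: a band whose voice power barely exceeds its noise power is attenuated toward zero, while a voice-dominated band passes mostly unchanged.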
[0034]
The flow of processing in the 2ch SS speech enhancement unit 123 will be described with reference to FIG. 7. First, initialization is performed: for example, block length = 256, number of FFT points = 256, number of shift points = 128, number of bands = 16 (step S101). Next, noise channel data is read into the first FFT 1231, windowed and Fourier-transformed, and the frequency components of the noise are obtained (step S102). Next, voice channel data is read into the second FFT 1234, windowed and Fourier-transformed, and the frequency components of the voice are obtained (step S103). Next, the first band power converter 1232 calculates the band powers of the noise from its frequency components according to the correspondence in Table 1 (step S104). Next, the second band power converter 1235 calculates the band powers of the voice from its frequency components according to the correspondence in Table 1 (step S105). Next, the noise power calculator 1233 obtains the average noise power according to equation (6) (step S106). Next, the voice power calculator 1236 obtains the average voice power according to equation (7) (step S107). Next, the band weight calculator 1237 obtains the band weights according to equation (8) (step S108). Next, the weighting unit 1238 applies the weights obtained in step S108 to the frequency components of the voice according to equation (9) (step S109). Next, the inverse FFT 1239 inverse-transforms the frequency components weighted in step S109 to obtain a waveform, which is output superimposed on the last 128 points of the waveform obtained up to the previous block (step S110).
[0035]
Steps S102 to S110 are repeated until there is no more input. It is convenient to synchronize this block processing with the overall processing, including the beamformer processing; in that case, the block length of the beamformer is made to coincide with the 128-point shift length of the speech enhancement unit. As described above, speech whose noise has been suppressed by the speech enhancement unit 123 is output, and the speech recognition unit 13 recognizes it.
[0036]
Here, the speech recognition unit 13 will be described with reference to FIG. 8, which shows its internal configuration. The voice output of the microphone array signal processing unit 12 is first input to the acoustic analysis unit 131, which converts the input speech into feature parameters. Typical feature parameters used for speech recognition include the power spectrum, obtainable with a bank of band-pass filters or a Fourier transform, and cepstrum coefficients obtained by LPC (linear prediction) analysis; the type of feature parameter is not limited here. The acoustic analysis unit 131 converts the input speech into feature parameters at regular intervals, so its output is a time series of feature parameters (a feature parameter sequence). This feature parameter sequence is supplied to the model matching unit 132.
[0037]
Meanwhile, the recognition vocabulary storage unit 133 stores the reading information of the words needed to create a speech model of each word constituting the recognition vocabulary, together with an identifier corresponding to the recognition result when each word is recognized, for example a command ID. In the present embodiment, voice control by word recognition is described as an example of speech recognition in the headset, but the present invention is not limited to this. The speech recognition unit 13 in the headset can perform speech recognition such as continuous word recognition, sentence recognition, word spotting, and voice intention understanding with a small amount of computation, small memory capacity, and low power consumption, and outputs the result as the speech recognition result.
[0038]
The recognition model creation/storage unit 134 stores in advance the speech model of each word in the recognition vocabulary held by the recognition vocabulary storage unit 133, together with the word ID that the model matching unit 132 outputs as an identification signal when that word becomes the recognition result. Of course, when recognition other than word recognition is performed, the corresponding identification signals are stored instead.
[0039]
The model matching unit 132 obtains the similarity or distance between each speech model of the words to be recognized, stored in the speech model creation/storage unit 134, and the feature parameter sequence of the input speech, and outputs as the recognition result the word ID associated with the speech model whose similarity is maximum (or whose distance is minimum).
[0040]
As matching methods for the model matching unit 132, two approaches are widely used: representing a speech model as a feature parameter sequence and obtaining the distance between the speech model's sequence and the input speech's sequence by DP (dynamic programming), or representing speech models with HMMs (hidden Markov models) and calculating the probability of each speech model given the input feature parameter sequence. Either may be used. The word ID output from the model matching unit 132 is output as the recognition result of the speech recognition unit 13 as it is.
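The first matching option named above — a DP distance between feature parameter sequences — can be sketched as standard dynamic time warping. This is an illustrative sketch, not the patent's implementation; the function names, the Euclidean frame cost, and the word IDs are assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic-programming (DTW) distance between two feature sequences,
    given as 2-D arrays of shape (frames, parameters)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # per-frame distance
            # allow stretch/compression along either sequence
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(features, models):
    """Return the word ID whose stored model is nearest to the input,
    mirroring the minimum-distance selection of the model matching unit."""
    return min(models, key=lambda word_id: dtw_distance(features, models[word_id]))

# A toy vocabulary of two one-dimensional "feature" sequences:
a = np.array([[0.0], [1.0], [2.0]])
models = {"cmd_one": a.copy(), "cmd_two": np.array([[5.0], [6.0]])}
result = recognize(a, models)
```

A matching input has zero DTW distance to its own model, so the minimum-distance word ID is returned, as in the paragraph above.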
[0041]
FIG. 9 shows the results of a recognition experiment in which speech recognition was performed with the speech recognition headset of the present invention. In the conventional method, the input voice was recognized using only one of the microphones mounted on the headset; in the method according to the present invention, microphone array processing was performed on the voice input from the two microphones mounted on the headset, and the emphasized voice was input to the voice recognition device. The recognition rate of speech recognition is improved: for example, errors were reduced by approximately 6% with respect to telephone bell noise.
[0042]
The evaluation also covered the case where the sound source direction detection function of the microphone array signal processing unit 10 of the headset of the present invention is used to reject speech and noise inputs that are not subject to speech recognition. With the conventional method, errors of 3.3 words occurred in one condition, and insertion errors of 1.0 word occurred in the telephone bell condition; by using the sound source direction detection function of the headset according to the present invention, such insertion errors can be prevented.
[0043]
The evaluation data were created by having a person 80 cm beside the speaker to be recognized generate four types of noise, which were recorded and superimposed on the voice of the recognition speaker. The four types of noise are voice, telephone bell, paper-flipping sound, and keyboard sound.
[0044]
As described above, because the plurality of microphones is arranged on the same circumference centered on the mouth, the distance from the mouth, i.e., the sound source, to each microphone is equal. Therefore, when the voice signal of the microphone array system is emphasized, the speech can be emphasized with a simple configuration that needs no delay unit, and the speech recognition rate can be further improved. Furthermore, the speech recognition rate can be maintained without restricting the speaker's position. Although the directivity of the microphones has not been specifically discussed here, if the directivity of each microphone is oriented toward the mouth, the voice signal for recognition can be emphasized further and the recognition rate improved.
[0045]
FIG. 10 shows the appearance of the voice recognition headset according to the second embodiment of the present invention. FIG. 10(a) is a front view of the voice recognition headset, and FIG. 10(b) is a side view seen from the arrow direction of FIG. 10(a). The same components as in the first embodiment described above are denoted by the same reference numerals. The voice recognition headset comprises the support frame 1, the ear pads 2 and 3, the speakers 4 and 5, the arm 6, microphones 7A and 7B that detect the sound emitted by the wearer (user) of the headset and generate electrical voice signals, and a signal processing module 10 that digitizes the voice signals and performs voice recognition. Although two microphones are used in FIG. 10 for the sake of simplicity, three or more microphones may also be arranged.
[0046]
This headset has a shape in which the left and right ear pads 2 and 3 are connected by a flexible support frame 1, and is used attached to the head of the user. An arm 6 extends from one ear pad 2, and the microphones 7A and 7B are disposed on the arm 6. When the user wears the headset, the microphones 7A and 7B lie on the same spherical surface 9B centered on the user's mouth 8, and are fixed to the arm 6 so that their distances from the mouth are equal.
[0047]
The ear pad 2 incorporates the speakers 5 (left and right) and a signal processing module 16; the signal processing module 16 may instead be built into the ear pad 3. Although not shown, the elements are connected by cables as needed.
[0048]
The configuration of the signal processing module 10, from the voice signals detected by the microphones 7A and 7B through to voice recognition, is the same as in the first embodiment, and its description is therefore omitted here. As described above, because the plurality of microphones is arranged on the same spherical surface centered on the mouth, the distance from the mouth to each microphone is equal, as in the first embodiment. Therefore, when the voice signal of the microphone array system is emphasized, the speech can be emphasized with a simple configuration that needs no delay device, and the speech recognition rate can be further improved. Furthermore, the speech recognition rate can be maintained without restricting the speaker's position.
[0049]
BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is an external view of the speech recognition headset of Embodiment 1 of this invention. FIG. 2 is a block diagram showing the configuration of the signal processing module. FIG. 3 is a block diagram showing an example of the internal configuration of the microphone array processing unit. FIG. 4 is a block diagram showing another example of the internal configuration of the microphone array processing unit. FIG. 5 is a block diagram showing an example of the internal configuration of the beamformer processing unit. FIG. 6 is a block diagram showing an example of the internal configuration of the speech enhancement unit. FIG. 7 is a flowchart showing the processing procedure of the speech enhancement unit. FIG. 8 is a block diagram showing an example of the internal configuration of the speech recognition unit. FIG. 9 is a figure showing the evaluation results of recognition performance. FIG. 10 is an external view of the speech recognition headset of Embodiment 2 of this invention.
Explanation of Reference Numerals
[0050]
1 ... support frame; 2, 3 ... ear pads; 4, 5 ... speakers; 6 ... arm; 7A, 7B ... microphones; 8 ... mouth; 10 ... signal processing module; 11A, 11B ... AD converters; 12 ... microphone array signal processing unit; 13 ... speech recognition unit