JP2012037603

DESCRIPTION JP2012037603
An object of the present invention is to prevent speech sections, such as vowel sections or low-voice sections, from being reflected in a noise model during noise estimation. In a noise estimation apparatus (10), a correlation calculation unit 6 calculates the correlation value of the spectrum between a plurality of frames of sound information acquired by one or more microphones, and a power calculation unit 7 calculates a power value representing the sound level of at least one target frame of the sound information. Using the power value of the target frame and the correlation value between the frames including the target frame, an update determination unit 8 determines an update degree indicating how much the target frame is to be reflected in the noise model recorded in a recording unit 12, or whether the noise model needs to be updated, and an update unit 9 reflects the sound information in the noise model according to the determination. [Selected figure] Figure 1
Noise estimation apparatus, noise estimation method and noise estimation program
[0001]
The present invention relates to a technique for estimating a noise model that can be used for
noise suppression processing (noise canceller) on sound acquired by a microphone.
[0002]
Conventionally, for noise suppression processing of a sound signal received by a microphone, methods have been disclosed that determine whether the section of the input sound signal being processed is a voice section, or whether it is stationary.
For example, as a method of determining whether a frame containing a signal representing background sound is stationary or non-stationary, there is a method of counting the number of consecutive frames in which the change in spectrum is small and determining the section to be noise when that count is equal to or more than a threshold (see, for example, Patent Document 1 below).
[0003]
Further, as a method of evaluating whether a section is a voice section, there is a method that uses the correlation coefficient of the spectrum between adjacent frames (see, for example, Patent Document 2 below). It has also been disclosed that a correlation coefficient can be used as a stationary/non-stationary feature quantity for automatically classifying an acoustic signal (see, for example, Patent Document 3 below).
[0004]
Also, as conventional noise suppression processes, there is a method of suppressing noise by subtracting the value of the estimated noise from the spectrum (the spectral subtraction method; see, for example, Patent Document 4 below), and a method of suppressing distortion of the output signal by correcting the spectrum after noise suppression to a target value when a target value based on the estimated noise is larger than the spectrum after noise suppression (see, for example, Patent Document 5 below). Thus, the estimated value of the noise is used in various ways in noise suppression processing.
[0005]
Patent Document 1: International Publication No. 8-505715
Patent Document 2: International Publication No. 2004/111996 Pamphlet
Patent Document 3: Japanese Patent Application Laid-Open No. 2004-240214
Patent Document 4: US Patent No. 4,897,878
Patent Document 5: Japanese Patent Application Laid-Open No. 2007-183306
[0006]
Here, in order to create a noise model which is data representing the estimated noise, it is
effective to use sound information in a noise section of the input sound.
Therefore, for example, a method may be considered in which it is determined whether the
section to be processed in the input signal is stationary or non-stationary, or a voice section, and
the noise model is estimated based on the determination result and the input signal.
[0007]
However, when a vowel section (especially a long vowel) or a section in which a low voice is being spoken continues over a plurality of frames, the power spectrum tends to remain nearly constant in these sections. Therefore, when the power spectrum of such a vowel section or low-voice section is evaluated with the above-described conventional technology to determine whether it is stationary noise, the section may be erroneously determined to be stationary noise. If, according to this determination, the noise model is updated using the speech spectrum of the vowel section or low-voice section, and noise suppression processing is then executed using such a noise model, the speech component of the vowel section or low-voice section is suppressed as noise. Therefore, an object of the present invention is to prevent speech sections, such as vowel sections or low-voice sections, from being reflected in a noise model during noise estimation.
[0008]
A noise estimation apparatus according to the present disclosure includes a correlation calculation unit that calculates a correlation value of the spectrum between a plurality of frames of sound information acquired by one or more microphones, a power calculation unit that calculates a power value representing the sound level of at least one target frame of the sound information, an update determination unit that uses the power value of the target frame and the correlation value between the frames including the target frame to determine an update degree indicating how much the target frame is to be reflected in the noise model recorded in a recording unit, or whether the noise model needs to be updated, and an update unit that reflects the sound information in the noise model according to the determination.
[0009]
According to the disclosure of the present specification, it is possible to prevent a speech section from being reflected in the noise model during noise estimation processing.
[0010]
FIG. 1 is a functional block diagram showing a configuration of a noise suppression device
including the noise estimation device according to the first embodiment.
FIG. 2 is a flowchart showing an operation example of the noise estimation apparatus.
FIG. 3A is a diagram showing an example of spectra of two consecutive frames in a vowel section.
FIG. 3B is a diagram showing an example of spectra of two consecutive frames in a stationary
noise interval. FIG. 4A is a diagram for explaining a modification of the calculation of the update degree. FIG. 4B is a diagram for explaining a modification of the calculation of the update degree.
FIG. 5 is a functional block diagram showing a configuration of a noise suppression device
including the noise estimation device according to the second embodiment. FIG. 6 is a flowchart
showing an operation example of the noise estimation apparatus.
[0011]
First Embodiment [Configuration Example of Noise Suppression Device 20] FIG. 1 is a functional
block diagram showing a configuration of a noise suppression device 20 including the noise
estimation device 10 according to the first embodiment. The noise suppression device 20
illustrated in FIG. 1 is a device that acquires sound information from the microphone 1 and
outputs an audio signal in which noise is suppressed. The noise suppression device 20 can be
provided, for example, in a mobile phone, a car navigation device with a voice input function, or
the like. The applications of the noise estimation device 10 and the noise suppression device 20 are not limited to the above examples; they can be provided in other apparatuses having a function of receiving a user's voice.
[0012]
The noise suppression device 20 includes a sound information acquisition unit 2, a frame
processing unit 3, a spectrum calculation unit 4, a noise estimation device 10, and a noise
suppression unit 11.
[0013]
The sound information acquisition unit 2 converts an analog signal received by the microphone 1
mounted in the housing into a digital signal.
It is preferable to apply an LPF (referred to as an anti-aliasing filter) according to the sampling
frequency to the analog sound signal before AD conversion. The sound information acquisition
unit 2 may include an AD converter.
[0014]
The frame processing unit 3 divides the digital signal into frames. Thereby, the sound waveform represented by the digital signal is cut out into a plurality of frame units along the time axis. The framing process can be, for example, a process in which a section of a predetermined sample length (referred to as the frame length) is repeatedly extracted for analysis, with the section to be analyzed advanced by a predetermined length (referred to as the frame shift length) so that successive sections overlap. As an example, the frame length can be about 20 to 30 ms, and the frame shift length can be 10 to 20 ms. A weight called an analysis window is applied to the extracted frame; as the analysis window, for example, a Hanning window or a Hamming window is often used. The framing process is not limited to a specific method, and various other methods used in speech and sound information processing can be used.
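As a rough illustration of the framing process described above, the following sketch cuts a waveform into overlapping frames and applies a Hanning window. It is only an assumed example: the function name frame_signal, the 8 kHz sampling rate, and the 32 ms / 16 ms frame and shift lengths are illustrative choices, not values prescribed by this description.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=32, shift_ms=16):
    """Cut a 1-D waveform into overlapping, Hanning-windowed frames."""
    x = np.asarray(x, dtype=float)
    frame_len = int(fs * frame_ms / 1000)   # e.g. 256 samples at 8 kHz, 32 ms
    shift_len = int(fs * shift_ms / 1000)   # frame shift (overlap = frame_len - shift_len)
    window = np.hanning(frame_len)          # analysis window (a Hamming window is also possible)
    frames = [x[s:s + frame_len] * window
              for s in range(0, len(x) - frame_len + 1, shift_len)]
    return np.array(frames)

# Example: one second of a dummy signal -> array of windowed frames
frames = frame_signal(np.random.randn(8000))
print(frames.shape)  # (number of frames, 256)
```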
[0015]
The spectrum calculation unit 4 calculates the spectrum of each frame by executing an FFT on each frame of the sound waveform. Instead of the FFT, the spectrum calculation unit 4 may use a filter bank and process, in the time domain, the waveforms of the plurality of bands obtained by the filter bank. A transform from the time domain to the frequency domain other than the FFT (such as a wavelet transform) may also be used.
[0016]
As described above, the sound information received by the microphone 1 is made available to the noise estimation device 10, through the sound information acquisition unit 2, the frame processing unit 3, and the spectrum calculation unit 4, as spectrum or waveform data for each frame (each analysis window). The noise estimation device 10 receives the spectrum or waveform data of each frame and updates the noise model recorded in the recording unit 12. Thereby, the noise model is updated according to the sound information acquired by the microphone 1.
[0017]
The noise suppression unit 11 performs noise suppression processing using a noise model. The
noise model is, for example, data representing an estimate of a noise spectrum, and more
specifically, can be an average value of a spectrum of ambient noise with a small time change.
The noise suppression unit 11 can calculate the spectrum from which the noise component is
removed by subtracting the value of the spectrum of the noise indicated by the noise model from
the value of the spectrum of each frame calculated by the spectrum calculation unit 4. It is
preferable that the noise model does not include non-stationary noise or speech information with
large time variation. By noise suppression processing using such a noise model, it is possible to
output a speech signal in which stationary noise is suppressed. The noise suppression process
using the noise model is not limited to the above example.
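As a minimal sketch of the kind of spectral subtraction described for the noise suppression unit 11, the following assumes the noise model is an average power spectrum; the flooring of negative values at zero and the function name suppress_noise are illustrative assumptions, not the exact processing of this description.

```python
import numpy as np

def suppress_noise(frame_power_spectrum, noise_model):
    """Subtract the noise spectrum estimated by the noise model from the
    frame's power spectrum, flooring negative results at zero."""
    return np.maximum(frame_power_spectrum - noise_model, 0.0)
```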
[0018]
[Configuration Example of Noise Estimation Device 10] The noise estimation device 10 includes a
spectrum change calculation unit 5, a correlation calculation unit 6, a power calculation unit 7,
an update determination unit 8, and an update unit 9.
[0019]
The spectrum change calculation unit 5 calculates the time change of the spectrum in at least a
partial section of the sound acquired by the microphone 1.
The spectrum change calculation unit 5 converts, for example, the complex spectrum of each
frame obtained by the FFT process in the spectrum calculation unit 4 into a power spectrum.
Then, the power spectrum of the previous frame is recorded, and the difference from the power
spectrum of the current frame is calculated. For example, the spectrum change calculation unit 5
calculates the difference between the power spectrum stored one frame before and the power
spectrum of the current frame. Thereby, the change of the power spectrum between frames can
be calculated.
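A small sketch of the kind of computation performed by the spectrum change calculation unit 5 follows: the complex spectrum is converted to a power spectrum and compared with the stored power spectrum of the previous frame. The mean absolute difference used as the change measure is an assumption; the description does not fix the exact distance.

```python
import numpy as np

def power_spectrum(complex_spectrum):
    """Convert a complex FFT spectrum into a power spectrum."""
    return np.abs(complex_spectrum) ** 2

def spectrum_change(prev_power, curr_power):
    """Change of the power spectrum between the previous and current frame
    (mean absolute difference per bin; an illustrative choice)."""
    return float(np.mean(np.abs(curr_power - prev_power)))
```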
[0020]
Based on the time change of the spectrum calculated by the spectrum change calculation unit 5, the update determination unit 8 can determine whether to perform an update in which the sound of the section is reflected in the noise model. For example, if the spectrum of the current frame has changed by a predetermined amount or more compared to the spectrum of the previous frame, the update determination unit 8 can determine that the information of the current frame does not need to be reflected in the noise model.
[0021]
The correlation calculation unit 6 calculates a correlation value of a spectrum between a plurality
of frames in sound information acquired by one or more microphones. The correlation value is a
value representing the degree of correlation of spectra between frames. For example, correlation
coefficients of spectra between temporally adjacent frames can be calculated as correlation
values. The present invention is not limited to the correlation coefficient between adjacent
frames, and for example, a sum or a representative value (for example, an average value) of
correlation coefficients over a plurality of frames may be calculated as the correlation value.
[0022]
The power calculator 7 calculates a power value representing the sound level of at least one
target frame. Thereby, the power value of the current frame is obtained. The power value of the
frame can be determined, for example, using the amplitude of the time-series waveform of the
sound in the frame. Specifically, the sum of squares of sample values in a frame can be calculated
as a power value. The calculation of the power value is not limited to this. The power calculation
unit 7 can also calculate the power value of the frame using, for example, the spectrum
calculated by the spectrum calculation unit 4.
[0023]
The update determination unit 8 uses the power value of the target frame and the correlation value between the frames including the target frame to determine an update degree indicating how much the target frame is to be reflected in the noise model recorded in the recording unit 12, or whether the noise model needs to be updated. The update degree can be represented by, for example, a value indicating an update speed, more specifically a time constant, but is not limited thereto. The update unit 9 reflects the sound information acquired from the microphone in the noise model according to the determination by the update determination unit 8.
[0024]
As described above, since the update determination unit 8 uses both the power value of the target frame and the correlation value between the frames including the target frame, it can appropriately determine how vowel-like the target frame is in accordance with the sound level of the frame. Therefore, the update degree, or whether an update is performed at all, can be appropriately controlled in accordance with the vowel likeness of the target frame. As a result, it is possible to prevent the sound information of a vowel section or a low-voice section from being erroneously used for updating the noise model, and to keep vowel and low-voice components out of the noise model, which is the data representing the estimated noise. In particular, when the noise model is a stationary noise model, vowel sections and low-voice sections are otherwise likely to be erroneously determined to be stationary noise sections and used to update the stationary noise model; with this configuration, reflection of such sections in the stationary noise model is effectively suppressed.
[0025]
In the above configuration, the update determination unit 8 can determine the necessity of the
noise model update by comparing the correlation value with a threshold value. The threshold
value can be determined by the power value of the target frame calculated by the power
calculator 7. Specifically, the update determination unit 8 can control the parameters of the
process of determining the necessity of updating the noise model using the correlation value
according to the value of the current frame power. Thereby, an optimal parameter setting for updating the noise model while taking vowels into account is possible in each of the low-frame-power case (quiet environment, low voice) and the high-frame-power case (noisy environment, normal utterance).
[0026]
As described above, by controlling the threshold that serves as the reference for deciding whether the noise model needs updating using an absolute quantity such as the power value of a frame, the noise model can be estimated more stably than when the control relies on an estimated value such as a stationary noise level or an SNR. That is, an appropriate noise model can be stably estimated.
[0027]
The update determination unit 8 can also determine the update degree of the noise model
according to the power value of the target frame. Specifically, the update determination unit 8
can control the update speed (for example, a time constant as an example) of the noise model in
accordance with the power value of the current frame calculated by the power calculation unit 7.
[0028]
As described above, by controlling the degree of updating using an absolute quantity such as the
power value of a frame, it is possible to estimate a stable noise model. For example, it is possible
to update the noise model with an optimal update degree in each of low frame power (quiet
environment, low voice) and high frame power (noise environment, normal utterance). As a
result, it is possible to stably estimate the noise model.
[0029]
[Operation Example of Noise Estimation Device 10] FIG. 2 is a flowchart showing an operation
example of the noise estimation device 10. The example shown in FIG. 2 is an example of a
process in which the noise estimation device 10 receives from the spectrum calculation unit 4
the spectrum in frame units of sound information received by the microphone 1 and updates the
noise model.
[0030]
First, the spectrum change calculation unit 5 calculates the difference (amount of change) in the power spectrum between the previous frame and the current frame (Op1). If the value of the power spectrum difference is less than or equal to the threshold TPOW (Yes in Op2), it is determined that the current frame is likely to be stationary noise, and the processes that update the noise model using the power spectrum of the current frame (Op3 to Op9) are executed. Note that in this determination at Op2, a voice with a small spectrum change, such as a long vowel or a low voice, may also be determined to be stationary noise. Therefore, in the subsequent processes Op3 to Op8, such speech with a small spectrum change is controlled so as not to be used for updating the noise model. On the other hand, when the value of the power spectrum difference exceeds the threshold TPOW (No in Op2), that is, when the spectrum of the current frame has changed greatly from the previous frame, it is determined that the current frame is not stationary noise, and the power spectrum of the frame is not used to update the noise model.
[0031]
In the case of Yes in Op2, the power calculation unit 7 calculates the power value of the current
frame (Op3). The power value of the current frame is a value representing the level of the input
sound. For example, the power calculation unit 7 can calculate the power value using the
waveform of the current frame cut out by the frame processing unit 3. As a specific example, the
power of the frame can be determined by the following equation (1), where the N samples in the frame are denoted x(n).
[0032]
[0033]
In the above equation, for example, if the sampling rate is 8 kHz and the frame length is 32 ms,
the value of N is 256.
The conversion to dB units facilitates adjusting the threshold used to determine whether the frame power is high or low.
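Equation (1) itself is not reproduced in this text; judging from the surrounding description (sum of squares of the N samples x(n), expressed in dB), a sketch of the frame power calculation might look as follows. Normalizing the sum by N and the small epsilon that avoids log(0) are assumptions.

```python
import numpy as np

def frame_power_db(x):
    """Frame power in dB from the N samples x(n) of one frame:
    10*log10 of the mean square (normalization by N is an assumption)."""
    x = np.asarray(x, dtype=float)
    n = len(x)                      # e.g. N = 256 at 8 kHz sampling and 32 ms frames
    return 10.0 * np.log10(np.sum(x ** 2) / n + 1e-12)
```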
[0034]
The update determination unit 8 determines whether the power value of the current frame calculated by the power calculation unit 7 is equal to or higher than the threshold Th1 (Op4). The
threshold Th1 is an example of a threshold for determining whether the current frame has low frame power or high frame power. The threshold Th1 can be recorded in advance in the recording unit 12 and can be set to, for example, 50 dBA (the frame power value when the noise level is measured with the A-weighting characteristic).
[0035]
The update determination unit 8 controls parameters of the noise model update process according to the power value of the current frame. Specifically, the threshold used to determine whether to update the noise model (vowel detection) and the parameter controlling the update degree (referred to as the time constant) are determined according to the power value of the current frame.
[0036]
Table 1 below shows an example of the parameter values used in the noise model update process. At low frame power, the power value of the current frame is smaller than the threshold Th1; at high frame power, it is equal to or larger than the threshold Th1. The threshold Th2 for the correlation coefficient is an example of a threshold for determining whether the noise model needs to be updated, by judging whether the frame is a vowel section from the correlation coefficient between the immediately preceding frame and the current frame. The time constant is an example of a value indicating the update speed of the noise model.
[0037]
[0038]
At low frame power, the correlation coefficients of both noise sections and low-voice sections tend to be small, so it is preferable to set the threshold Th2 smaller than at high frame power, as in the example of Table 1 above. Conversely, at high frame power, the correlation coefficient of a noise section is larger, so it is preferable to set the threshold larger than at low frame power. The values of the threshold Th2 and the time constant can be recorded in the recording unit 12 in advance.
[0039]
In addition, low frame power corresponds to a quiet environment in which the level of stationary noise is small. In such an environment, if a speech section is erroneously used to update the stationary noise model, the proportion of the speech component in the estimated noise model becomes large. As a result, in noise suppression using that noise model, the speech is treated as stationary noise and suppressed, and the distortion of the speech after noise suppression becomes large. Therefore, as in the example of Table 1 above, the time constant of the noise model update can be increased at low frame power to slow down the update. Then, even if speech is erroneously determined to be a stationary noise section, the proportion of the speech in the estimated noise model can be kept small, and the adverse effect of speech distortion can be suppressed. The time constant can be set based on preliminary experiments. The closer the time constant is to 1, the slower the update speed.
[0040]
In the example shown in FIG. 2, when the current frame power is determined at Op4 to be equal to or higher than the threshold Th1, that is, when the current frame is determined to be a high-frame-power section, Th2 = 0.7 and time constant = 0.9 (normal) are set (Op5). If the current frame is determined to be a low-frame-power section (No in Op4), Th2 = 0.5 and time constant = 0.999 (an update speed slower than normal) are set (Op6).
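The branch of Op4 to Op6 can be summarized in code as below, using the example values given above (Th1 around 50 dB, Th2 = 0.7 or 0.5, time constant 0.9 or 0.999). This is only an illustrative sketch; the function name and the dictionary return form are not part of this description.

```python
def select_update_parameters(frame_power_db, th1=50.0):
    """Choose the correlation threshold Th2 and the update time constant
    from the current frame power (example values from the text above)."""
    if frame_power_db >= th1:   # high frame power (Yes in Op4) -> Op5
        return {"th2": 0.7, "time_constant": 0.9}    # normal update speed
    else:                       # low frame power (No in Op4) -> Op6
        return {"th2": 0.5, "time_constant": 0.999}  # slower update
```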
[0041]
In the present embodiment, the noise model update parameters are set according to the current frame power by a branch in the processing of the update determination unit 8, but the method of controlling the noise model update is not limited to this. For example, data or a function that associates the value of the current frame power with a pair consisting of the correlation-coefficient threshold and the time constant may be recorded in the recording unit 12, and the update determination unit 8 may determine the parameters according to the current frame power by referring to this data or executing the processing of the function. Also, the evaluation of the power value of the current frame is not limited to the two stages of low frame power and high frame power as in the above example; it can be performed in three or more stages.
[0042]
Next, the correlation calculation unit 6 calculates the correlation coefficient of the spectrum between the immediately preceding frame and the current frame, and the current frame is determined to be a vowel section if the coefficient exceeds the threshold and a stationary noise section if it falls below it (Op7, Op8). The correlation coefficient can be calculated, for example, by the following equation (2).
[0043]
[0044]
In the above example, the correlation coefficient takes a value from -1 to 1. The closer the correlation coefficient is to 1, the higher the correlation; the closer it is to 0, the weaker the correlation (and a coefficient close to -1 indicates an inverse correlation). FIG. 3A is a diagram showing an example of the spectra of two consecutive frames in a vowel section, and FIG. 3B is a diagram showing an example of the spectra of two consecutive frames in a stationary noise section. In FIGS. 3A and 3B, line P represents the spectrum of the previous frame, and line C represents the spectrum of the current frame. The correlation coefficient of the spectra of the two frames shown in FIG. 3A is 0.84, and that of the two frames shown in FIG. 3B is -0.09. As noted above, in a vowel section the spectrum tends to change relatively slowly over a plurality of frames, so the spectral shapes of two consecutive frames differ little and the correlation coefficient takes a high value such as 0.84. In a stationary noise section, on the other hand, the noise arrives randomly from the surroundings, so the spectral shapes of two consecutive frames are dissimilar and the correlation coefficient is close to 0.
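Equation (2) is likewise not reproduced here. Since the text states that the coefficient ranges from -1 to 1, a Pearson-type correlation between the spectra of the previous and current frames is a reasonable reading; the sketch below is that assumed form, not necessarily the exact formula of this description.

```python
import numpy as np

def spectral_correlation(prev_spectrum, curr_spectrum):
    """Pearson-style correlation coefficient between two frame spectra
    (assumed form of equation (2); the result lies in [-1, 1])."""
    p = np.asarray(prev_spectrum, dtype=float)
    c = np.asarray(curr_spectrum, dtype=float)
    p = p - p.mean()
    c = c - c.mean()
    denom = np.sqrt(np.sum(p ** 2) * np.sum(c ** 2))
    return float(np.sum(p * c) / denom) if denom > 0 else 0.0
```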
[0045]
In the present embodiment, the correlation between the previous frame and the current frame is used; however, when the frame shift length is short (for example, 5 ms or 10 ms), the correlation coefficient with the frame two frames earlier also becomes large in a vowel section, so a correlation coefficient with a frame two frames in the past can also be used for vowel detection. Thus, the frames used to calculate the correlation coefficient are not limited to the current frame and the immediately preceding frame.
[0046]
If the correlation coefficient is smaller than Th2 (Yes in Op8), the update determination unit 8 determines that the current frame is a noise section and decides to update the noise model using the current frame. If the correlation coefficient is Th2 or more (No in Op8), it decides not to update the noise model. That is, the update determination unit 8 compares the correlation coefficient between the spectra of the current frame and the previous frame, calculated at Op7, with the threshold Th2; if the correlation coefficient falls below the threshold Th2, the frame can be judged to be a stationary noise section, and if it exceeds the threshold, it can be judged to be a vowel section. The correlation coefficient can also be calculated by the above equation for each of a plurality of frequency bands and compared with the threshold Th2 band by band; a separate threshold may be provided for each frequency band. In that case, the noise model can be updated according to the set time constant only for the frequency bands determined to be stationary noise.
[0047]
In the case of Yes in Op8, the update unit 9 updates the noise model with the time constant determined in Op5 or Op6, using the spectrum of the frame determined to be a stationary noise section. For example, with a time constant α and the power spectrum value S(ω) of the current frame, the noise model model(ω) can be updated at each frequency ω using the following equation (3). This process corresponds to averaging the noise model over time.
[0048]
[0049]
The update process for the noise model is not restricted to the process using equation (3) above. For example, the time constant α may take a value α(ω) set for each frequency. Further, in the above process the noise model is not updated when the correlation coefficient exceeds the threshold Th2 (the frame being treated as a vowel section), but alternatively the processing of the update unit 9 may be executed with the time constant set to 1.0 (a value at which no substantial update is performed) when the correlation coefficient exceeds the threshold.
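Equation (3) is also missing from this text. Given that a time constant α closer to 1 slows the update and that α = 1.0 performs no substantial update, an exponential-averaging form such as model(ω) ← α·model(ω) + (1−α)·S(ω) is the natural reading; the sketch below states that assumption explicitly.

```python
import numpy as np

def update_noise_model(model, frame_power_spectrum, alpha):
    """Assumed form of equation (3): per-frequency exponential averaging.
    alpha = 1.0 leaves the model unchanged; alpha close to 1 updates slowly."""
    return alpha * np.asarray(model) + (1.0 - alpha) * np.asarray(frame_power_spectrum)
```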
[0050]
The above-described processing of Op1 to Op9 is repeated until all frames are completed (until it
is determined as Yes in Op10). That is, the processing of Op1 to Op9 is sequentially performed
for each frame aligned on the time axis.
[0051]
As described above, in the operation example illustrated in FIG. 2, the threshold used to decide, from the correlation coefficient, whether to update the noise model, and the update speed of the noise model, are both controlled according to the value of the current frame power calculated at Op3. Thereby, the influence of vowel sections on the noise model can be suppressed. In this operation example, vowel detection based on the correlation coefficient of the spectrum is not simply used for noise model estimation; the threshold for the update-necessity decision and the time constant of the noise model update are also switched according to the current frame power. This is based on the finding that the optimal threshold and the optimal update degree of the noise model differ depending on the value of the current frame power.
[0052]
If the threshold and the noise model update process were switched using the estimated value of the noise model itself or the SNR (the difference between the input sound and the noise model), the noise would be estimated using an estimated value, so stable operation could not be guaranteed. On the other hand, by using the absolute quantity of the current frame power as in the above embodiment, stable noise estimation processing that does not depend on the result of the estimation processing becomes possible.
[0053]
Modified Example: FIGS. 4A and 4B are diagrams for explaining a modified example of the calculation of the update degree by the update determination unit 8. FIG. 4A shows an example of the relationship between the correlation coefficient and the time constant at low frame power, and FIG. 4B shows an example of that relationship at high frame power. In the examples shown in FIGS. 4A and 4B, thresholds for the correlation coefficient are set at two points (Th2-1 and Th2-2). When the correlation coefficient is equal to or higher than the upper threshold Th2-2, the update determination unit 8 sets the update time constant to 1.0 and stops the update of the noise model. If the correlation coefficient is equal to or lower than the lower threshold Th2-1, the time constant is set to 0.999. When the correlation coefficient lies between the lower threshold Th2-1 and the upper threshold Th2-2, the update determination unit 8 determines the time constant so that it increases continuously with the value of the correlation coefficient. In this way, a gray zone can be provided between the two thresholds Th2-1 and Th2-2.
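A sketch of the gray-zone behaviour of FIGS. 4A and 4B follows: below the lower threshold the time constant is 0.999, at or above the upper threshold it is 1.0 (no update), and in between it increases with the correlation coefficient. The linear interpolation and the default threshold values are assumptions made for illustration.

```python
def time_constant_from_correlation(corr, th2_1=0.5, th2_2=0.7,
                                   tc_low=0.999, tc_high=1.0):
    """Map the correlation coefficient continuously to a time constant
    between the two thresholds Th2-1 and Th2-2 (linear interpolation assumed)."""
    if corr <= th2_1:
        return tc_low
    if corr >= th2_2:
        return tc_high   # 1.0: the noise model is not updated
    ratio = (corr - th2_1) / (th2_2 - th2_1)
    return tc_low + ratio * (tc_high - tc_low)
```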
[0054]
Further, once the correlation coefficient has entered the region in which no update is performed, the update determination unit 8 can, for example, forcibly set the time constant to 1.0 for the subsequent six frames even if the value of the correlation coefficient falls below Th2-2 during those frames. As a result, when the update determination unit 8 determines that a noise model update is unnecessary, the update unit 9 can be prevented from updating the noise model for frames within a predetermined time of the target frame.
[0055]
That is, when the update determination unit 8 determines from the correlation coefficient that the current frame is a voice section, it can force the noise model not to be updated over several frames following the current frame. As a result, speech sections in which vowel likeness hardly appears, such as transitions between phonemes and consonant sections, can also be kept from being used for updating the noise model. By providing such so-called guard frames, it is suppressed that crossover between different vowels, which tends to lower the value of the correlation coefficient, or a consonant is erroneously used as a stationary noise section for updating the noise model.
[0056]
Second Embodiment FIG. 5 is a functional block diagram showing a configuration of a noise
suppression device 20a including a noise estimation device 10a according to a second
embodiment. In FIG. 5, the same blocks as in FIG. 1 are given the same reference numerals. The
noise suppression device 20a shown in FIG. 5 processes the sound information received by the two microphones 1a and 1b.
[0057]
Although the form of the microphones 1a and 1b is not limited to a specific one, the case where they constitute a microphone array with one microphone on the front and one on the back of a mobile phone is described here as an example. The sound information acquisition unit 2 receives the analog signals received by the two microphones 1a and 1b. The analog signals of the two microphones 1a and 1b are each passed through the anti-aliasing filter and then converted to digital signals. The frame processing unit 3 and the spectrum calculation unit 4 perform the framing process and the spectrum calculation process on each of the digital signals from the two microphones 1a and 1b, as in the first embodiment.
[0058]
[Configuration Example of Noise Suppression Device 20a] The noise estimation device 10a further includes a level difference calculation unit 13 that calculates the level difference between the microphones from the sound information acquired by the two microphones 1a and 1b. The level difference calculation unit 13 receives, for example, the spectrum of each channel of the microphones 1a and 1b from the spectrum calculation unit 4, and calculates the power spectrum of each frame for each channel. Thus, the sound level can be calculated for each frame for each channel of the microphones 1a and 1b. By calculating the difference between the sound level of the channel of the microphone 1a and that of the channel of the microphone 1b for each frame and each frequency, the level difference between the microphone channels can be obtained for each frame and each frequency. Alternatively, instead of per-frequency values, the sound level of the entire band (for example, 0 to 4 kHz in the case of 8 kHz sampling) can be calculated from the waveform signal of each channel of the microphones 1a and 1b. In this case, the sound level of the frame can be calculated in the same manner as the power value of the current frame is calculated by the power calculation unit 7 in the first embodiment.
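A short sketch of the per-frequency level difference computed by the level difference calculation unit 13 follows, using the power spectra of the two microphone channels of one frame. Expressing the difference in dB and the small epsilon are assumptions; the description only speaks of the difference between the sound levels.

```python
import numpy as np

def level_difference(spec_a, spec_b, eps=1e-12):
    """Per-frequency level difference (in dB, an assumed unit) between the
    channels of microphones 1a and 1b for one frame."""
    power_a = np.abs(spec_a) ** 2
    power_b = np.abs(spec_b) ** 2
    return 10.0 * np.log10((power_a + eps) / (power_b + eps))
```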
[0059]
The update determination unit 8a further uses the level difference calculated by the level difference calculation unit 13 to determine the update degree of the noise model or whether an update is necessary. With this configuration, the update determination unit 8a can judge, from the level difference between the sounds received by the two microphones, how likely it is that the sound is speech uttered in the vicinity of the microphones, and the update speed of the noise model can be controlled accordingly. Specifically, the update determination unit 8a determines a section in which the level difference between the two microphones is larger than a threshold to be a section of speech uttered near the microphones, and the time constant indicating the update degree of the noise model can be appropriately controlled according to that determination. Therefore, it is possible to prevent speech components from being included in the noise model.
[0060]
Furthermore, the noise estimation device 10a includes a phase difference calculation unit 14 that calculates the phase difference between the microphones from the sound information acquired by the two microphones 1a and 1b. The phase difference calculation unit 14 receives the complex spectrum of each channel of the microphones 1a and 1b from the spectrum calculation unit 4, and calculates, for each frequency, the phase difference between the complex spectrum of the channel of the microphone 1a and that of the channel of the microphone 1b. Thereby, the phase difference spectrum between the channels of the microphones 1a and 1b can be calculated. From the phase difference spectrum, for example, the arrival direction of the sound (the direction of the sound source) can be determined for each frequency.
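A minimal sketch of the phase difference spectrum computed by the phase difference calculation unit 14 follows; it takes the per-frequency phase difference between the complex spectra of the two channels. Wrapping to (-pi, pi] via the product with the complex conjugate is one common formulation and is used here as an assumption.

```python
import numpy as np

def phase_difference_spectrum(spec_a, spec_b):
    """Per-frequency phase difference between the channels of microphones
    1a and 1b, wrapped to (-pi, pi]."""
    return np.angle(np.asarray(spec_a) * np.conj(np.asarray(spec_b)))
```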
[0061]
The update determination unit 8a further uses the phase difference calculated by the phase difference calculation unit 14 to determine the update degree of the noise model or whether an update is necessary. For example, the update determination unit 8a can evaluate, from the phase difference, how likely the sound is to be speech coming from the direction of the user's mouth, and can control the update speed of the noise model based on that likelihood. In this way, the time constant of the noise model update can be appropriately controlled based on the speech likelihood obtained from the phase difference between the two microphones. As a result, it is possible to prevent the noise model from containing a speech component uttered from the direction of the user's mouth.
[0062]
In the example shown in FIG. 5, the level difference calculation unit 13 and the phase difference calculation unit 14 each receive the spectra of the channels of both the microphone 1a and the microphone 1b. On the other hand, the power calculation unit 7, the spectrum change calculation unit 5, the correlation calculation unit 6, and the noise suppression unit 11 can receive and process the spectrum of either one of the channels of the microphone 1a and the microphone 1b. For example, in the case of a mobile phone, of the microphones 1a and 1b, only the channel signal of the microphone provided on the front of the mobile phone may be used by the power calculation unit 7, the spectrum change calculation unit 5, the correlation calculation unit 6, and the noise suppression unit 11.
[0063]
In the example shown in FIG. 5, the noise estimation device 10a includes both the level difference calculation unit 13 and the phase difference calculation unit 14, but it may be configured to include at least one of them. In addition, the update determination unit 8a can be configured to switch, according to the power value calculated by the power calculation unit 7, whether the level difference and/or the phase difference are additionally used to determine the update degree or the necessity of an update. Thereby, for example, whether information on the likelihood of speech uttered nearby and of speech uttered from the direction of the user's mouth is used to control the update degree of the noise model can be switched according to the current frame power value. As a result, the noise model can be updated with optimal settings in each of the low-frame-power case (quiet environment, low voice) and the high-frame-power case (noisy environment, normal utterance), and the noise model can be estimated stably.
[0064]
[Operation Example of Noise Estimation Device 10a] FIG. 6 is a flowchart showing an operation example of the noise estimation device 10a. In FIG. 6, the same processes as those shown in FIG. 2 are given the same reference signs. The operation shown in FIG. 6 is obtained by adding user speech detection processing (Op41 to Op44), performed at high frame power (Yes in Op4), to the operation of the first embodiment shown in FIG. 2.
[0065]
In the example shown in FIG. 6, when the current frame power is equal to or higher than the threshold Th1, the level difference calculation unit 13 calculates the level difference of the sound between the microphones (Op41), and the update determination unit 8a determines whether the current frame is a voice section using the information on the level difference between the two microphones (Op42).
[0066]
For example, when the user speaks in the vicinity of the microphones, a level difference arises between the microphone closer to the mouth and the microphone farther from it. Using this, at Op42, if there is such a level difference between the two microphones, the update determination unit 8a treats the frame as nearby speech and does not use its spectrum for updating the noise model.
[0067]
Specifically, when the level difference between the sound level of the current frame of the channel of the microphone 1a and that of the channel of the microphone 1b is larger than Th3 and smaller than Th4 (Yes in Op42), the update determination unit 8a can determine that the current frame is not a speech section. In the case of No in Op42, the current frame is determined to be a speech section, and it is decided not to update the noise model for the current frame. Here, two thresholds Th3 and Th4 (Th3 < Th4) are provided. For example, Th3 can be a threshold for determining whether the section is speech uttered in the vicinity of the front microphone, and Th4 can be a threshold for determining whether the section is speech uttered in the vicinity of the rear microphone.
[0068]
In the case of Yes in Op42, the phase difference calculation unit 14 calculates the phase difference between the microphones (Op43), and the update determination unit 8a uses the information on the phase difference between the two microphones to determine how speech-like the current frame is (Op44).
[0069]
By the operations of Op43 and Op44, when the arrival direction of the sound estimated from the phase difference between the channels of the microphones 1a and 1b is the direction of the user's mouth, the spectrum of the frame in that section is treated as user speech and is not used for updating the noise model. Specifically, if the average phase difference between the channels of the microphones 1a and 1b in the section including the current frame is larger than the threshold Th5 (Yes in Op44), the current frame is regarded as possibly being a noise section and it is determined that the noise model is to be updated (Op5 and onward). In the case of No in Op44, the current frame is determined to be a voice section, and it is decided not to update the noise model for the current frame. Th5 can be, for example, a threshold for detecting an utterance from in front of the user.
[0070]
In the example shown in FIG. 6, at low frame power (No in Op4), control is performed so that the user voice detection processing based on the level difference and phase difference between the two microphones (Op41 to Op44) is not executed. Since a user voice at low frame power is a low voice, the SNR is poor and the level difference and phase difference are easily disturbed; this control therefore avoids the situation in which the user voice cannot be detected stably.
[0071]
Further, in the example shown in FIG. 6, the level difference calculated by the level difference
calculating unit 13 and the phase difference spectrum calculated by the phase difference
calculating unit 14 are obtained for each frequency. Therefore, it is possible to determine
whether to update the noise model for each frequency by comparing with the thresholds Th3,
Th4 and Th5 for each frequency.
[0072]
As described above, according to the present embodiment, a voice section can be determined using the phase difference, obtained from the sound information of the two microphones, which indicates the direction of the user's mouth, and the level difference, which reflects the distance between the microphones and the mouth. As a result, the use of user speech components for updating the noise model can be suppressed. The number of microphones is not limited to two; even in a configuration with three or more microphones, the sound level differences and phase differences between the microphones can be calculated in the same way and used to control the updating of the noise model.
[0073]
[Computer Configuration, Others] The noise suppression devices 20 and 20a and the noise estimation devices 10 and 10a in the first and second embodiments can be embodied using a computer. A computer constituting the noise suppression devices 20 and 20a and the noise estimation devices 10 and 10a includes at least a processor, such as a CPU or a DSP (Digital Signal Processor), and memory, such as ROM and RAM. The respective functions of the sound information acquisition unit 2, the frame processing unit 3, the spectrum calculation unit 4, the noise estimation device 10, the noise suppression unit 11, the spectrum change calculation unit 5, the correlation calculation unit 6, the power calculation unit 7, the update determination units 8 and 8a, the update unit 9, the level difference calculation unit 13, and the phase difference calculation unit 14 can be realized by the CPU executing a program recorded in the memory. Each of the above functions can also be realized by one or more DSPs into which programs and various data are incorporated. The recording unit 12 can be realized by a memory accessible by the noise suppression devices 20 and 20a.
[0074]
A computer-readable program for causing a computer to execute each of these functions, and a recording medium recording the program, are also included in the embodiments of the present invention. This recording medium is non-transitory and does not include transitory media such as the signal itself.
[0075]
Note that electronic devices such as mobile phones and car navigation systems in which the
noise suppression devices 20 and 20a and the noise estimation devices 10 and 10a are
incorporated are also included in the embodiments of the present invention.
[0076]
According to the first and second embodiments, vowel sections and low-voice sections, which are difficult to discriminate by a method using only the time change of the spectrum, can be discriminated and kept from being used for updating the noise model. Therefore, distortion of the processed speech in the noise suppression processing using the noise model can be suppressed.
[0077]
Reference Signs List
1 microphone
2 sound information acquisition unit
3 frame processing unit
4 spectrum calculation unit
10 noise estimation device
11 noise suppression unit
12 recording unit
13 level difference calculation unit
14 phase difference calculation unit
20 noise suppression device