JP2017067990

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2017067990
PROBLEM TO BE SOLVED: To provide a voice processing device capable of executing a flooring
process that suppresses disturbing voice without excess or deficiency. SOLUTION: A voice
processing apparatus according to the present invention includes: a front suppression signal
generation unit that acquires frequency domain input signals converted from a plurality of
microphones and generates, based on differences between the frequency domain input signals of
the microphones, a front suppression signal having a blind spot in the front direction; a
coherence calculation unit that calculates a coherence and a coherence filter coefficient; a
feature amount calculation unit that calculates a feature quantity representing the relationship
between the front suppression signal and the coherence; a flooring threshold calculation unit
that calculates a flooring threshold; and a filter processing unit that applies the calculated
flooring threshold to the coherence filter coefficient to perform a flooring process and then
suppresses the disturbing voice to obtain a post-suppression signal. [Selected figure] Figure
1
Voice processing apparatus, program and method
[0001]
The present invention relates to a voice processing apparatus, a program, and a method, and can
be applied, for example, to the suppression of non-target sound (for example, disturbing voice)
other than the target sound in voice processing for telephones, video conferences, and voice
recognition processing.
[0002]
03-05-2019
1
In recent years, devices supporting various voice processing functions, such as the voice
communication functions of smartphones and car navigation systems and voice recognition
functions (hereinafter, these devices are collectively referred to as "voice processing devices"),
have become widespread.
However, with the spread of these voice processing devices, voice processing devices have come
to be used under harsher noise environments than before, such as in a crowded town or in a
moving car. Therefore, there is an increasing demand for a speech processing apparatus that can
maintain speech quality and speech recognition performance even in a noisy environment.
[0003]
In the conventional voice processing apparatus, when target sound is extracted and acquired,
processing for suppressing non-target sound other than the target sound is performed.
[0004]
Japanese Patent Application Laid-Open No. 2015-26956
[0005]
Components usually contained in non-target sound include, for example, background noise (crowd
noise in a town, running noise of a car, etc.) and disturbing voice (the speech of persons other
than the user).
Heretofore, various effective suppression methods have been proposed on the premise that the
background noise has stationary frequency characteristics and power.
Disturbing voice, on the other hand, is a signal whose power and frequency characteristics are
non-stationary, and, like the target voice (the voice of the user of the voice processing
function), it is a human voice. Therefore, in a conventional speech processing apparatus that
attempts to detect disturbing voice, it is difficult to determine its presence or absence from a
difference in behavior relative to the target voice, as can be done for background noise. For
this reason, when a conventional voice processing apparatus attempts to suppress disturbing
voice, suppression processing is performed regardless of whether disturbing voice is present, so
that either excessive suppression causes noticeable distortion of the sound quality, or
insufficient suppression leaves residual components of the disturbing voice; in either case the
speech quality and the speech recognition performance fail to reach a predetermined level.
[0006]
Here, in view of the above problems (such as excessive suppression processing), a conventional
speech processing device compares a suppression coefficient calculated from the input signal with
a predetermined threshold and, when the suppression coefficient is smaller than the threshold,
uses the threshold as the suppression coefficient instead of the calculated coefficient; this is
known as a flooring process (see Patent Document 1).
[0007]
However, if the threshold for flooring processing (hereinafter referred to as flooring threshold) is
too large, the suppression performance will be insufficient, and if it is too small, distortion of the
target voice will increase (that is, it will affect the sound quality).
[0008]
Therefore, there is a need for a voice processing apparatus, program, and method capable of
executing a flooring process that can suppress disturbing voice without excess or deficiency
without affecting the sound quality.
[0009]
A voice processing apparatus according to a first aspect of the present invention includes: (1) a
front suppression signal generation unit that acquires frequency domain input signals obtained by
converting input signals from a plurality of microphones from the time domain to the frequency
domain, and generates a front suppression signal having a blind spot in the front direction based
on differences between the acquired frequency domain input signals of the microphones; (2) a
coherence calculation unit that calculates a coherence and coherence filter coefficients from the
input signals obtained from the plurality of microphones; (3) a feature amount calculation unit
that calculates a feature quantity representing the relationship between the front suppression
signal and the coherence; (4) a flooring threshold calculation unit that calculates a flooring
threshold using the feature quantity; and (5) a filter processing unit that applies the flooring
threshold calculated by the flooring threshold calculation unit to the coherence filter
coefficient, performs the flooring process, and then suppresses the disturbing sound contained in
the input signal to acquire a post-suppression signal.
[0010]
A voice processing program according to a second aspect of the present invention causes a
computer to function as: (1) a front suppression signal generation unit that acquires frequency
domain input signals obtained by converting input signals from a plurality of microphones from
the time domain to the frequency domain, and generates a front suppression signal having a blind
spot in the front direction based on differences between the frequency domain input signals; (2)
a coherence calculation unit that calculates a coherence and coherence filter coefficients from
the input signals obtained from the plurality of microphones; (3) a feature amount calculation
unit that calculates a feature quantity representing the relationship between the front
suppression signal and the coherence; (4) a flooring threshold calculation unit that calculates a
flooring threshold using the feature quantity; and (5) a filter processing unit that applies the
flooring threshold calculated by the flooring threshold calculation unit to the coherence filter
coefficient, performs the flooring process, and then suppresses the disturbing sound contained in
the input signal to acquire a post-suppression signal.
[0011]
According to a third aspect of the present invention, there is provided a voice processing method
for suppressing disturbing voice from input signals obtained from a plurality of microphones,
using a front suppression signal generation unit, a coherence calculation unit, a feature amount
calculation unit, a flooring threshold calculation unit, and a filter processing unit, wherein:
(1) the front suppression signal generation unit acquires frequency domain input signals obtained
by converting the input signals from the plurality of microphones from the time domain to the
frequency domain, and generates a front suppression signal having a blind spot in the front
direction based on differences between the acquired frequency domain input signals of the
microphones; (2) the coherence calculation unit calculates the coherence and the coherence filter
coefficient from the input signals obtained from the plurality of microphones; (3) the feature
amount calculation unit calculates a feature quantity representing the relationship between the
front suppression signal and the coherence; (4) the flooring threshold calculation unit
calculates a flooring threshold using the feature quantity; and (5) the filter processing unit
applies the flooring threshold calculated by the flooring threshold calculation unit to the
coherence filter coefficient, performs the flooring process, and then suppresses the disturbing
voice contained in the input signal to obtain a post-suppression signal.
[0012]
According to the present invention, it is possible to provide a voice processing apparatus,
program and method capable of executing a flooring process that can suppress disturbing voice
without excess or deficiency without affecting the sound quality.
[0013]
FIG. 1 is a block diagram showing the functional configuration of the speech processing apparatus
according to the embodiment.
FIG. 2 is an explanatory view showing an example of the arrangement of the microphones according
to the embodiment.
FIG. 3 is a diagram (part 1) showing the characteristics of the directional signals used in the
speech processing apparatus according to the embodiment.
FIG. 4 is a diagram (part 2) showing the characteristics of the directional signals used in the
speech processing apparatus according to the embodiment.
FIG. 5 is a flowchart (part 1) showing an example of the operation of the speech processing
apparatus according to the embodiment.
FIG. 6 is a flowchart (part 2) showing an example of the operation of the speech processing
apparatus according to the embodiment.
[0014]
(A) Main Embodiment Hereinafter, one embodiment of a speech processing apparatus, program
and method according to the present invention will be described in detail with reference to the
drawings.
[0015]
(A-1) Configuration of Embodiment FIG. 1 is a block diagram showing the overall configuration of
a speech processing apparatus 1 of this embodiment.
[0016]
The voice processing device 1 obtains input signals s1(n) and s2(n) from the pair of
microphones m_1 and m_2 via an AD converter (not shown).
Here, n is an index representing the order of sample input, and is a positive integer.
In the text, a smaller n denotes an older input sample and a larger n a newer input sample.
[0017]
The voice processing device 1 performs processing for suppressing unintended sound (for
example, disturbing voice) included in the input signals captured by the microphones m_1 and m_2.
The output format of the audio signal output from the audio processing device 1 is not limited,
and may be output as digital audio data of an arbitrary format, or may be output as an analog
audio signal.
In this embodiment, the audio processing device 1 will be described as outputting digital audio
data in, for example, a PCM (Pulse-code modulation) format in units of frames. The voice
processing device 1 is used, for example, for preprocessing (for example, suppression processing
of disturbing voice) of a voice signal used in a communication device such as a television
conference system or a mobile telephone terminal or in a voice recognition function.
[0018]
FIG. 2 is an explanatory view showing an example of the arrangement of the microphones m_1
and m_2.
[0019]
As shown in FIG. 2, in this embodiment the microphones m_1 and m_2 are arranged so that the plane
containing the two microphones is perpendicular to the direction from which the target sound
arrives (the direction of the sound source of the target sound).
In the following, as shown in FIG. 2, the arrival direction of the target sound as viewed from
the position between the two microphones m_1 and m_2 is referred to as the forward or front
direction. Likewise, in the following, as shown in FIG. 2, the terms rightward, leftward, and
backward denote the corresponding directions when the arrival direction of the target sound is
viewed from the position between the two microphones m_1 and m_2. In this embodiment, it is
assumed that the target sound arrives from the front of the microphones m_1 and m_2 and that
non-target sound, including disturbing voice, arrives from the lateral (left-right) direction.
[0020]
The speech processing apparatus 1 includes an FFT unit 10, a front suppression signal
generation unit 20, a coherence calculation unit 30, a correlation calculation unit 40, a
coherence filter processing unit 50, and an IFFT unit 60.
[0021]
The voice processing device 1 may be realized by installing a program (a program including the
voice processing program according to the embodiment) on a computer having a processor, a memory,
and the like; even in that case, its functional configuration can still be represented as in
FIG. 1.
Note that part or all of the speech processing apparatus 1 may instead be realized as hardware.
[0022]
The FFT unit 10 receives the input signal series s1 and s2 from the microphones m_1 and m_2 and
performs a fast Fourier transform (or discrete Fourier transform) on the input signals s1 and s2,
so that the input signals s1 and s2 are represented in the frequency domain.
When performing the fast Fourier transform, the FFT unit 10 constructs an analysis frame
FRAME1(K) composed of a predetermined number N (N is an arbitrary integer) of samples from the
input signal s1(n), and likewise an analysis frame FRAME2(K) from the input signal s2(n). An
example of constructing FRAME1 from the input signal s1 is shown in equation (1) below. In
equation (1), K is an index representing the order of frames and is a positive integer. In the
following, a smaller K denotes an older analysis frame and a larger K a newer analysis frame.
Also, in the following description of the operation, unless otherwise noted, the index of the
latest analysis frame under analysis is K. FRAME1(1) = {s1(1), s1(2), ..., s1(i), ..., s1(N)}
FRAME1(K) = {s1(N×K+1), s1(N×K+2), ..., s1(N×K+i), ..., s1(N×K+N)} ...(1)
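The framing of equation (1) can be sketched as follows. This is a minimal illustration; the function name `make_frames`, the 0-based indexing, and the absence of windowing or frame overlap are assumptions, since the patent does not specify them:

```python
import numpy as np

def make_frames(s, N):
    """Split the input signal into consecutive analysis frames of N samples,
    in the spirit of equation (1); trailing samples that do not fill a
    whole frame are discarded."""
    num_frames = len(s) // N
    return np.reshape(s[:num_frames * N], (num_frames, N))

s1 = np.arange(1, 11)        # ten input samples s1(1)..s1(10)
frames = make_frames(s1, 4)  # two complete frames of N = 4 samples
```

Each row of `frames` then corresponds to one analysis frame FRAME1(K).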
[0023]
The FFT unit 10 performs fast Fourier transform processing for each analysis frame,
Fourier-transforming the analysis frame FRAME1(K) formed from the input signal s1 into the
frequency domain signal X1(f,K), and likewise obtaining the frequency domain signal X2(f,K) by
Fourier-transforming the analysis frame FRAME2(K) formed from the input signal s2. Here, f is an
index representing frequency. Note that (f,K) does not denote a single value; as in equation (2)
below, it consists of m (m is an arbitrary integer) components (spectral components) at the
frequencies f1 to fm.
[0024]
The FFT unit 10 supplies the frequency domain signals X 1 (f, K) and X 2 (f, K) to the front
suppression signal generation unit 20 and the coherence calculation unit 30.
[0025]
Note that X1(f,K) is a complex number composed of a real part and an imaginary part.
The same applies to X2(f,K) and to N(f,K), which is described later in connection with the front
suppression signal generation unit 20. X1(f,K) = {X1(f1,K), X1(f2,K), ..., X1(fi,K), ...,
X1(fm,K)} ...(2)
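The conversion of one analysis frame into the complex spectral components of equation (2) can be sketched as below. Using `numpy.fft.rfft`, which keeps only the non-redundant half of a real signal's spectrum, is an implementation assumption rather than something the patent specifies:

```python
import numpy as np

def to_frequency_domain(frame):
    """Transform one analysis frame into its spectral components
    X(f1, K)..X(fm, K); each component is a complex number with a
    real part and an imaginary part."""
    return np.fft.rfft(frame)

frame = np.array([1.0, 0.0, -1.0, 0.0])  # a tiny 4-sample example frame
X = to_frequency_domain(frame)           # m = 3 complex spectral components
```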
[0026]
Next, the front suppression signal generation unit 20 will be described.
[0027]
The front suppression signal generation unit 20 performs processing for suppressing the signal
component in the front direction for each frequency component with respect to the signal
supplied from the FFT unit 10.
In other words, the front suppression signal generation unit 20 functions as a directional filter
that suppresses components in the front direction.
[0028]
For example, as illustrated in FIG. 3, the front suppression signal generation unit 20 forms a
figure-8 bidirectional filter having a blind spot in the front direction, thereby suppressing the
front-direction component of the signals supplied from the FFT unit 10.
[0029]
Specifically, the front suppression signal generation unit 20 performs the calculation of the
following equation (3) on the signals X1(f,K) and X2(f,K) supplied from the FFT unit 10 to
generate the front suppression signal N(f,K) for each frequency component.
The calculation of equation (3) corresponds to the process of forming the figure-8 bidirectional
filter with a blind spot in the front direction shown in FIG. 3 described above.
N(f,K) = X1(f,K) - X2(f,K) ...(3)
[0030]
Then, the front suppression signal generation unit 20 calculates the average front suppression
signal AVE_N(K), obtained by averaging N(f,K) over all frequencies, using the following equation
(4).
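The front suppression of equation (3) and the frequency averaging of equation (4) can be sketched together as below. Since the patent does not reproduce equation (4), averaging the magnitude |N(f, K)| over the bins is an assumption:

```python
import numpy as np

def front_suppression(X1, X2):
    """Equation (3): the per-bin difference of the two microphone spectra
    forms a figure-8 bidirectional filter with a blind spot toward the
    front, so a front-arriving component (equal at both microphones)
    cancels."""
    return X1 - X2

def average_front_suppression(N_fK):
    """AVE_N(K): the front suppression signal averaged over all frequency
    bins; taking the magnitude first is an assumption."""
    return float(np.mean(np.abs(N_fK)))

# A front-arriving source reaches both microphones with the same spectrum,
# so the subtraction cancels it entirely.
X_front = np.array([1 + 1j, 2 - 1j, 0.5 + 0j])
N = front_suppression(X_front, X_front)
ave_n = average_front_suppression(N)
```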
[0031]
Next, the process of the coherence calculator 30 will be described.
[0032]
For the frequency domain signals X1(f,K) and X2(f,K), the coherence calculation unit 30
calculates the coherence COH(K) and the coherence filter coefficient coef(f,K) based on a signal
processed by a filter having strong directivity in the left direction (first direction) as shown
in FIG. 4A (hereinafter referred to as the directivity signal B1(f)) and a signal processed by a
unidirectional filter having strong directivity in the right direction (second direction) as
shown in FIG. 4B (hereinafter referred to as the directivity signal B2(f)).
[0033]
The coefficient coef(f,K) represents the coherence at an arbitrary frequency f (one of the
frequencies f1 to fm) constituting the frame with index K (analysis frames FRAME1(K) and
FRAME2(K)), that is, the coherence between the directivity signal B1(f) and the directivity
signal B2(f).
[0034]
In obtaining COH(K) and coef(f,K), the directivities of the directivity signals B1(f) and B2(f)
may point in any direction other than the front direction (however, B1(f) and B2(f) must point in
mutually different directions).
[0035]
The specific calculation process (for example, the calculation formula) for the coherence COH(K)
is not limited; for example, a process similar to that of Japanese Patent Application Laid-Open
No. 2013-182044 (hereinafter referred to as Reference 1), such as the calculation of equations
(3) to (7) described in Reference 1, can be applied, so the details are omitted.
Likewise, the calculation of equations (3) to (6) described in Reference 1 can be applied to the
coherence filter coefficient coef(f,K), for example, so the details are omitted.
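Because the exact formulas are deferred to Reference 1 and not reproduced here, the sketch below uses one plausible stand-in definition: a per-bin coefficient that compares the cross spectrum of the two directivity signals with the arithmetic mean of their bin powers, and a frame coherence taken as its average over the bins. Both formulas are assumptions for illustration only, not the patent's definitions:

```python
import numpy as np

def coherence_filter_coef(B1, B2, eps=1e-12):
    """Stand-in for coef(f, K): |B1(f) conj(B2(f))| normalized by the
    arithmetic mean of the two bin powers. Equals 1 when the two
    directivity signals coincide; smaller when they differ."""
    cross = np.abs(B1 * np.conj(B2))
    power = 0.5 * (np.abs(B1) ** 2 + np.abs(B2) ** 2)
    return cross / np.maximum(power, eps)

def coherence(coef_fK):
    """Stand-in for COH(K): the average of coef(f, K) over all bins."""
    return float(np.mean(coef_fK))

B1 = np.array([1 + 1j, 2 + 0j, 0 + 1j])
coef = coherence_filter_coef(B1, B1)  # identical directivity signals
coh = coherence(coef)                 # every bin gives coef(f) = 1
```

With identical inputs the coefficient is 1 at every bin, matching the intuition that high coherence means strongly correlated directivity signals.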
[0036]
As described above, the coherence calculation unit 30 supplies the calculated coherence COH (K)
to the correlation calculation unit 40 and the coherence filter coefficient coef (f, K) to the
coherence filter processing unit 50.
[0037]
The correlation calculation unit 40 calculates a correlation coefficient cor(K) that can be used
to determine the presence or absence of unintended sound, using the front suppression signal
N(f,K) (the average front suppression signal AVE_N(K)), which has directivity other than the
front, and the coherence COH(K).
[0038]
Here, it is assumed that the target sound arrives from the front of the microphones m_1 and m_2
and that non-target sound, including disturbing voice, arrives from the lateral (left-right)
direction.
For example, when the microphones m_1 and m_2 are applied to the microphone portion of the
receiver of a telephone terminal (for example, a portable telephone terminal), the voice of the
speaker (user), which is the target sound, arrives from the front of the microphones m_1 and m_2,
while the voices of persons other than the speaker arrive from the left-right (lateral)
direction.
[0039]
Therefore, for example, when no disturbing voice is present and the target sound is present, the
average front suppression signal AVE_N(K) of the front suppression signal N(f,K) takes a value
proportional to the magnitude of the target sound component.
As shown in FIG. 2, the directivity characteristic at the time of generation of the average front
suppression signal AVE_N(K) (front suppression signal N(f,K)) does include signal components
arriving from the front direction even when no disturbing voice is present and the target sound
is present.
However, as FIG. 2 shows, these front-direction components are very small compared with the gain
in the other directions.
Consequently, the gain of the front suppression signal N(f,K) when no disturbing sound is present
and the target sound is present is smaller than when disturbing sound is present.
[0040]
Further, stated simply, the coherence COH(K) can be regarded as the correlation (a feature
quantity) between a signal arriving from the first direction (right) and a signal arriving from
the second direction (left).
Therefore, when the coherence COH(K) is small, the correlation between the two directivity
signals B1(f) and B2(f) is small; conversely, when COH(K) is large, the correlation is large.
A small correlation means either that the arrival direction of the sound deviates greatly to the
right or to the left, or that, even without such deviation, the signal has little definite
regularity, such as noise. Also, for example, when the microphones m_1 and m_2 are applied to the
microphone portion of the receiver of a telephone terminal (for example, a mobile telephone
terminal), there is a strong tendency for the speaker's voice (target voice) to arrive from the
front and for the disturbing voice to arrive from directions other than the front. As described
above, the coherence COH(K) is a feature quantity closely related to the arrival direction of the
input signal. Therefore, the value of the coherence COH(K) tends to be large when no disturbing
voice is present and the target sound is present, and tends to be small when disturbing voice is
present.
[0041]
Organizing the behavior of the above values with a focus on the presence or absence of disturbing
voice, the presence or absence of disturbing voice can be determined under the following
conditions. In the following, the determination method is described separately for the condition
that "no disturbing sound is present" and "the target sound is present" (hereinafter, the "first
condition") and for the condition that "disturbing sound is present" (hereinafter, the "second
condition").
[0042]
In the case of the first condition (no disturbing sound present and target sound present), the
coherence COH(K) takes a relatively large value, and the average front suppression signal
AVE_N(K) takes a value proportional to the magnitude of the target sound component.
[0043]
On the other hand, in the case of the second condition (when disturbing voice is present), the
value of the coherence COH(K) tends to be small, and the average front suppression signal
AVE_N(K) tends to be large.
[0044]
Therefore, when the correlation coefficient cor(K) between the average front suppression signal
AVE_N(K) and the coherence COH(K) is introduced, the relationship between the correlation
coefficient cor(K) and the presence or absence of disturbing voice is as follows.
[0045]
When there is no disturbing voice, the correlation coefficient cor (K) tends to be a positive value
(a value equal to or higher than a predetermined value indicating that the correlation is high).
On the other hand, when the disturbing voice is present, the correlation coefficient cor (K) tends
to be a negative value (a value less than a predetermined value indicating low correlation).
[0046]
That is, by introducing the correlation coefficient cor(K) between the average front suppression
signal AVE_N(K) and the coherence COH(K), the presence or absence of disturbing voice can be
determined, for example, by the simple process of checking whether cor(K) is positive or
negative.
[0047]
Therefore, the correlation calculation unit 40 of this embodiment first obtains the correlation
coefficient cor (K) for determining the presence or absence of the disturbing voice.
[0048]
The calculation method used by the correlation calculation unit 40 to obtain the correlation
coefficient cor(K) is not limited; for example, the calculation method described in Reference 2
(Kazuyuki Hiraoka and Gen Hori, "Probability and Statistics for Programming", Ohmsha, October 20,
2009) can be applied.
The correlation calculation unit 40 may obtain the correlation coefficient cor(K), for example,
using the following equation (5).
[0049]
In the following equation (5), Cov[AVE_N(K), COH(K)] denotes the covariance of the average front
suppression signal AVE_N(K) and the coherence COH(K).
Further, σ_AVE_N(K) denotes the standard deviation of the average front suppression signal
AVE_N(K), and σ_COH(K) denotes the standard deviation of the coherence COH(K).
cor(K) = Cov[AVE_N(K), COH(K)] / (σ_AVE_N(K) × σ_COH(K)) ...(5)
When obtaining the correlation coefficient cor(K) by equation (5), the standard deviations and
the covariance may be computed for AVE_N(K) and COH(K) using the results of a predetermined
number i of the most recently processed frames. Specifically, in the process of obtaining cor(K)
by equation (5), the standard deviations (σ_AVE_N(K) and σ_COH(K)) and the covariance
Cov[AVE_N(K), COH(K)] may each be obtained using the values of COH and AVE_N pertaining to the i
most recently processed frames (the (K−i)-th frame, the (K−(i−1))-th frame, and so on). In other
words, in obtaining the correlation coefficient cor(K), the correlation calculation unit 40 may
use the i most recently determined values of AVE_N and COH as samples when calculating the
standard deviations and the covariance in equation (5).
[0050]
Next, processing of the coherence filter processing unit 50 will be described.
[0051]
As shown in FIG. 1, the coherence filter processing unit 50 of this embodiment is described as
generating an audio signal in which the disturbing voice component of X1(f,K) is suppressed.
Accordingly, in this embodiment, the disturbing voice suppression signal O(f,K) is the signal
obtained by applying the disturbing voice suppression process (filter process) to X1(f,K).
[0052]
The coherence filter processing unit 50 may instead perform the process of suppressing the
disturbing voice component on both X1(f,K) and X2(f,K). Alternatively, the coherence filter
processing unit 50 may acquire a signal obtained by combining X1(f,K) and X2(f,K) (for example,
the average of the two signals) and perform the process of suppressing the disturbing voice
component on the acquired signal. The specific processing by which the coherence filter
processing unit 50 suppresses the disturbing sound will be described later.
[0053]
As described above, the coherence filter processing unit 50 obtains the disturbing voice
suppression signal O (f, K) for each frequency (frequency f1 to fm) of each frame, and supplies it
to the IFFT unit 60.
[0054]
Next, the process of the IFFT unit 60 will be described.
[0055]
The IFFT unit 60 converts the supplied O (f, K) from the frequency domain into a signal in the
time domain to generate a disturbed speech suppression signal o (n).
The IFFT unit 60 performs inverse transform processing on the transform processing performed
by the FFT unit 10.
In this embodiment, since the FFT unit 10 performs the FFT (fast Fourier transform), the IFFT
unit 60 performs the IFFT (inverse fast Fourier transform).
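The relationship between the FFT unit 10 and the IFFT unit 60, where the latter exactly inverts the former, can be checked with a round trip. Using `numpy.fft.rfft`/`irfft` as the transform pair is an assumption, and the windowing and overlap-add that a frame-based implementation would normally need are omitted:

```python
import numpy as np

frame = np.array([0.3, -0.1, 0.7, 0.2])  # one time-domain analysis frame
X = np.fft.rfft(frame)                   # forward transform (FFT unit 10)
o = np.fft.irfft(X, n=len(frame))        # inverse transform (IFFT unit 60)
```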
[0056]
Next, the details of the disturbing sound suppression process performed by the coherence filter
processing unit 50 will be described.
[0057]
The coherence filter processing unit 50 of this embodiment performs a flooring process on the
coherence filter coefficient coef(f,K) based on the aforementioned correlation coefficient
cor(K).
Then, the coherence filter processing unit 50 multiplies the input signal X1(f,K) by the floored
coherence filter coefficient coef(f,K) to obtain the disturbing speech suppression signal
O(f,K).
[0058]
The flooring threshold Θ(K) used in the flooring process should desirably take a larger value as
the influence of the disturbing voice becomes smaller, and a smaller value as the influence
becomes larger. Such control can be realized by adding the correlation coefficient cor(K), whose
sign changes depending on the presence or absence of disturbing voice, to a preset flooring
threshold. The coherence filter processing unit 50 may obtain the flooring threshold Θ(K), for
example, using the following equation (6). In equation (6), Ψ(K) is a predetermined
constant. Θ(K) = Ψ(K) + cor(K) ...(6)
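Equation (6) and the subsequent flooring of coef(f, K) can be sketched together as below. Clipping the coefficients from below with the threshold is an assumption about the exact flooring mechanics, and the numeric values of Ψ(K) and cor(K) are invented:

```python
import numpy as np

def flooring_threshold(psi, cor):
    """Equation (6): Theta(K) = Psi(K) + cor(K). The floor rises when no
    disturbing voice is present (cor > 0) and falls when disturbing
    voice is present (cor < 0)."""
    return psi + cor

def apply_flooring(coef_fK, theta):
    """Flooring process: coefficients below the threshold are replaced by
    the threshold itself (an assumed clipping-from-below mechanic)."""
    return np.maximum(coef_fK, theta)

coef = np.array([0.05, 0.30, 0.80])
theta = flooring_threshold(0.2, -0.1)  # disturbing voice lowers the floor
floored = apply_flooring(coef, theta)  # only the smallest bin is raised
```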
[0059]
The coherence filter processing unit 50 performs a flooring process on the coherence filter
coefficient coef (f, K) using the generated flooring threshold Θ (K) (details will be described in
the operation section).
[0060]
Then, the coherence filter processing unit 50 suppresses the disturbing sound component of the
input signal (in this embodiment, X1(f,K)) using the floored coherence filter coefficient
coef(f,K) to generate the disturbing speech suppression signal O(f,K).
In the example of this embodiment, the coherence filter processing unit 50 obtains the disturbing
speech suppression signal O(f,K) by multiplying the input signal X1(f,K) by the coherence filter
coefficient coef(f,K) for each frequency component, as in the following equation (7). Disturbing
speech suppression signal O(f,K) = input signal X1(f,K) × coherence filter coefficient coef(f,K)
...(7)
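Equation (7) is a per-bin multiplication of the input spectrum by the floored coefficient; it can be sketched as follows, with the numeric values invented for illustration:

```python
import numpy as np

def suppress(X1, coef_floored):
    """Equation (7): O(f, K) = X1(f, K) * coef(f, K), applied per
    frequency component after the flooring process."""
    return X1 * coef_floored

X1 = np.array([1.0 + 1.0j, -2.0 + 0.5j, 0.5 - 0.5j])  # input spectrum
coef_floored = np.array([0.1, 0.6, 1.0])              # floored coefficients
O = suppress(X1, coef_floored)  # low-coherence bins are attenuated most
```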
[0061]
(A-2) Operation of Embodiment Next, the operation of the speech processing apparatus 1 of this
embodiment having the above-described configuration will be described.
[0062]
First, the overall operation of the speech processing device 1 will be described with reference to
FIG.
[0063]
It is assumed that input signals s1(n) and s2(n) for one frame (one processing unit) are supplied
from the microphones m_1 and m_2 to the FFT unit 10 via an AD converter (not shown).
Then, the FFT unit 10 Fourier-transforms the analysis frames FRAME1(K) and FRAME2(K) based on the
input signals s1(n) and s2(n) for one frame to obtain the signals X1(f,K) and X2(f,K).
Then, the signals X1 (f, K) and X2 (f, K) generated by the FFT unit 10 are supplied to the front
suppression signal generation unit 20 and the coherence calculation unit 30. Also, the signal X 1
(f, K) generated by the FFT unit 10 is supplied to the coherence filter processing unit 50.
[0064]
The front suppression signal generation unit 20 calculates the front suppression signal N (f, K) based on the supplied X1 (f, K) and X2 (f, K). Then, the front suppression signal generation unit 20 calculates the average front suppression signal AVE_N (K) based on the front suppression signal N (f, K), and supplies the average front suppression signal AVE_N (K) to the correlation calculation unit 40.
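A common way to place a dead angle (null) toward the front of a two-microphone pair is to subtract the two spectra, since a front-arriving source reaches both microphones in phase. The sketch below assumes that formulation; the patent's exact expressions for N (f, K) and AVE_N (K) are not reproduced in this excerpt:

```python
import numpy as np

def front_suppression(X1, X2):
    """Sketch of the front suppression signal generation unit 20,
    under the ASSUMED formulation N(f, K) = X1 - X2, which cancels
    a source arriving in phase at both microphones (from the front).
    AVE_N(K) is taken as the mean magnitude over frequency bins."""
    N = np.asarray(X1) - np.asarray(X2)   # front-null spectrum N(f, K)
    ave_N = float(np.mean(np.abs(N)))     # average front suppression AVE_N(K)
    return N, ave_N
```

With this formulation, a frame dominated by front-arriving target speech yields a small AVE_N (K), while lateral disturbing voice yields a large one.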
[0065]
On the other hand, the coherence calculation unit 30 generates the coherence COH (K) and the
coherence filter coefficient coef (f, K) based on the supplied X1 (f, K) and X2 (f, K). Then, the
coherence calculation unit 30 supplies the generated coherence COH (K) to the correlation
calculation unit 40, and supplies the coherence filter coefficient coef (f, K) to the coherence filter
processing unit 50.
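One common definition of the per-bin coherence coefficient in two-channel coherence filtering is the normalized cross-spectrum magnitude, averaged over frequency to give the frame coherence. The sketch below assumes that definition applied directly to the two spectra; the patent's exact equations are not in this excerpt:

```python
import numpy as np

def coherence(B1, B2, eps=1e-12):
    """Sketch of the coherence calculation unit 30, assuming the
    common two-channel definition: per-bin coefficient coef(f, K)
    as the normalized cross-spectrum magnitude, and COH(K) as its
    average over frequency. The exact formulas are an assumption."""
    B1 = np.asarray(B1)
    B2 = np.asarray(B2)
    num = np.abs(B1 * np.conj(B2))
    den = 0.5 * (np.abs(B1) ** 2 + np.abs(B2) ** 2) + eps
    coef = num / den                 # coherence filter coefficient coef(f, K)
    COH = float(np.mean(coef))      # frame coherence COH(K)
    return coef, COH
```

Identical channels give coef near 1 (coherent, front-arriving sound), while uncorrelated channels give values near 0.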
[0066]
The correlation calculation unit 40 calculates the correlation coefficient cor (K) based on the average front suppression signal AVE_N (K) and the coherence COH (K), and supplies the correlation coefficient cor (K) to the coherence filter processing unit 50.
[0067]
The coherence filter processing unit 50 calculates a flooring threshold Θ (K) from the supplied
correlation coefficient cor (K).
The coherence filter processing unit 50 performs a flooring process on the supplied coherence filter coefficient coef (f, K) using the calculated flooring threshold Θ (K). Then, the coherence filter processing unit 50 suppresses the disturbing voice (disturbing voice component) of the input signal X1 (f, K) using the coherence filter coefficient coef (f, K) subjected to the flooring process, thereby generating the disturbing speech suppression signal O (f, K), which it supplies to the IFFT unit 60.
[0068]
The IFFT unit 60 performs an inverse Fourier transform (IFFT) process on the supplied disturbed
speech suppression signal O (f, K), converts it to a time domain disturbed speech suppression
signal o (n), and outputs it.
[0069]
Next, the operation details of the coherence filter processing unit 50 will be described using the
flowcharts of FIGS. 5 and 6.
[0070]
FIG. 5 is a flowchart showing processing of the coherence filter processing unit 50.
FIG. 6 is a flowchart showing part of the processing (flooring processing) of the flowchart of FIG.
The coherence filter processing unit 50 executes the processing of the flowcharts of FIGS. 5 and 6 each time the correlation coefficient cor (K), the coherence filter coefficient coef (f, K), and the input signal X1 (f, K) are supplied, and outputs a disturbing speech suppression signal O (f, K).
[0071]
First, it is assumed that the coherence filter processing unit 50 is supplied with the correlation coefficient cor (K), the coherence filter coefficient coef (f, K), and the input signal X1 (f, K) (S101).
[0072]
Next, the coherence filter processing unit 50 calculates a flooring threshold Θ (K) based on the
correlation coefficient cor (K) (S102).
Specifically, the coherence filter processing unit 50 can obtain the flooring threshold Θ (K) using
the above-mentioned equation (6).
[0073]
Next, the coherence filter processing unit 50 performs a flooring process on the coherence filter
coefficient coef (f, K) using the calculated flooring threshold Θ (K) (S103).
[0074]
Next, the coherence filter processing unit 50 suppresses the interference sound component of the input signal X1 (f, K) (filter processing) using the coherence filter coefficient coef (f, K) subjected to the flooring process, thereby generating the disturbing speech suppression signal O (f, K) (S104).
Specifically, the coherence filter processing unit 50 obtains the disturbing speech suppression signal O (f, K) by multiplying the input signal X1 (f, K) by the floored coherence filter coefficient coef (f, K) for each frequency component as in the above-mentioned equation (7).
[0075]
Next, the coherence filter processing unit 50 outputs (transmits to the IFFT unit 60) the
determined disturbing voice suppression signal O (f, K) (S105).
[0076]
Next, a specific example of the flooring process performed by the coherence filter processing unit
50 in the above-described step S103 will be described using the flowchart of FIG.
[0077]
When the flooring processing is started, the coherence filter processing unit 50 confirms the
values of the coherence filter coefficient coef (f, K) and the flooring threshold Θ (K) (S201), and
compares the magnitudes of the two values.
[0078]
When the coherence filter coefficient coef (f, K) is smaller than the flooring threshold Θ (K), the coherence filter processing unit 50 performs a process of replacing the value of the coherence filter coefficient coef (f, K) with the flooring threshold Θ (K) (S202).
[0079]
On the other hand, when the coherence filter coefficient coef (f, K) is equal to or greater than the flooring threshold Θ (K), the coherence filter processing unit 50 leaves the value of the coherence filter coefficient coef (f, K) as it is (performs no particular processing) (S203).
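Steps S201 to S203 amount to clamping each coefficient from below at the threshold, as in the following sketch (the function name is illustrative):

```python
def flooring(coef, theta):
    """Flooring process of FIG. 6 (S201-S203): for each frequency bin,
    if the coherence filter coefficient coef(f, K) is below the
    flooring threshold theta(K), replace it with theta (S202);
    otherwise leave it unchanged (S203)."""
    return [theta if c < theta else c for c in coef]
```

Raising Θ (K) when disturbing voice is absent (positive cor (K)) limits how strongly the filter can attenuate, preserving sound quality; lowering it when disturbing voice is present permits deeper suppression.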
[0080]
(A-3) Effects of the Embodiment According to this embodiment, the following effects can be
achieved.
[0081]
In the speech processing device 1 of this embodiment, the accuracy of the disturbing speech suppression can be improved by subjecting the coherence filter coefficient coef (f, K) to a flooring process based on the characteristic behavior that the correlation coefficient cor (K) between the average front suppression signal AVE_N (K) and the coherence COH (K) is negative when disturbing speech is present and positive when it is absent. Thereby, in the voice processing device 1 of this embodiment, the disturbing voice can be suppressed without excess or deficiency and without degrading the sound quality of the input signal. That is, improved performance can be expected in the voice processing of the voice processing device 1 (for example, in communication devices such as television conference systems and mobile phones, or in the preprocessing of a voice recognition function).
[0082]
(B) Other Embodiments The present invention is not limited to the above-described embodiment,
and may include modified embodiments as exemplified below.
[0083]
(B-1) Although the speech processing device 1 of the above embodiment calculates the flooring threshold Θ (K) by adding the correlation coefficient cor (K) (for example, equation (6)), any other operation may be used as long as a desired flooring characteristic is obtained according to the magnitude of the contribution of the disturbing voice.
[0084]
(B-2) The speech processing device 1 of the above embodiment calculates the flooring threshold Θ (K) in frame units, but the present invention is not limited thereto. For example, the flooring threshold Θ (K) may be obtained for each frequency bin.
In this case, the speech processing device 1 may also calculate the correlation coefficient cor (K), which is the basis of the flooring threshold Θ (K), for each frequency bin.
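Under this variation the flooring comparison is made against a per-bin threshold rather than a single frame-level value. A minimal sketch, assuming the threshold is simply supplied as an array over frequency (the per-bin computation of cor is not specified in this excerpt):

```python
import numpy as np

def flooring_per_bin(coef, theta):
    """Variation (B-2): flooring with a threshold defined per frequency
    bin. Here theta is an array over f; how each per-bin threshold is
    derived from a per-bin correlation coefficient is an assumption
    left open by the excerpt."""
    coef = np.asarray(coef, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return np.maximum(coef, theta)  # clamp each bin from below
```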
[0085]
(B-3) Although the voice processing device 1 of the above embodiment has been described with an example of performing processing based on input signals supplied from two microphones, the processing may be performed based on input signals supplied from three or more microphones.
For example, the voice processing device 1 may acquire, based on input signals supplied from three or more microphones, the front suppression signal N (f, K) having a dead angle in the front direction, or the directivity signals B1 (f) and B2 (f) having directivity in predetermined directions other than the front, and perform processing similar to that of the above embodiment.
That is, in the audio processing device 1, the configuration of the microphone for acquiring the
front suppression signal N (f, K) and the directivity signals B1 (f) and B2 (f) is not limited.
[0086]
(B-4) In the coherence filter processing unit 50 of the above embodiment, the correlation coefficient cor (K) between the average front suppression signal AVE_N (K) and the coherence COH (K) is applied as the feature quantity representing the relationship between the two, but other types of values may be applied as the feature quantity.
For example, the coherence filter processing unit 50 may apply, as the feature representing the relationship between the average front suppression signal AVE_N (K) and the coherence COH (K), the variance of the average front suppression signal AVE_N (K) and the coherence COH (K).
[0087]
DESCRIPTION OF SYMBOLS 1 ... Speech processing apparatus, 10 ... FFT unit, 20 ... Front suppression signal generation unit, 30 ... Coherence calculation unit, 40 ... Correlation calculation unit, 50 ... Coherence filter processing unit, 60 ... IFFT unit, m_1, m_2 ... Microphone.