JPH0667691

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPH0667691
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
noise removal apparatus for use in a speech recognition apparatus or the like, for removing noise
from speech uttered in noise.
[0002]
2. Description of the Related Art When speech recognition and speech communication are
performed, various noises exist depending on the use environment, and these noises are a major
factor that lowers the recognition rate of speech recognition and hinders speech communication.
[0003]
Conventionally, there is a method called two-input spectral subtraction that uses two microphones,
a voice microphone mainly inputting voice and a noise microphone mainly inputting ambient noise.
The noise component contained in the voice microphone input is estimated, and the estimated noise
is removed from the noisy voice to obtain clear speech.
[0004]
For example, a noise removal apparatus using two-input spectral subtraction as described in
Shibamura et al., "In-vehicle speech recognition using a two-input noise removal method,"
Technical Report of IEICE, pp. 41-48 (1989) (hereinafter referred to as cited document [1]) has a
03-05-2019
1
configuration as shown in FIG. 16.
That is, in FIG. 16, two channels are input simultaneously: a voice microphone 201, placed in front
of the speaker's mouth to mainly input voice, and a noise microphone 202, installed at a position
where it receives ambient noise as similar as possible to the noise entering the voice microphone
and where voice is mixed in as little as possible.
The noisy voice input by the voice microphone 201 is converted into a time-series feature vector
of noisy voice in the voice feature extraction unit 203, and the ambient noise input by the noise
microphone 202 is converted into a time-series feature vector of noise in the noise feature
extraction unit 204. In the 2-input subtraction unit 205, first, the noise component included in
the time-series feature vector of the noisy voice obtained from the feature extraction unit 203 is
estimated using the time-series feature vector of ambient noise obtained from the feature
extraction unit 204. For this noise component estimation, for example, the two inputs are compared
at a time position not including speech, a correction coefficient between the two inputs is
calculated in advance, and the entire time-series feature vector of ambient noise obtained from
the noise feature extraction unit 204 is multiplied by the obtained correction coefficient. Next,
the 2-input subtraction unit 205 subtracts the estimated time-series feature vector of noise from
the entire time-series feature vector of the noisy voice obtained from the voice feature extraction
unit 203, and outputs the time-series feature vector of clear speech after noise removal. By
performing speech recognition using the clear speech time-series feature vector obtained here,
speech recognition with less deterioration of the recognition rate due to noise is to be realized.
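The conventional procedure described above can be illustrated with a minimal Python sketch. It is
not taken from the cited document [1]. It assumes that each feature vector is a power spectrum held
as a list of floats, that the correction coefficient is computed per band as a ratio of summed
powers over speech-free frames, and that negative results are floored at zero. The names
correction_coefficient, two_input_subtract and noise_only are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # one feature vector per frame (assumed: power spectrum)


def correction_coefficient(voice: Sequence[Frame], noise: Sequence[Frame],
                           noise_only: Sequence[int]) -> Frame:
    """Per-band ratio of voice-mic power to noise-mic power over speech-free frames."""
    bands = len(voice[0])
    num = [0.0] * bands
    den = [0.0] * bands
    for t in noise_only:
        for k in range(bands):
            num[k] += voice[t][k]
            den[k] += noise[t][k]
    # Assumption: a band with no noise-mic power gets coefficient 0.
    return [num[k] / den[k] if den[k] > 0 else 0.0 for k in range(bands)]


def two_input_subtract(voice: Sequence[Frame], noise: Sequence[Frame],
                       coeff: Frame, floor: float = 0.0) -> List[Frame]:
    """Scale the noise features by the coefficient and subtract them from every voice frame."""
    return [[max(x - c * y, floor) for x, y, c in zip(xf, yf, coeff)]
            for xf, yf in zip(voice, noise)]


# Frames 0 and 1 contain noise only; frame 2 contains speech plus noise.
voice = [[2.0, 4.0], [2.0, 4.0], [7.0, 9.0]]
noise = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
coeff = correction_coefficient(voice, noise, noise_only=[0, 1])
clean = two_input_subtract(voice, noise, coeff)
```

In this toy input the voice microphone receives the noise at twice the power seen by the noise
microphone, so the coefficient is 2 in both bands and the noise-only frames are reduced to zero.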
[0005]
However, a normal noise environment includes non-stationary noise sources whose characteristics
change temporally and spatially, such as the sound of a moving object or human speech, and the
noise transfer characteristics change as well. Because noise arrives from various directions from
time to time, in conventional two-input spectral subtraction using a single noise microphone the
noise component input to the voice microphone and the noise input to the noise microphone are not
always the same. As a result, the estimation of the noise contained in the speech has an error and
the denoising effect is reduced.
Also, in the conventional two-input spectral subtraction, depending on the installation method of
the noise microphone or the noise environment in use, the uttered voice may be mixed into the
noise microphone. Since the feature vector containing the mixed-in voice is subtracted from the
feature vector obtained from the voice microphone, the feature vector components of voice that
should not be removed may be removed. This has the disadvantage that the speech recognition
rate or the intelligibility of communication is significantly reduced.
[0006]
SUMMARY OF THE INVENTION The object of the present invention is to solve the above-mentioned
problems, that is, to efficiently remove non-stationary noise whose properties change temporally
and spatially, and to provide a stable noise removal device that does not remove a necessary
voice signal even when voice is mixed into the noise microphones.
[0007]
According to a first aspect of the present invention, there is provided a noise removal apparatus
comprising: a voice microphone for mainly inputting voice; a plurality of noise microphones
arranged around the voice microphone for mainly inputting ambient noise; a voice feature
extraction unit that converts the output signal of the voice microphone into a time-series feature
vector of voice; a plurality of noise feature extraction units that respectively convert the
output signals of the plurality of noise microphones into time-series feature vectors of noise; a
noise detection unit that selects the time-series feature vector of noise closest to ambient noise
from among the time-series feature vectors of noise obtained from the plurality of noise feature
extraction units; a selection unit that selects and outputs the time-series feature vector of
noise selected by the noise detection unit; and a two-input subtraction unit that subtracts the
time-series feature vector of noise output by the selection unit from the time-series feature
vector of voice output by the voice feature extraction unit.
[0008]
According to a second aspect of the present invention, there is provided a noise removal apparatus
comprising: a voice microphone for mainly inputting voice; a plurality of noise microphones
arranged around the voice microphone for mainly inputting ambient noise; a voice feature
extraction unit that converts the output signal of the voice microphone into a time-series feature
vector of voice; a minimum power detection unit that selects the output signal of the noise
microphone with the lowest power among the output signals of the plurality of noise microphones; a
selection unit that selects and outputs the output signal of the noise microphone selected by the
minimum power detection unit; a noise feature extraction unit that converts the output signal of
the noise microphone output from the selection unit into a time-series feature vector of noise;
and a two-input subtraction unit that subtracts the time-series feature vector of noise output
from the noise feature extraction unit from the time-series feature vector of voice output from
the voice feature extraction unit.
[0009]
According to a third aspect of the present invention, there is provided a noise removal apparatus
comprising: a voice microphone for mainly inputting voice; a plurality of noise microphones
arranged around the voice microphone for mainly inputting ambient noise; a voice feature
extraction unit that converts the output signal of the voice microphone into a time-series feature
vector of voice; a plurality of noise feature extraction units that convert the output signals of
the plurality of noise microphones into time-series feature vectors of noise; a similarity
calculation unit that calculates and outputs the similarity between each time-series feature
vector of noise output from the plurality of noise feature extraction units and the time-series
feature vector of voice output by the voice feature extraction unit; a maximum value detection
unit that selects the highest similarity among the similarities output by the similarity
calculation unit; a selection unit that selects and outputs, from among the time-series feature
vectors of noise, the one corresponding to the similarity selected by the maximum value detection
unit; and a two-input subtraction unit that subtracts the time-series feature vector of noise
output by the selection unit from the time-series feature vector of voice output by the voice
feature extraction unit.
[0010]
According to a fourth aspect of the present invention, in the third aspect, the apparatus further
comprises a weight addition unit that applies predetermined weights to the similarities output by
the similarity calculation unit and outputs weighted similarities, and the maximum value detection
unit selects the highest similarity among the weighted similarities output by the weight addition
unit.
[0011]
According to a fifth aspect of the present invention, there is provided a noise removal apparatus
comprising: a voice microphone for mainly inputting voice; a plurality of noise microphones
arranged around the voice microphone for mainly inputting ambient noise; a voice feature
extraction unit that converts the output signal of the voice microphone into a time-series feature
vector of voice; a voice partial feature extraction unit that converts the output signal of the
voice microphone into a time-series feature vector of a voice partial band; a plurality of partial
feature extraction units that convert each of the output signals of the plurality of noise
microphones into a time-series feature vector of a noise partial band; a sub-band similarity
calculation unit that calculates and outputs the similarity between each time-series feature
vector of the noise partial band output by the plurality of partial feature extraction units and
the time-series feature vector of the voice partial band output by the voice partial feature
extraction unit; a maximum value detection unit that selects the highest similarity among the
similarities output by the sub-band similarity calculation unit; a selection unit that selects and
outputs, from among the output signals of the plurality of noise microphones, the output signal of
the noise microphone corresponding to the similarity selected by the maximum value detection unit;
a noise feature extraction unit that converts the output signal of the noise microphone output by
the selection unit into a time-series feature vector of noise; and a two-input subtraction unit
that subtracts the time-series feature vector of noise output from the noise feature extraction
unit from the time-series feature vector of voice output from the voice feature extraction unit.
[0012]
A sixth invention is characterized in that, in the third and fourth inventions, a minimum value
detection unit that selects the lowest similarity among the input similarities is provided
instead of the maximum value detection unit.
[0013]
A seventh aspect of the present invention is a noise removal apparatus comprising: a voice
microphone for mainly inputting voice; a plurality of noise microphones arranged around the voice
microphone for mainly inputting ambient noise; a voice feature extraction unit that converts the
output signal of the voice microphone into a time-series feature vector of voice; a plurality of
noise feature extraction units that convert the output signals of the plurality of noise
microphones into time-series feature vectors of noise; an average value combining unit that
averages the time-series feature vectors of noise obtained from the plurality of noise feature
extraction units and outputs the averaged feature vector as a combined vector of noise; and a
two-input subtraction unit that subtracts the combined vector of noise output from the average
value combining unit from the time-series feature vector of voice output by the voice feature
extraction unit.
[0014]
The eighth invention is the seventh invention in which, instead of the average value combining
unit, a weighted average value combining unit is provided that applies predetermined weights to
the time-series feature vectors of noise output from the noise feature extraction units, averages
them, and outputs the averaged feature vector as a combined vector of noise.
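The averaging of the seventh invention and the weighted averaging of the eighth invention can be
sketched in Python as follows. This is only an illustration under assumptions: the feature vectors
at one time instant are lists of floats, weights are applied multiplicatively, and the weighted sum
is normalized by the sum of the weights. The name combine_average is hypothetical.

```python
from typing import List, Optional, Sequence


def combine_average(noise_frames: Sequence[Sequence[float]],
                    weights: Optional[Sequence[float]] = None) -> List[float]:
    """Average the feature vectors of N noise microphones at one time instant.

    With weights=None this is a plain average (seventh invention); with
    weights it is a weighted average (eighth invention). Normalizing by the
    sum of the weights is an assumption of this sketch.
    """
    if weights is None:
        weights = [1.0] * len(noise_frames)
    total = sum(weights)
    bands = len(noise_frames[0])
    return [sum(w * f[k] for w, f in zip(weights, noise_frames)) / total
            for k in range(bands)]


frames = [[1.0, 4.0], [3.0, 8.0]]          # two noise microphones, two bands
plain = combine_average(frames)
weighted = combine_average(frames, weights=[3.0, 1.0])
```

The combined vector would then take the place of the single noise feature vector in the two-input
subtraction.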
[0015]
A ninth aspect of the present invention is a noise removal apparatus comprising: a voice
microphone for mainly inputting voice; a plurality of noise microphones arranged around the voice
microphone for mainly inputting ambient noise; a voice feature extraction unit that converts the
output signal of the voice microphone into a time-series feature vector of voice; first to Nth
noise feature extraction units that convert the output signals of the plurality of noise
microphones into first to Nth time-series feature vectors of noise; a division unit that divides
each of the time-series feature vectors of noise output by the noise feature extraction units into
a plurality of bands and outputs them; a minimum value combining unit that extracts the minimum
power in each band from the band-divided time-series feature vectors of noise output by the
division unit, combines the minimum values of the respective bands, and outputs the result as a
combined vector of noise; and a two-input subtraction unit that subtracts the combined vector of
noise output by the minimum value combining unit from the time-series feature vector of voice
output from the voice feature extraction unit.
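The band division and per-band minimum combination of the ninth invention can be sketched as
follows. The sketch assumes that each feature vector is a power spectrum, that the power of a band
is the sum of its components, and that the components of the microphone with the smallest band
power are copied into the combined vector. The names split_bands, combine_minimum and edges are
hypothetical.

```python
from typing import List, Sequence


def split_bands(frame: Sequence[float], edges: Sequence[int]) -> List[List[float]]:
    """Divide one feature vector into bands delimited by index edges."""
    return [list(frame[a:b]) for a, b in zip(edges[:-1], edges[1:])]


def combine_minimum(noise_frames: Sequence[Sequence[float]],
                    edges: Sequence[int]) -> List[float]:
    """For each band, keep the components of the microphone with the least band power."""
    per_mic = [split_bands(f, edges) for f in noise_frames]
    combined: List[float] = []
    for b in range(len(edges) - 1):
        # Assumption: band power is the sum of the power-spectrum components.
        best = min(per_mic, key=lambda bands: sum(bands[b]))
        combined.extend(best[b])
    return combined


frames = [[1.0, 1.0, 9.0, 9.0],   # microphone 1: quiet in the low band
          [5.0, 5.0, 2.0, 3.0]]   # microphone 2: quiet in the high band
combined = combine_minimum(frames, edges=[0, 2, 4])
```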
[0016]
According to a tenth aspect of the present invention, in the first aspect, a noise section
detection unit is provided that uses the output signal obtained from the voice microphone to
detect a section in which no voice exists as a noise section, and the noise detection unit selects
the time-series feature vector of noise using the time-series feature vectors of noise within the
noise section detected by the noise section detection unit.
[0017]
In an eleventh aspect based on the second aspect, a noise section detection unit is provided that
uses the output signal obtained from the voice microphone to detect a section in which no voice
exists as a noise section, and the minimum power detection unit selects the output signal of the
noise microphone using the output signals of the noise microphones within the noise section
detected by the noise section detection unit.
[0018]
According to a twelfth aspect of the present invention, in the third or fourth aspect, a noise
section detection unit is provided that uses the output signal obtained from the voice microphone
to detect a section in which no voice exists as a noise section, and the similarity calculation
unit calculates and outputs the similarity using the time-series feature vectors of noise within
the noise section detected by the noise section detection unit.
[0019]
In a thirteenth aspect of the present invention, in the fifth aspect, a noise section detection
unit is provided that uses the output signal obtained from the voice microphone to detect a
section in which no voice exists as a noise section, and the sub-band similarity calculation unit
calculates and outputs the similarity using the time-series feature vectors of the noise sub-band
within the noise section detected by the noise section detection unit.
[0020]
A fourteenth invention is characterized in that, in the tenth, eleventh, twelfth or thirteenth
invention, the noise interval detecting section detects an interval in which no voice exists as a
noise interval using the feature vector outputted by the 2-input subtraction section.
[0021]
In a fifteenth invention according to the third, fourth or fifth invention, a noise section
detection unit is provided that detects a section in which no voice exists as a noise section,
using either the output signal obtained from the voice microphone or the feature vector output by
the 2-input subtraction unit. Instead of the maximum value detection unit, a maximum/minimum value
detection unit is provided that selects the highest similarity among the input similarities within
the noise section detected by the noise section detection unit, and selects the lowest similarity
among the input similarities when the noise section detection unit does not detect a noise
section.
[0022]
The operation of the first invention will be described with reference to FIG. 1.
The voice containing noise is converted into an electrical signal by the voice microphone 1.
At the same time, ambient noise is converted into an electrical signal by two or more first to Nth
noise microphones 2 placed around the voice microphone 1.
There are many ways of installing the two or more first to Nth noise microphones 2. For example,
they may be arranged around the voice microphone at an appropriate distance, arranged radially to
cope with noise coming from all directions, or directed towards a specific noise source.
The voice feature extraction unit 3 is a converter that converts the electrical signal obtained
from the voice microphone 1 into time-series feature quantities that express acoustic features in
time series. It consists of, for example, a DFT (discrete Fourier transform), an FFT (fast Fourier
transform), or a BPF (band-pass filter bank) as described in Furui, "Digital Voice Processing,"
Tokai University Press, pp. 37-49 (1985) (hereinafter referred to as cited document [2]), and
outputs time-series data of feature vectors such as a power spectrum, an amplitude spectrum, or
BPF outputs.
Further, the first to Nth noise feature extraction units 4 convert the electrical signals obtained
from the two or more first to Nth noise microphones 2 into time-series feature quantities that
express acoustic features in time series, and output the first to Nth time-series feature vectors
of noise.
The first to Nth noise feature extraction units 4 have the same functions as the voice feature
extraction unit 3.
The noise detection unit 5 selects the nth time series feature vector of noise closest to the
ambient noise from the first to Nth time series feature vectors of the noise obtained from the first
to Nth noise feature extraction section 4.
The determination as to whether the noise is closest to the ambient noise can be made, for
example, as follows. Let Yi(t) (1 ≦ i ≦ N, t: time) be the first to Nth time-series feature
vectors of noise, and let R be a feature vector of ambient noise stored in advance. At time t, the
index n of the n-th time-series feature vector Yn(t) whose vector distance to the feature vector R
of the ambient noise is minimum is obtained by n = argmin(i)[‖Yi(t) − R‖].
Here, argmin(i)[] is a function that returns the i giving the minimum value of the operation
result in [].
The determination as to whether the noise is closest to the ambient noise can also be made by
using information on the frequency distribution, such as whether the low-frequency power is larger
than the high-frequency power.
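The selection n = argmin(i)[‖Yi(t) − R‖] can be written as a short Python sketch. The source does
not state which norm is used, so the Euclidean distance here is an assumption, as are the names
distance and select_closest_to_reference. Indices in the code are 0-based, whereas the text counts
microphones from 1.

```python
import math
from typing import Sequence


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors (assumed norm)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_closest_to_reference(noise_frames: Sequence[Sequence[float]],
                                reference: Sequence[float]) -> int:
    """Return the 0-based index i that minimizes ||Yi(t) - R||."""
    return min(range(len(noise_frames)),
               key=lambda i: distance(noise_frames[i], reference))


R = [1.0, 2.0, 3.0]                 # stored ambient-noise feature vector
Y = [[4.0, 4.0, 4.0],               # Y1(t)
     [1.0, 2.0, 2.5],               # Y2(t), closest to R
     [0.0, 0.0, 0.0]]               # Y3(t)
n = select_closest_to_reference(Y, R)
```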
The n-th time-series feature vector of noise selected by the noise detection unit 5 is selected and
output by the selection unit 6.
The 2-input subtraction unit 7 performs 2-input spectral subtraction by subtracting the n-th
time-series feature vector of noise output from the selection unit 6 from the time-series feature
vector of the noisy voice output from the voice feature extraction unit 3, thereby removing the
noise contained in the voice.
The two-input subtraction unit 7 has the same function as that of the two-input subtraction unit
205 shown in FIG. 16 as described in, for example, the cited document [1].
That is, according to the first aspect of the present invention, the n-th time-series feature
vector of noise closest to ambient noise is selected from the first to Nth time-series feature
vectors of noise derived from the two or more first to Nth noise microphones 2. This has the
effect that the output from the noise microphone with the highest noise removal effect is always
selected, even if the noise source moves or the noise transfer characteristic changes temporally
and spatially.
Also, by selecting the n-th time-series feature vector of noise closest to ambient noise, the
output signal from a noise microphone with a large amount of voice wraparound is not selected.
This has the effect of preventing a decrease in the speech recognition rate or in the
intelligibility of communication.
[0023]
The operation of the second invention will be described with reference to FIG.
The noisy voice is converted into an electrical signal by the voice microphone 11, and at the same
time, ambient noise is converted into electrical signals by two or more first to Nth noise
microphones 12 placed around the voice microphone 11. The voice feature extraction unit 13 is a
converter that converts the electrical signal obtained from the voice microphone 11 into
time-series feature quantities that express acoustic features in time series, and has the same
function as the voice feature extraction unit 3 in FIG. 1. The minimum power detection unit 14
selects the output signal of the n-th noise microphone whose power is the smallest among the
output signals of the two or more first to Nth noise microphones 12. That is, letting Pi(t)
(1 ≦ i ≦ N) be the first to Nth powers obtained from the first to Nth noise microphones 12, the
minimum power detection unit 14 computes n = argmin(i)[Pi(t)] and obtains the n for which the
power is smallest. The power of the output signal used here may be the power of the signal limited
to a partial band. The output signal of the n-th noise microphone selected by the minimum power
detection unit 14 is selected by the selection unit 15 and output. The output signal of the nth
noise microphone selected in the selection unit 15 is converted into a time-series feature vector
of noise in the noise feature extraction unit 16. The noise feature extraction unit 16 has the same
function as the voice feature extraction unit 3 in FIG. 1. The 2-input subtraction unit 17 has the
same function as the 2-input subtraction unit 7 in FIG. 1, and performs two-input spectral
subtraction by subtracting the time-series feature vector of noise output by the noise feature
extraction unit 16 from the time-series feature vector of voice output by the voice feature
extraction unit 13. That is, in the second invention, when there is no noise source in the
vicinity of the voice microphone 11 and a noise source is moving in the vicinity of the plurality
of noise microphones, the input from that specific noise source is excluded. The correlation
between the noise input to the voice microphone 11 and the noise obtained from the selection unit
15 is therefore high, and higher noise removal performance is obtained than when two-input
spectral subtraction is performed using one conventional noise microphone. Also, by using the
output signal from the noise microphone having the minimum power, the output signal from a noise
microphone with a large amount of voice wraparound is not selected, so there is an effect that a
decrease in the speech recognition rate or in communication intelligibility caused by voice
wraparound into the noise microphone can be prevented.
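The minimum power selection n = argmin(i)[Pi(t)] can be sketched in Python as below. The source
does not define how the power Pi(t) is measured, so this sketch assumes the mean square of the
samples in the current analysis frame. The names signal_power and select_min_power are
hypothetical, and the returned index is 0-based.

```python
from typing import Sequence


def signal_power(samples: Sequence[float]) -> float:
    """Mean-square power of one frame of samples (assumed definition of Pi(t))."""
    return sum(s * s for s in samples) / len(samples)


def select_min_power(mic_signals: Sequence[Sequence[float]]) -> int:
    """Return the 0-based index of the noise microphone with the least power."""
    return min(range(len(mic_signals)),
               key=lambda i: signal_power(mic_signals[i]))


signals = [[0.5, -0.5, 0.5, -0.5],    # noise microphone 1
           [0.1, -0.1, 0.1, -0.1],    # noise microphone 2 (quietest)
           [1.0, -1.0, 1.0, -1.0]]    # noise microphone 3 (e.g. near a moving source)
n = select_min_power(signals)
```

Only the selected signal would then be passed to the noise feature extraction unit, which is why
this variant needs a single noise feature extractor.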
[0024]
The operation of the third invention will be described with reference to FIG. 3. The noisy voice
is converted into an electrical signal by the voice microphone 21, and at the same time, ambient
noise is converted into electrical signals by two or more first to Nth noise microphones 22 placed
around the voice microphone 21. The voice feature extraction unit 23 is a converter that converts
the electrical signal obtained from the voice microphone 21 into time-series feature quantities
that express acoustic features in time series. The first to Nth noise feature extraction units 24
are converters that convert the electrical signals of the two or more first to Nth noise
microphones 22 into time-series feature quantities that express acoustic features in time series,
and output first to Nth time-series feature vectors of noise. The voice feature extraction unit 23
and the first to Nth noise feature extraction units 24 have the same functions as the voice
feature extraction unit 3 in FIG. 1. The similarity calculation unit 25 calculates and outputs the
first to Nth similarities between the first to Nth time-series feature vectors of noise output
from the first to Nth noise feature extraction units 24 and the time-series feature vector of
voice output from the voice feature extraction unit 23. As one way of obtaining the first to Nth
similarities, for example, letting X(t) be the time-series feature vector of the voice obtained
from the voice microphone 21, Yi(t) the first to Nth time-series feature vectors of noise obtained
from the first to Nth noise feature extraction units 24, and βi(t) the similarity to be obtained,
[0025]
It can be determined by [equation not reproduced]. There are many other ways of determining the
degree of similarity; for example, it can also be determined by a method using an inner product of
vectors as described in the cited document [2]. The maximum value detection unit 26 selects the
largest n-th similarity among the first to Nth similarities output by the similarity calculation
unit 25. The selection unit 27 selects and outputs the n-th time-series feature vector of noise
corresponding to the n-th similarity selected by the maximum value detection unit 26 from among
the first to Nth time-series feature vectors of noise. The 2-input subtraction unit 28 has the
same function as the 2-input subtraction unit 7 in FIG. 1, and performs two-input spectral
subtraction by subtracting the n-th time-series feature vector of noise output by the selection
unit 27 from the time-series feature vector of voice output by the voice feature extraction unit
23. That is, in the third invention, the time-series feature vector from the n-th noise microphone
that receives the noise having the highest correlation with the noise input to the voice
microphone 21 is always used, so the noise removal effect is best, and there is an effect that
higher noise removal performance can be obtained than when two-input spectral subtraction is
performed using one noise microphone.
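The similarity calculation and maximum value detection can be illustrated with the inner-product
approach that the text mentions as one alternative. The expression for βi(t) is not available in
this text, so the normalized inner product (cosine similarity) used below is an assumption and may
differ from the patent's own definition. The names cosine_similarity and select_most_similar are
hypothetical, and the returned index is 0-based.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Normalized inner product of two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0.0 or ny == 0.0:
        return 0.0
    return dot / (nx * ny)


def select_most_similar(voice_frame: Sequence[float],
                        noise_frames: Sequence[Sequence[float]]) -> Tuple[int, List[float]]:
    """Return the 0-based index of the largest similarity and all similarities."""
    sims = [cosine_similarity(voice_frame, y) for y in noise_frames]
    n = max(range(len(sims)), key=sims.__getitem__)
    return n, sims


X = [1.0, 2.0, 3.0]                  # X(t): voice microphone features
Y = [[3.0, 2.0, 1.0],                # Y1(t)
     [2.0, 4.0, 6.0],                # Y2(t): same spectral shape as X(t)
     [1.0, 0.0, 0.0]]                # Y3(t)
n, beta = select_most_similar(X, Y)
```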
[0026]
The operation of the fourth invention will be described with reference to FIG. 4. In FIG. 4, in
addition to the configuration of the noise removal apparatus shown in FIG. 3, a weight addition
unit 29 is provided that applies predetermined weights to the first to Nth similarities output by
the similarity calculation unit 25 to obtain weighted first to Nth similarities. The maximum value
detection unit 26 is configured to select the largest n-th similarity among the weighted first to
Nth similarities output by the weight addition unit 29. That is, according to the fourth aspect of
the present invention, by weighting the first to Nth similarities, the selection can place
particular emphasis on the input from a specific noise microphone. Thus, for example, the input
from a noise microphone 22 installed near the voice microphone 21 is given a larger weight, and
the input from a noise microphone 22 located far from the voice microphone 21 is given a smaller
weight. The emphasis is then on the input from the nearby noise microphone 22, which is likely to
receive noise highly correlated with the ambient noise input to the voice microphone 21, and there
is an effect that higher noise removal performance is obtained than with conventional two-input
spectral subtraction. Alternatively, for example, the input from a noise microphone 22 installed
near the voice microphone 21 is given a smaller weight, so that the input from a distant noise
microphone 22, into which voice is less likely to be mixed, is emphasized. There is then an effect
that it is possible to prevent the deterioration of the recognition rate and the lowering of the
communication intelligibility due to the mixing of the voice into the noise microphone.
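The effect of the weighting on the selection can be shown with a small Python sketch. The text does
not say whether the weight is added to or multiplied with the similarity; the sketch assumes
multiplication, and the name select_weighted and the example values are hypothetical.

```python
from typing import Sequence


def select_weighted(similarities: Sequence[float], weights: Sequence[float]) -> int:
    """Return the 0-based index of the largest weighted similarity.

    Assumption: weights are applied multiplicatively.
    """
    weighted = [w * s for w, s in zip(weights, similarities)]
    return max(range(len(weighted)), key=weighted.__getitem__)


beta = [0.90, 0.85, 0.40]            # similarities for noise microphones 1 to 3

# Equal weights reproduce the unweighted maximum value detection.
unweighted_choice = select_weighted(beta, [1.0, 1.0, 1.0])

# De-emphasizing microphone 1 (for example, one close to the speaker that
# picks up voice) shifts the choice to microphone 2.
weighted_choice = select_weighted(beta, [0.5, 1.0, 1.0])
```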
[0027]
The operation of the fifth aspect of the invention will be described with reference to FIG. The
noisy voice is converted into an electrical signal by the voice microphone 41, and at the same
time, ambient noise is converted into electrical signals by the two or more first to Nth noise
microphones 42. The voice feature extraction unit 43 is a converter that converts the electrical
signal obtained from the voice microphone 41 into time-series feature quantities that represent
acoustic features in time series. The voice feature extraction unit 43 has the same function as
the voice feature extraction unit 3 in FIG. 1. The voice partial feature extraction unit 44 is a
converter that converts the electrical signal obtained from the voice microphone 41 into
time-series feature quantities that represent acoustic features of a partial band in time series.
For example, a partial frequency band selected from the analysis result of a BPF, DFT, or the like
is output as the feature vector of the voice sub-band. The sub-band features may also be other
analysis results, such as the cepstrum analysis described in the cited document [2], or feature
values compressed by the KL transform or the like. The first to Nth partial feature extraction
units 45 are converters that convert the electrical signals of the two or more first to Nth noise
microphones 42 into time-series feature quantities that represent acoustic features of a partial
band in time series, and output first to Nth time-series feature vectors of the noise sub-band.
The first to Nth partial feature extraction units 45 have the same function as the voice partial
feature extraction unit 44. The sub-band similarity calculation unit 46 calculates and outputs the
first to Nth similarities between the first to Nth time-series feature vectors of the noise
sub-band output from the first to Nth partial feature extraction units 45 and the time-series
feature vector of the voice sub-band output from the voice partial feature extraction unit 44. The
maximum value detection unit 47 selects the largest n-th similarity among the first to Nth
similarities output from the sub-band similarity calculation unit 46. The selection unit 48
selects and outputs, from among the output signals of the two or more first to Nth noise
microphones 42, the output signal of the n-th noise microphone corresponding to the n-th
similarity selected by the maximum value detection unit 47. The output signal of the n-th noise
microphone obtained from the selection unit 48 is converted into a time-series feature vector of
noise in the noise feature extraction unit 49. The noise feature extraction unit 49 has the same
function as the voice feature extraction unit 3 in FIG. 1. The 2-input subtraction unit 50 has the
same function as the 2-input subtraction unit 7 in FIG. 1, and performs two-input spectral
subtraction by subtracting the time-series feature vector of noise output by the noise feature
extraction unit 49 from the time-series feature vector of voice output by the voice feature
extraction unit 43.
That is, the fifth invention always performs 2-input spectral subtraction using the output signal
03-05-2019
12
of the noise microphone into which noise having the highest correlation with the feature vector
of the noise sub-band input to the voice microphone 41 is input. The noise removal effect is
therefore maximized, and higher noise removal performance is obtained than with conventional
two-input spectral subtraction using a single noise microphone. In particular, when the band in
which noise is present is known in advance to be limited, noise can be removed more accurately by
setting the sub-band in advance to that band.
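The selection step of the fifth invention can be sketched as follows. This is only an
illustration under stated assumptions: each feature vector is taken to be a list of per-band
powers for one frame, the sub-band is given as a range of band indices, and the similarity is
cosine similarity. The text does not specify the similarity measure, and all function names are
hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def select_noise_mic_by_subband(voice_vec, noise_vecs, band):
    """Return the index n of the noise microphone whose sub-band features
    are most similar to the voice microphone's sub-band features."""
    lo, hi = band  # hypothetical convention: sub-band is the index range [lo, hi)
    voice_sub = voice_vec[lo:hi]
    sims = [cosine_similarity(voice_sub, v[lo:hi]) for v in noise_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

voice = [1.0, 2.0, 8.0, 6.0, 1.0]
noise_mics = [
    [5.0, 5.0, 1.0, 1.0, 5.0],  # mic 0: sub-band shape differs from the voice mic
    [0.0, 0.0, 4.0, 3.0, 0.0],  # mic 1: sub-band shape matches the voice mic
]
n = select_noise_mic_by_subband(voice, noise_mics, band=(2, 4))
```

Only bands 2 and 3 take part in the comparison, so microphone 1 is chosen even though its
full-band spectrum looks quite different from that of the voice microphone.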
[0028]
The operation of the sixth aspect of the invention will be described with reference to FIG. 6. The
apparatus of FIG. 6 has a minimum value detection unit 30 in place of the maximum value detection
unit 26 in the noise removal apparatus shown in FIG. 3, and the minimum value detection unit 30
selects the smallest n-th similarity among the input first to Nth similarities. That is, according
to the sixth aspect of the invention, 2-input spectral subtraction is performed using the output
signal of the noise microphone 22 having the lowest degree of similarity with the input signal of
the voice microphone 21, so that the output signal of the noise microphone 22 with the least voice
wraparound is selected. This has the effect of preventing a drop in voice recognition rate or
intelligibility of communication caused by subtracting the voice itself. Although FIG. 6 shows an
example applied to the third invention, the same configuration can be applied to the fourth or
fifth invention.
[0029]
The operation of the seventh invention will be described with reference to FIG. 7. The voice
containing noise is converted to an electrical signal by the voice microphone 61, and at the same
time, ambient noise is converted to an electrical signal by the two or more first to Nth noise
microphones 62. The voice feature extraction unit 63 is a converter that converts the electrical
signal obtained from the voice microphone 61 into time-series feature quantities that express
acoustic features in time series.
[0030]
The first to Nth noise feature extraction unit 64 is a converter that converts the electric signals of
the two or more first to Nth microphones 62 into time-series feature quantities that express
acoustic features in time series, and outputs first to Nth time-series feature vectors of noise. The
voice feature extraction unit 63 and the first to Nth noise feature extraction units 64 have the
same functions as the voice feature extraction unit 3 in FIG. 1. The first to Nth time-series feature
vectors of noise obtained from the first to Nth noise feature extraction units 64 are averaged by
the average value combining unit 65 and output as a combined vector of noise. That is, assuming
that the time-series feature vectors obtained from the two or more first to Nth noise microphones
62 are Yi (t), and the combined vector of noise is M (t), the average value combining unit 65 at time t
[0031]
performs an averaging operation to calculate and output the combined vector M (t) of the
time-series feature vectors obtained from the two or more first to Nth noise microphones 62. As a
method of obtaining the average value, in addition to such calculation, a geometric average can
be used, or a centroid (pattern center) described in the cited document [2] can be used. The
2-input subtraction unit 66 has the same function as the 2-input subtraction unit 7 in FIG. 1 and
performs two-input spectral subtraction by subtracting the combined vector of noise output from
the average value combining unit 65 from the time-series feature vector of the voice output from
the voice feature extraction unit 63. That is, according to the seventh aspect of the present
invention, two-input spectral subtraction is performed using an average vector of the time-series
feature vectors obtained from the two or more first to Nth noise microphones 62. Noise that is
input to more of the first to Nth noise microphones 62 is therefore reflected more strongly in the
combined vector, whereas noise input only to a specific noise microphone is not greatly reflected
in the combined vector because of the averaging operation. This has the effect of reducing the
removal error caused by noise input only to a specific noise microphone.
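The averaging step can be illustrated with a small sketch. The original equation is not present
in this text, so the arithmetic mean M(t) = (1/N) * (Y1(t) + ... + YN(t)) used below is an
assumption based on the surrounding description; the geometric mean is the alternative the text
mentions. Function names and sample values are hypothetical.

```python
import math

def arithmetic_mean_vector(noise_vecs):
    """Element-wise arithmetic mean of N noise feature vectors (assumed form of M(t))."""
    n = len(noise_vecs)
    return [sum(col) / n for col in zip(*noise_vecs)]

def geometric_mean_vector(noise_vecs):
    """Element-wise geometric mean; assumes non-negative feature values."""
    n = len(noise_vecs)
    return [math.prod(col) ** (1.0 / n) for col in zip(*noise_vecs)]

# Three noise microphones, four bands; only microphone 2 picks up a burst in band 3.
Y = [
    [2.0, 4.0, 1.0, 1.0],
    [2.0, 4.0, 1.0, 1.0],
    [2.0, 4.0, 1.0, 7.0],
]
M = arithmetic_mean_vector(Y)  # band 3 becomes 3.0 instead of 7.0
```

Noise common to all microphones passes through unchanged, while the burst seen by a single
microphone is attenuated by the averaging, which is the effect the paragraph above describes.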
[0032]
The operation of the eighth invention will be described with reference to FIG. 8. FIG. 8 includes a
weighted average value combining unit 67 instead of the average value combining unit 65 shown
in FIG. 7. The weighted average value combining unit 67 applies predetermined weights to the
first to Nth time-series feature vectors of noise output from the first to Nth noise feature
extraction units, averages them, and outputs the averaged feature vector as a combined vector of
noise. That is, since the eighth invention can particularly emphasize the input from a specific
noise microphone by applying a weight, the same effect as that of the fourth invention is
obtained. In addition, by using the average vector of the time-series feature vectors obtained
from the two or more first to Nth noise microphones 62, the same effect as that of the
seventh invention is provided.
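A weighted average of this kind might look like the sketch below. It assumes the weights are
multiplied into each microphone's feature vector and the result is normalized by the sum of the
weights; the text only says that a predetermined weight is applied before averaging, so both
choices, like the function name, are assumptions.

```python
def weighted_mean_vector(noise_vecs, weights):
    """Element-wise weighted mean of noise feature vectors.

    weights[i] is the (hypothetical) predetermined weight for noise microphone i.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [
        sum(w * x for w, x in zip(weights, col)) / total
        for col in zip(*noise_vecs)
    ]

Y = [
    [1.0, 2.0],  # microphone 0, emphasized with weight 3
    [3.0, 6.0],  # microphone 1, weight 1
]
M = weighted_mean_vector(Y, weights=[3.0, 1.0])
```

With equal weights this reduces to the plain average of the seventh invention.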
[0033]
The operation of the ninth invention will be described with reference to FIG. 9. The voice
containing noise is converted to an electrical signal by the voice microphone 81, and at the same
time, ambient noise is converted to electrical signals by two or more first to Nth noise
microphones 82 installed around the voice microphone 81. The voice feature extraction unit 83 is a
converter that converts the electrical signal obtained from the voice microphone 81 into
time-series feature quantities that represent acoustic features in time series. The first to Nth
noise feature extraction units 84 are converters that convert the electrical signals of the two or
more first to Nth noise microphones 82 into time-series feature quantities that express acoustic
features in time series, and output first to Nth time-series feature vectors of noise. The voice
feature extraction unit 83 and the first to Nth noise feature extraction units 84 have the same
functions as the voice feature extraction unit 3 in FIG. 1. The first to Nth time-series feature
vectors of noise output from the first to Nth noise feature extraction units 84 are each divided
into a plurality of bands in the division unit 85 and output. The minimum value combining unit 86
takes the minimum power for each band of the band-divided time-series feature vectors of noise
output from the division unit 85, combines the minimum values of the bands, and outputs the result
as a combined vector of noise. The 2-input subtraction unit 87 has the same function as the
2-input subtraction unit 7 in FIG. 1 and performs two-input spectral subtraction by subtracting
the combined vector of noise output by the minimum value combining unit 86 from the time-series
feature vector of the voice output by the voice feature extraction unit 83. That is, when speech
is uttered in an environment in which the transfer characteristics differ for each band, the
amount of voice wraparound into the noise microphones is considered to differ for each band and
for each noise microphone. In such a case, according to the ninth invention, the first to Nth
time-series feature vectors of noise are divided into a plurality of bands, the one having the
minimum power is selected for each band, and the minimum values of the bands are combined and
output. The feature vector of noise is thus always synthesized from the feature values of the
particular band of the particular noise microphone with the smallest amount of voice wraparound,
which has the effect of preventing deterioration of the recognition rate and of communication
intelligibility.
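The band-wise minimum combination can be sketched as follows, assuming each feature vector is a
list of per-channel powers for one frame and that the bands are contiguous index ranges covering
the whole vector. The band layout and the function name are hypothetical.

```python
def min_power_combine(noise_vecs, bands):
    """For each band, copy the slice from the noise microphone whose total
    power in that band is smallest, and concatenate the slices."""
    combined = []
    for lo, hi in bands:  # each band is the index range [lo, hi)
        best = min(noise_vecs, key=lambda v: sum(v[lo:hi]))
        combined.extend(best[lo:hi])
    return combined

noise = [
    [1.0, 2.0, 9.0, 9.0],  # microphone 0: voice leaks into the upper band
    [5.0, 5.0, 3.0, 2.0],  # microphone 1: voice leaks into the lower band
]
M = min_power_combine(noise, bands=[(0, 2), (2, 4)])
```

The combined vector takes the lower band from microphone 0 and the upper band from microphone 1,
so neither leaked component reaches the subtraction stage.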
[0034]
The operation of the tenth invention will be described with reference to FIG. 10. FIG. 10 includes, in
addition to the configuration of the noise removal apparatus shown in FIG. 1, a noise section
detection unit 8 that detects, as a noise section, a section in which no voice exists, using the
output signal obtained from the voice microphone 1. The noise detection unit 5 is configured to
select the n-th time-series feature vector of noise using the first to Nth time-series feature
vectors of noise in the noise section detected by the noise section detection unit 8. That is, in
addition to the effect of the first invention, the tenth invention selects one of the first to Nth
time-series feature vectors of noise using a noise section in which no speech is mixed, so noise
can be estimated more correctly and the noise removal effect is enhanced.
[0035]
The operation of the eleventh invention will be described with reference to FIG. 11. FIG. 11
includes, in addition to the configuration of the noise removal apparatus shown in FIG. 2, a noise
section detection unit 18 that detects, as a noise section, a section in which no voice is
present, using the output signal obtained from the voice microphone 11. The minimum power
detection unit 14 is configured to select the output signal of the n-th noise microphone using the
output signals of the first to Nth noise microphones in the noise section detected by the noise
section detection unit 18. That is, in addition to the effect of the second invention, the
eleventh invention selects one of the outputs of the first to Nth noise microphones using a noise
section in which no voice is mixed, so noise can be estimated more correctly and the noise removal
effect is enhanced.
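The following sketch illustrates why restricting the minimum-power selection to the noise section
can change the result. The noise section detector here is a simple power threshold on the voice
microphone, which is an assumption; the text does not describe the detection method. All names and
values are hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)

def detect_noise_frames(voice_frames, threshold):
    """Mark frames whose voice-microphone power is below the threshold
    as belonging to the noise section (assumed detector)."""
    return [frame_power(f) < threshold for f in voice_frames]

def select_min_power_mic(noise_frames_per_mic, is_noise):
    """Index of the noise microphone with the smallest mean power,
    measured only over frames flagged as noise section."""
    def mean_power(frames):
        powers = [frame_power(f) for f, keep in zip(frames, is_noise) if keep]
        if not powers:
            raise ValueError("no noise-section frames available")
        return sum(powers) / len(powers)
    return min(range(len(noise_frames_per_mic)),
               key=lambda i: mean_power(noise_frames_per_mic[i]))

voice_frames = [[0.1, -0.1], [2.0, -2.0], [0.1, 0.1]]  # speech only in frame 1
noise_mics = [
    [[1.0, 1.0], [0.0, 0.0], [1.0, 1.0]],  # microphone 0
    [[0.5, 0.5], [3.0, 3.0], [0.5, 0.5]],  # microphone 1: voice leaks in during frame 1
]
is_noise = detect_noise_frames(voice_frames, threshold=0.5)
n = select_min_power_mic(noise_mics, is_noise)
n_all = select_min_power_mic(noise_mics, [True, True, True])
```

Measured over all frames, microphone 0 has the lower power because leaked voice inflates
microphone 1; measured over the noise section only, microphone 1 is selected.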
[0036]
The operation of the twelfth aspect of the invention will be described with reference to FIG. 12.
FIG. 12 has, in addition to the configuration of the noise removal apparatus shown in FIG. 3, a
noise section detection unit 31 which detects, as a noise section, a section in which no voice
exists, using the output signal obtained from the voice microphone 21. The similarity calculation
unit 25 is configured to calculate and output the first to Nth similarities using the first to Nth
time-series feature vectors of noise in the noise section detected by the noise section detection
unit 31. Although FIG. 12 shows an example applied to FIG. 3, the same configuration can be
applied to the embodiment shown in FIG. 4. That is, in addition to the effects of the third or
fourth invention, the twelfth invention selects one of the first to Nth time-series feature
vectors of noise using a noise section in which no voice is mixed, so noise can be estimated more
correctly and the noise removal effect is enhanced.
[0037]
The operation of the thirteenth invention will be described with reference to FIG. 13. FIG. 13
includes, in addition to the configuration of the noise removal apparatus shown in FIG. 5, a noise
section detection unit 51 that detects, as a noise section, a section in which no voice exists,
using the output signal obtained from the voice microphone 41. The sub-band similarity calculation
unit 46 is configured to calculate and output the first to Nth similarities using the first to Nth
time-series feature vectors of the noise sub-band in the noise section detected by the noise
section detection unit 51. That is, in addition to the effect of the fifth invention, the
thirteenth invention selects one of the outputs of the first to Nth noise microphones using a
noise section in which no voice is mixed, so noise can be estimated more correctly and the noise
removal effect is enhanced.
[0038]
The operation of the fourteenth invention will be described with reference to FIG. 14. FIG. 14 is
configured such that, in the noise removal apparatus shown in FIG. 10, the noise section
detection unit 9 detects a section in which no voice is present as a noise section using the feature
vector output from the 2-input subtraction unit 7. Although FIG. 14 shows an example
applied to FIG. 10, the same configuration can be applied to the noise removal device shown in
FIG. 11, FIG. 12 or FIG. 13. That is, in the fourteenth invention, in addition to the effects of
the tenth, eleventh, or thirteenth invention, the noise section is estimated using the clear
time-series feature vector obtained after noise removal, so the detection accuracy of the noise
section is improved, which enables more accurate noise removal.
[0039]
The operation of the fifteenth invention will be described with reference to FIG. 15. FIG. 15
includes, in addition to the configuration of the noise removal apparatus shown in FIG. 3, a noise
section detection unit 31 that detects, as a noise section, a section in which no voice exists,
using the output signal obtained from the voice microphone 21, and, in place of the maximum value
detection unit 26, a maximum/minimum value detection unit 32. The maximum/minimum value detection
unit 32 selects the maximum similarity among the first to Nth similarities in the noise section
detected by the noise section detection unit 31, and selects the minimum similarity among the
first to Nth similarities when the noise section detection unit 31 has not detected a noise
section. The noise section detection unit 31 can also be configured to detect a section in which
no voice is present as a noise section using the feature vector output from the 2-input
subtraction unit 28. Although FIG. 15 shows an example
applied to FIG. 3, the same configuration can be applied to the noise removal apparatus shown in
FIG. 4 or FIG. 5. That is, according to the fifteenth invention, in addition to the effects of the
third, fourth or fifth invention, the output signal of the noise microphone that is least similar
to the output signal of the voice microphone is selected in sections other than the noise section,
that is, in sections where voice is present. As a result, the output signal from the noise
microphone with the smallest amount of voice wraparound can be selected, which prevents the
deterioration of the recognition rate and the decrease in communication intelligibility caused by
the mixing of the voice into the noise microphone.
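The switching behavior of the maximum/minimum value detection unit can be sketched as below,
assuming the similarities for the current frame are already available as a list and the noise
section decision is a boolean. The function name is hypothetical.

```python
def select_by_section(similarities, in_noise_section):
    """Pick the most similar noise microphone inside a noise section,
    and the least similar one while voice is present."""
    pick = max if in_noise_section else min
    return pick(range(len(similarities)), key=similarities.__getitem__)

sims = [0.2, 0.9, 0.5]  # similarity of each noise microphone to the voice microphone
n_noise = select_by_section(sims, in_noise_section=True)
n_voice = select_by_section(sims, in_noise_section=False)
```

During a noise section the microphone that best matches the noise at the voice microphone is
used; during speech, high similarity more likely indicates leaked voice, so the least similar
microphone is used instead.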
[0040]
Next, an embodiment of the present invention will be described with reference to the drawings.
[0041]
FIG. 1 is a block diagram showing an embodiment of the first invention.
The noise removing device shown in FIG. 1 comprises: a voice microphone 1 that mainly receives
voice; two or more first to Nth noise microphones 2 that mainly receive ambient noise and are
arranged around the voice microphone 1; a voice feature extraction unit 3 that converts the output
signal of the voice microphone 1 into a time-series feature vector of voice; first to Nth noise
feature extraction units 4 that convert the output signals of the two or more first to Nth noise
microphones 2 into first to Nth time-series feature vectors of noise; a noise detection unit 5
that selects the n-th time-series feature vector (n = 1 to N) from among the first to Nth
time-series feature vectors of noise obtained from the first to Nth noise feature extraction
units 4; a selection unit 6 that selects and outputs the n-th time-series feature vector of noise
selected by the noise detection unit 5; and a two-input subtraction unit 7 that subtracts the n-th
time-series feature vector of noise output by the selection unit 6 from the time-series feature
vector of voice output by the voice feature extraction unit 3.
[0042]
The voice containing noise is converted into an electrical signal by the voice microphone 1. At
the same time, ambient noise is converted into an electrical signal by two or more first to Nth
noise microphones 2 placed around the voice microphone 1. There are many ways to install the
two or more first to Nth noise microphones 2. For example, they may be arranged
around the voice microphone at an appropriate distance, or, to cope with noise
coming from all directions, they may be arranged radially, or they may be directed toward a
specific noise source. The electrical signal obtained from the voice microphone 1 is converted
into a time-series feature vector of voice in the voice feature extraction unit 3, and the
electrical signals obtained from the two or more first to Nth noise microphones 2 are converted
into first to Nth time-series feature vectors of noise in the first to Nth noise feature
extraction units 4. The noise detection unit 5 selects the n-th time-series feature vector of
noise closest to the ambient noise from among the first to Nth time-series feature vectors of
noise obtained from the first to Nth noise feature extraction units 4. The selection unit 6
selects and outputs the n-th time-series feature vector of noise selected by the noise detection
unit 5. The 2-input subtraction unit 7 performs 2-input spectral subtraction by subtracting the
n-th time-series feature vector of noise output from the selection unit 6 from the time-series
feature vector of the noise-containing voice output from the voice feature extraction unit 3,
thereby removing the noise contained in the voice. The 2-input subtraction unit 7 has the same
function as the 2-input subtraction unit 205 shown in FIG. 16.
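The subtraction stage itself can be sketched in a few lines. This is a simplified illustration,
not the method of the cited document [1]: it assumes the feature vectors are per-band powers,
subtracts them element by element for each frame, and clips negative results to a floor value,
which is a common safeguard but is not stated in the text. The function name is hypothetical.

```python
def two_input_subtraction(voice_frames, noise_frames, floor=0.0):
    """Subtract the selected noise feature vector from the voice feature
    vector frame by frame, clipping each band at `floor` (assumed)."""
    return [
        [max(v - n, floor) for v, n in zip(voice_vec, noise_vec)]
        for voice_vec, noise_vec in zip(voice_frames, noise_frames)
    ]

voice = [[5.0, 3.0, 1.0], [4.0, 4.0, 4.0]]  # noisy voice, two frames of three bands
noise = [[1.0, 1.0, 2.0], [1.0, 0.5, 0.0]]  # selected n-th noise feature vectors
clean = two_input_subtraction(voice, noise)
```

The third band of the first frame would go negative after subtraction and is clipped to the
floor instead.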
[0043]
FIG. 2 is a block diagram showing an embodiment of the second invention. The noise removing
device shown in FIG. 2 includes: a voice microphone 11 that mainly receives voice; two or more
first to Nth noise microphones 12 that mainly receive ambient noise and are arranged around the
voice microphone 11; a voice feature extraction unit 13 that converts the output signal of the
voice microphone 11 into a time-series feature vector of voice; a minimum power detection unit 14
that selects the output signal of the n-th noise microphone having the smallest power among the
output signals of the two or more first to Nth noise microphones 12; a selection unit 15 that
selects and outputs the output signal of the n-th noise microphone selected by the minimum power
detection unit 14; a noise feature extraction unit 16 that converts the output signal of the n-th
noise microphone output by the selection unit 15 into a time-series feature vector of noise; and a
two-input subtraction unit 17 that subtracts the time-series feature vector of noise output by the
noise feature extraction unit 16 from the time-series feature vector of voice output by the voice
feature extraction unit 13.
[0044]
The voice containing noise is converted into an electrical signal by the voice microphone 11. At
the same time, ambient noise is converted into electrical signals by two or more first to Nth noise
microphones 12 placed around the voice microphone 11. The electrical signal obtained from the
voice microphone 11 is converted into a time-series feature vector of voice in the voice feature
extraction unit 13. The voice feature extraction unit 13 has the same function as the voice
feature extraction unit 3 in FIG. 1. The minimum power detection unit 14 selects an output signal of
the nth noise microphone with the smallest power among the output signals of the two or more
first to Nth noise microphones 12. The selection unit 15 selects and outputs the output signal of
the n-th noise microphone selected by the minimum power detection unit 14. The output signal of
the n-th noise microphone output from the selection unit 15 is converted into a time-series
feature vector of noise in the noise feature extraction unit 16. The 2-input subtraction unit 17
has the same function as the 2-input subtraction unit 7 in FIG. 1, and performs 2-input spectral
subtraction by subtracting the time-series feature vector of noise output from the noise feature
extraction unit 16 from the time-series feature vector of voice output from the voice feature
extraction unit 13.
[0045]
FIG. 3 is a block diagram showing an embodiment of the third invention. The noise removal
apparatus shown in FIG. 3 includes: a voice microphone 21 that mainly receives voice; two or
more first to Nth noise microphones 22 that mainly receive ambient noise and are arranged around
the voice microphone 21; a voice feature extraction unit 23 that converts the output signal of the
voice microphone 21 into a time-series feature vector of voice; first to Nth noise feature
extraction units 24 that convert the output signals of the two or more first to Nth noise
microphones 22 into first to Nth time-series feature vectors of noise, respectively; a similarity
calculation unit 25 that calculates and outputs the first to Nth similarities between the first to
Nth time-series feature vectors of noise output from the first to Nth noise feature extraction
units 24 and the time-series feature vector of voice output from the voice feature extraction
unit 23; a maximum value detection unit 26 that selects the largest n-th similarity among the
first to Nth similarities output by the similarity calculation unit 25; a selection unit 27 that
selects and outputs, from among the first to Nth time-series feature vectors of noise, the n-th
time-series feature vector of noise corresponding to the n-th similarity selected by the maximum
value detection unit 26; and a 2-input subtraction unit 28 that subtracts the n-th time-series
feature vector of noise output from the selection unit 27 from the time-series feature vector of
voice output from the voice feature extraction unit 23.
[0046]
The voice containing noise is converted into an electrical signal by the voice microphone 21. At
the same time, ambient noise is converted into electrical signals by two or more first to Nth noise
microphones 22 placed around the voice microphone 21. The electrical signal obtained from the
voice microphone 21 is converted into a time-series feature vector of voice in the voice
feature extraction unit 23, and the output signals of the two or more first to Nth noise
microphones 22 are converted into first to Nth time-series feature vectors of noise in the first
to Nth noise feature extraction units 24, respectively.
[0047]
The voice feature extraction unit 23 and the first to Nth noise feature extraction units 24 have
the same functions as the voice feature extraction unit 3 in FIG. 1. The similarity calculation
unit 25 calculates and outputs the first to Nth similarities between the first to Nth time-series
feature vectors of noise output from the first to Nth noise feature extraction units 24 and the
time-series feature vector of voice output from the voice feature extraction unit 23.
[0048]
The maximum value detection unit 26 selects the largest n-th similarity among the first to N-th
similarities output by the similarity calculation unit 25. The selection unit 27 selects and outputs
an nth time-series feature vector of noise corresponding to the nth similarity selected by the
maximum value detection unit 26 among the first to Nth time-series feature vectors of noise. The
2-input subtraction unit 28 has the same function as the 2-input subtraction unit 7 in FIG. 1, and
performs two-input spectral subtraction by subtracting the n-th time-series feature vector of
noise output by the selection unit 27 from the time-series feature vector of voice output by the
voice feature extraction unit 23.
[0049]
FIG. 4 is a block diagram showing an embodiment of the fourth invention. In addition to the
configuration of the embodiment shown in FIG. 3, the noise removal apparatus shown in FIG. 4
includes a weight addition unit 29 that adds predetermined weights to the first to Nth
similarities output by the similarity calculation unit 25, and the maximum value detection unit 26
is configured to select the largest n-th similarity among the weighted first to Nth similarities
output from the weight addition unit 29.
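A sketch of the weighted selection follows. It assumes the weights are multiplied into the
similarities before the maximum is taken; the text says only that a predetermined weight is
added, so the multiplicative form, the function name, and the sample values are assumptions.

```python
def select_weighted_max(similarities, weights):
    """Index of the noise microphone with the largest weighted similarity."""
    weighted = [s * w for s, w in zip(similarities, weights)]
    return max(range(len(weighted)), key=weighted.__getitem__)

sims = [0.8, 0.7]                                 # raw similarities of two noise microphones
n_plain = select_weighted_max(sims, [1.0, 1.0])   # no emphasis
n_biased = select_weighted_max(sims, [1.0, 1.5])  # microphone 1 emphasized
```

A weight above 1 lets a preferred microphone, for example one aimed at a known noise source, win
the selection even when its raw similarity is slightly lower.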
[0050]
FIG. 5 is a block diagram showing an embodiment of the fifth invention. The noise removal
apparatus shown in FIG. 5 includes a voice microphone 41 mainly inputting voice, and two or
more first to Nth noise microphones 42 that mainly receive ambient noise and are arranged around
the voice microphone 41; a voice feature extraction unit 43 that converts the output signal of the
voice microphone 41 into a time-series feature vector of voice; a voice partial feature extraction
unit 44 that converts the output signal of the voice microphone 41 into a time-series feature
vector of a voice sub-band; first to Nth partial feature extraction units 45 that convert the
output signals of the two or more first to Nth noise microphones 42 into first to Nth time-series
feature vectors of a noise sub-band; a sub-band similarity calculation unit 46 that calculates and
outputs the first to Nth similarities between the first to Nth time-series feature vectors of the
noise sub-band output by the first to Nth partial feature extraction units 45 and the time-series
feature vector of the voice sub-band output by the voice partial feature extraction unit 44; a
maximum value detection unit 47 that selects the largest n-th similarity among the first to Nth
similarities output from the sub-band similarity calculation unit 46; a selection unit 48 that
selects and outputs, from among the output signals of the two or more first to Nth noise
microphones 42, the output signal from the n-th noise microphone 42 corresponding to the n-th
similarity selected by the maximum value detection unit 47; a noise feature extraction unit 49
that converts the output signal from the n-th noise microphone 42 output by the selection unit 48
into a time-series feature vector of noise; and a two-input subtraction unit 50 that subtracts the
time-series feature vector of noise output from the noise feature extraction unit 49 from the
time-series feature vector of voice output from the voice feature extraction unit 43.
[0051]
The voice containing noise is converted into an electrical signal by the voice microphone 41. At
the same time, ambient noise is converted into electrical signals by the two or more first to Nth
noise microphones 42. The output signal obtained from the voice microphone 41 is converted
into a time-series feature vector of voice in the voice feature extraction unit 43, and at the same
time is converted into a time-series feature vector of the voice sub-band in the voice partial
feature extraction unit 44. The voice feature extraction unit 43 has the same function as the
voice feature extraction unit 3 in FIG. 1. The output signals of the two or more first to Nth
noise microphones 42 are converted into first to Nth time-series feature vectors of the noise
sub-band in the first to Nth partial feature extraction units 45, respectively. The sub-band
similarity calculation unit 46 calculates and outputs the first to Nth similarities between the
first to Nth time-series feature vectors of the noise sub-band output from the first to Nth
partial feature extraction units 45 and the time-series feature vector of the voice sub-band
output from the voice partial feature extraction unit 44. The maximum value detection unit 47
selects the largest n-th similarity among the first to Nth similarities output from the sub-band
similarity calculation unit 46. The selection unit 48 selects and outputs the output signal from
the n-th noise microphone 42 corresponding to the n-th similarity selected by the maximum value
detection unit 47 from among the output signals of the
two or more first to Nth noise microphones 42. The output signal from the n-th noise
microphone 42 obtained from the selection unit 48 is converted into a time-series feature vector
of noise in the noise feature extraction unit 49. The noise feature extraction unit 49 has the same
function as the voice feature extraction unit 3 in FIG. 1. The 2-input subtraction unit 50 has the
same function as the 2-input subtraction unit 7 in FIG. 1, and performs two-input spectral
subtraction by subtracting the time-series feature vector of noise output by the noise feature
extraction unit 49 from the time-series feature vector of voice output by the voice feature
extraction unit 43.
[0052]
FIG. 6 is a block diagram showing an embodiment of the sixth invention. The noise removal
apparatus shown in FIG. 6 has a minimum value detection unit 30 that obtains the smallest n-th
similarity among the input first to Nth similarities, instead of the maximum value detection unit
26 in the embodiment shown in FIG. 3. Although the present embodiment shows an example applied to
FIG. 3, the same configuration can be applied to the embodiment shown in FIG. 4 or FIG. 5.
[0053]
FIG. 7 is a block diagram showing an embodiment of the seventh invention. The noise removal
apparatus shown in FIG. 7 includes: a voice microphone 61 that mainly receives voice; two or
more first to Nth noise microphones 62 that mainly receive ambient noise and are disposed around
the voice microphone 61; a voice feature extraction unit 63 that converts the output signal of the
voice microphone 61 into a time-series feature vector of voice; first to Nth noise feature
extraction units 64 that convert the output signals of the two or more first to Nth noise
microphones 62 into first to Nth time-series feature vectors of noise, respectively; an average
value combining unit 65 that averages the first to Nth time-series feature vectors of noise
obtained from the first to Nth noise feature extraction units 64 and outputs the result as a
combined vector of noise; and a 2-input subtraction unit 66 that subtracts the combined vector of
noise output by the average value combining unit 65 from the time-series feature vector of voice
output by the voice feature extraction unit 63.
[0054]
The voice containing noise is converted into an electrical signal by the voice microphone 61. At
the same time, ambient noise is converted into electrical signals by two or more first to Nth noise
microphones 62. An output signal of the voice microphone 61 is converted into a time-series
feature vector of voice in the voice feature extraction unit 63, and the output signals of the two
or more first to Nth noise microphones 62 are converted into first to Nth time-series feature
vectors of noise in the first to Nth noise feature extraction units 64. The voice feature extraction unit
63 and the first to Nth noise feature extraction units 64 have the same functions as the voice
feature extraction unit 3 in FIG. 1. The first to Nth time-series feature vectors of noise obtained
from the first to Nth noise feature extraction unit 64 are averaged by the average value
combining unit 65 and output as a combined vector of noise. The 2-input subtraction unit 66 has
the same function as the 2-input subtraction unit 7 in FIG. 1, and performs two-input spectral
subtraction by subtracting the combined vector of noise output from the average value combining
unit 65 from the time-series feature vector of voice output from the voice feature extraction
unit 63.
[0055]
FIG. 8 is a block diagram showing an embodiment of the eighth invention. The noise removal
apparatus shown in FIG. 8 has, instead of the average value combining unit 65 in the embodiment
shown in FIG. 7, a weighted average value combining unit 67 that applies predetermined weights to
the first to Nth time-series feature vectors of noise output from the first to Nth noise feature
extraction units, averages them, and outputs the averaged feature vector as a combined vector of
noise.
[0056]
FIG. 9 is a block diagram showing an embodiment of the ninth invention. The noise removing
device shown in FIG. 9 includes: a voice microphone 81 that mainly receives voice; two or more
first to Nth noise microphones 82 that mainly receive ambient noise and are arranged around the
voice microphone 81; a voice feature extraction unit 83 that converts the output signal of the
voice microphone 81 into a time-series feature vector of voice; first to Nth noise feature
extraction units 84 that convert the output signals of the two or more first to Nth noise
microphones 82 into first to Nth time-series feature vectors of noise, respectively; a division
unit 85 that divides each of the first to Nth time-series feature vectors of noise output from the
first to Nth noise feature extraction units 84 into a plurality of bands; a minimum value
combining unit 86 that extracts the minimum power for each band of the band-divided time-series
feature vectors of noise output from the division unit 85, combines the minimum values of the
bands, and outputs the result as a combined vector of noise; and a 2-input subtraction unit 87
that subtracts the combined vector of noise output from the minimum value combining unit 86 from
the time-series feature vector of voice output from the voice feature extraction unit 83.
[0057]
The voice containing noise is converted into an electrical signal by the voice microphone 81. At
the same time, ambient noise is converted into electrical signals by the two or more first to
Nth noise microphones 82 installed around the voice microphone 81. The output signal of the
voice microphone 81 is converted into a time-series feature vector of voice in the voice feature
extraction unit 83, and the output signals of the first to Nth noise microphones 82 are
converted into first to Nth time-series feature vectors of noise in the first to Nth noise
feature extraction units 84, respectively. The voice feature extraction unit 83 and the first to
Nth noise feature extraction units 84 have the same functions as the voice feature extraction
unit 3 in FIG. 1. The first to Nth time-series feature vectors of noise output from the first to
Nth noise feature extraction units 84 are each divided into a plurality of bands in the division
unit 85 and output. The minimum value combining unit 86 extracts the minimum power for each band
of the band-divided time-series feature vectors of noise output from the division unit 85,
combines the minimum values of the bands, and outputs the result as a combined vector of noise.
The 2-input subtraction unit 87 has the same function as the 2-input subtraction unit 7 in
FIG. 1: it performs two-input spectral subtraction by subtracting the combined vector of noise
output from the minimum value combining unit 86 from the time-series feature vector of speech
output by the speech feature extraction unit 83.
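The processing chain of the division unit 85, the minimum value combining unit 86 and the
2-input subtraction unit 87 can be sketched in Python as follows. This is a minimal sketch under
several assumptions that the description does not state: the feature vectors are non-negative
spectral powers, the bands are contiguous and of roughly equal width, the power of a band is the
sum of its elements, and negative results of the subtraction are clipped to zero. The function
names are hypothetical.

```python
import numpy as np

def combine_band_minimum(noise_features, num_bands):
    """Per frame and per band, keep the band of the quietest microphone.

    noise_features: array of shape (N, T, D), i.e. N noise microphones,
                    T frames, D spectral channels (assumed layout).
    Returns the combined vector of noise with shape (T, D).
    """
    noise = np.asarray(noise_features, dtype=float)
    n_mics, n_frames, n_dims = noise.shape
    frames = np.arange(n_frames)
    combined = np.empty((n_frames, n_dims))
    # Role of division unit 85: split the D channels into contiguous bands.
    for idx in np.array_split(np.arange(n_dims), num_bands):
        band = noise[:, :, idx]               # shape (N, T, band width)
        band_power = band.sum(axis=2)         # shape (N, T), assumed power measure
        quietest = band_power.argmin(axis=0)  # microphone index per frame
        # Role of minimum value combining unit 86: copy the minimum-power band.
        combined[:, idx] = band[quietest, frames, :]
    return combined

def two_input_subtraction(voice_features, noise_estimate):
    """Subtract the noise estimate; clipping at zero is an assumption."""
    voice = np.asarray(voice_features, dtype=float)
    return np.maximum(voice - noise_estimate, 0.0)

noise = [[[1.0, 2.0, 5.0, 5.0]],   # noise microphone 1, one frame
         [[3.0, 3.0, 1.0, 1.0]]]   # noise microphone 2, one frame
estimate = combine_band_minimum(noise, num_bands=2)
cleaned = two_input_subtraction([[4.0, 1.0, 3.0, 0.5]], estimate)
```

In this example the lower band is taken from microphone 1 and the upper band from microphone 2,
so the combined vector is never larger in band power than any single noise microphone.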
[0058]
FIG. 10 is a block diagram showing an embodiment of the tenth invention. In addition to the
configuration of the embodiment shown in FIG. 1, the noise removal apparatus shown in FIG. 10
includes a noise section detection unit 8 that detects a section in which no voice exists as a
noise section using a feature vector obtained from the voice microphone 1. The noise detection
unit 5 is configured to select the nth time-series feature vector of noise using the first to
Nth time-series feature vectors of noise in the noise section detected by the noise section
detection unit 8.
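The description above does not specify how the noise section detection unit 8 decides that no
voice is present. Purely as an illustration, the following Python sketch marks a frame as
belonging to a noise section when its total feature power falls below a threshold. The
frame-power criterion, the threshold value and the function name are all assumptions of this
example, not details taken from the invention.

```python
import numpy as np

def detect_noise_sections(voice_features, threshold):
    """Return a boolean mask over frames, True where no voice is assumed.

    voice_features: array of shape (T, D) of non-negative spectral powers
                    obtained from the voice microphone (assumed layout).
    threshold:      hypothetical frame-power threshold chosen by the user.
    """
    features = np.asarray(voice_features, dtype=float)
    frame_power = features.sum(axis=1)  # total power of each frame
    return frame_power < threshold

# Three frames: quiet, loud (voice assumed present), quiet.
mask = detect_noise_sections([[0.1, 0.2], [5.0, 6.0], [0.2, 0.1]], threshold=1.0)
```

A mask of this kind can then restrict the selection performed by the noise detection unit 5 to
frames in which only noise is present.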
[0059]
FIG. 11 is a block diagram showing an embodiment of the eleventh invention. In addition to the
configuration of the embodiment shown in FIG. 2, the noise removal apparatus shown in FIG. 11
includes a noise section detection unit 18 that detects a section in which no voice exists as a
noise section using an output signal obtained from the voice microphone 11. The minimum power
detection unit 14 is configured to select the output signal of the nth noise microphone using
the output signals of the first to Nth noise microphones in the noise section detected by the
noise section detection unit 18.
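The restriction of the minimum power detection unit 14 to noise sections can be sketched in
Python as follows. The sketch assumes that the unit selects the noise microphone whose mean
squared amplitude is lowest (an interpretation suggested by the unit's name) and that the noise
section is supplied as a boolean mask over samples; the function name and the data layout are
hypothetical.

```python
import numpy as np

def select_min_power_microphone(noise_signals, noise_mask):
    """Pick the noise microphone with the lowest power inside noise sections.

    noise_signals: array of shape (N, S), i.e. N noise microphones,
                   S samples (assumed layout).
    noise_mask:    boolean array of shape (S,), True inside a noise section.
    Returns the zero-based microphone index, or None if no noise section exists.
    """
    signals = np.asarray(noise_signals, dtype=float)
    mask = np.asarray(noise_mask, dtype=bool)
    if not mask.any():
        return None  # no noise section detected, so nothing to measure
    power = np.mean(signals[:, mask] ** 2, axis=1)  # mean power per microphone
    return int(np.argmin(power))

signals = [[1.0, 1.0, 10.0, 10.0],   # loud only while voice is present
           [2.0, 2.0, 0.0, 0.0]]
in_noise = [True, True, False, False]
chosen = select_min_power_microphone(signals, in_noise)
```

In this example the first microphone is chosen, although it has the higher power over the whole
signal: samples outside the noise section, where voice may leak into a noise microphone, do not
influence the selection.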
[0060]
FIG. 12 is a block diagram showing an embodiment of the twelfth invention. In addition to the
configuration of the embodiment shown in FIG. 3, the noise removal apparatus shown in FIG. 12
includes a noise section detection unit 31 that detects a section in which no voice exists as a
noise section using an output signal obtained from the voice microphone 21. The similarity
calculating unit 25 is configured to calculate and output the first to Nth similarities using
the first to Nth time-series feature vectors of noise in the noise section detected by the noise
section detection unit 31. Although the present embodiment shows an example applied to FIG. 3,
the same configuration can be applied to the embodiment shown in FIG.
[0061]
FIG. 13 is a block diagram showing an embodiment of the thirteenth invention. In addition to the
configuration of the embodiment shown in FIG. 5, the noise removal apparatus shown in FIG. 13
includes a noise section detection unit 51 that detects a section in which no voice is present
as a noise section using an output signal obtained from the voice microphone 41. The sub-band
similarity calculation unit 46 is configured to calculate and output the sub-band similarities
using the first to Nth time-series feature vectors of the noise sub-bands in the noise section
detected by the noise section detection unit 51.
[0062]
FIG. 14 is a block diagram showing an embodiment of the fourteenth invention. In the noise
removal apparatus shown in FIG. 14, based on the configuration of the embodiment shown in
FIG. 10, the noise section detection unit 9 is configured to detect a section without voice as a
noise section using a feature vector output by the 2-input subtraction unit 7. In the present
embodiment, an example applied to FIG. 10 is shown, but the same configuration can be applied to
the embodiment shown in FIG. 11, FIG. 12 or FIG.
[0063]
FIG. 15 is a block diagram showing an embodiment of the fifteenth invention. In addition to the
configuration of the embodiment shown in FIG. 3, the noise removal apparatus shown in FIG. 15
includes a noise section detection unit 31 that detects a section without voice as a noise
section using an output signal obtained from the voice microphone 21. Instead of the maximum
value detection unit 26, it has a maximum / minimum value detection unit 32 that selects the
largest of the first to Nth similarities in a noise section detected by the noise section
detection unit 31, and selects the lowest of the first to Nth similarities when the noise
section detection unit 31 does not detect a noise section. The noise section detection unit 31
can also be configured to detect a section in which no speech is present as a noise section
using the feature vector output from the 2-input subtraction unit 28. Although the present
embodiment shows an example applied to FIG. 3, the same configuration can be applied to the
embodiment shown in FIG. 4 or FIG.
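The switching behavior of the maximum / minimum value detection unit 32 can be illustrated with
a short Python sketch. It assumes that the first to Nth similarities are already available as
plain numbers and that the noise section decision is passed in as a flag; the function name and
the zero-based index convention are assumptions of this example.

```python
import numpy as np

def select_similarity_index(similarities, in_noise_section):
    """Choose which of the N similarities to use (hypothetical helper).

    similarities:     sequence of N similarity values, one per noise microphone.
    in_noise_section: True if the current section was detected as noise only.
    Returns the zero-based index of the largest similarity inside a noise
    section, and of the lowest similarity otherwise.
    """
    values = np.asarray(similarities, dtype=float)
    if values.size == 0:
        raise ValueError("at least one similarity is required")
    if in_noise_section:
        return int(np.argmax(values))
    return int(np.argmin(values))

scores = [0.2, 0.9, 0.5]
index_in_noise = select_similarity_index(scores, in_noise_section=True)
index_in_voice = select_similarity_index(scores, in_noise_section=False)
```

The same similarity values therefore lead to different selections depending on whether voice is
present.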
[0064]
As described above, the noise removing apparatus of the present invention estimates the noise
component contained in the voice microphone signal by using a plurality of noise microphones and
removes that component. It can therefore remove noise efficiently even for non-stationary noise
whose characteristics change temporally and spatially, and it can perform stable noise removal
without removing the necessary voice signal even when voice is mixed into a noise microphone.
[0065]
Brief description of the drawings
[0066]
FIG. 1 is a block diagram showing an embodiment of the first invention.
[0067]
FIG. 2 is a block diagram showing an embodiment of the second invention.
[0068]
FIG. 3 is a block diagram showing an embodiment of the third invention.
[0069]
FIG. 4 is a block diagram showing an embodiment of the fourth invention.
[0070]
FIG. 5 is a block diagram showing an embodiment of the fifth invention.
[0071]
FIG. 6 is a block diagram showing an embodiment of the sixth invention.
[0072]
FIG. 7 is a block diagram showing an embodiment of the seventh invention.
[0073]
FIG. 8 is a block diagram showing an embodiment of the eighth invention.
[0074]
FIG. 9 is a block diagram showing an embodiment of the ninth invention.
[0075]
FIG. 10 is a block diagram showing an embodiment of the tenth invention.
[0076]
FIG. 11 is a block diagram showing an embodiment of the eleventh invention.
[0077]
FIG. 12 is a block diagram showing an embodiment of the twelfth invention.
[0078]
FIG. 13 is a block diagram showing an embodiment of the thirteenth invention.
[0079]
FIG. 14 is a block diagram showing an embodiment of the fourteenth invention.
[0080]
FIG. 15 is a block diagram showing an embodiment of the fifteenth invention.
[0081]
FIG. 16 is a block diagram showing a conventional two-input spectral subtraction noise removal
apparatus.
[0082]
Explanation of reference signs
[0083]
1, 11, 21, 41, 61, 81, 201 voice microphone
2, 12, 22, 42, 62, 82, 202 noise microphone
3, 13, 23, 43, 63, 83, 203 voice feature extraction unit
4, 16, 24, 49, 64, 84 noise feature extraction unit
5 noise detection unit
6, 15, 27, 48 selection unit
7, 17, 28, 50, 66, 87, 205 2-input subtraction unit
8, 9, 18, 31, 51 noise section detection unit
14 minimum power detection unit
25 similarity calculation unit
26, 47 maximum value detection unit
29 weight addition unit
30 minimum value detection unit
32 maximum / minimum value detection unit
44 audio partial feature extraction unit
45 partial feature extraction unit
46 sub-band similarity calculation unit
65 average value combining unit
67 weighted average value combining unit
85 dividing unit
86 minimum value combining unit