JP2018191255

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2018191255
Abstract: [PROBLEMS] To provide a sound collection device and the like capable of extracting a target sound more accurately than in the prior art using the MVDR method. [SOLUTION] A sound collection device calculates a spatial correlation matrix for each frequency using microphone signals in the frequency domain of N channels, and from the spatial correlation matrix obtains estimated values of the strengths of incoming waves from K directions and estimated values of the noise power contained in each microphone signal. It obtains an estimated value kt of the arrival direction of the target sound and, using a matrix A(f)<H> consisting of K vectors ak(f) and an N × N identity matrix, and a matrix Vs(f, l) in which all elements other than the (kt, kt) element of the diagonal matrix whose diagonal components are the intensity estimates and the noise power estimates are set to 0, obtains an estimated value of the correlation matrix of the target sound and an estimated value of the correlation matrix of the non-target sound. It then obtains a filter coefficient vector using the estimated correlation matrices, applies the filter coefficient vector to the microphone signals, and obtains an output signal.
[Selected figure] Figure 2
Sound collection device, method thereof, and program
[0001]
The present invention relates to a sound collection device using a beamforming technique for
forming a beam using a plurality of microphones, a method thereof, and a program.
[0002]
In recent years there has been a growing need for technology that installs a plurality of microphones in a sound field to acquire multi-channel microphone signals, and from those signals extracts the target voice or sound (hereinafter also referred to as the target sound) as clearly as possible while removing noise and other voices (hereinafter also referred to as the non-target sound). For this purpose, beamforming techniques that form a beam using a plurality of microphones have been extensively researched and developed in recent years.
[0003]
In the beamforming technique, as shown in FIG. 1, a filtering unit 92-n applies a filter to each microphone signal yn(t) collected by the N microphones 91-n (where n = 1, 2, ..., N). Here, t is an index indicating time. Next, an adder 93 sums the output values of the filtering units 92-n, and the calculated sum is output as the output signal z(t) of the sound collection device. With such a configuration, noise can be significantly reduced and the target sound can be extracted more clearly. The minimum variance distortionless response method (MVDR method) is often used to obtain such a beamforming filter (see Non-Patent Document 1).
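For illustration only, the following is a minimal Python sketch of the filter-and-sum structure of FIG. 1, assuming time-domain FIR filter coefficients are already available; the function name, array shapes, and filter length are illustrative and not taken from the embodiment.

```python
import numpy as np

def filter_and_sum(y, h):
    """Filter-and-sum beamformer sketch.

    y : (N, T) array of microphone signals y_n(t)
    h : (N, L) array of FIR filter taps for each channel (assumed given)
    returns the output signal z(t) of length T
    """
    N, T = y.shape
    z = np.zeros(T)
    for n in range(N):
        # filtering unit 92-n: convolve channel n with its own filter
        z += np.convolve(y[n], h[n], mode="full")[:T]
    # adder 93: the per-channel outputs have been summed into z(t)
    return z
```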
[0004]
Habets, E., Benesty, J., Cohen, I., Gannot, S., Dmochowski, J., "New Insights Into the MVDR Beamformer in Room Acoustics," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 1, pp. 158-170, 2010.
[0005]
In order to use the MVDR method, it is necessary to appropriately estimate the correlation matrix of sounds other than the target sound (the non-target sound) and the transfer characteristics from the sound source position of the target sound to each microphone. However, components derived from the target sound and components derived from the non-target sound are mixed in the plurality of microphone signals, and the desired correlation matrix and transfer characteristics cannot be extracted from them directly. Therefore, the MVDR method alone cannot clearly extract the target voice from the microphone signals.
[0006]
Therefore, it is an object of the present invention to provide a sound collection device that estimates each correlation matrix from a microphone signal in which the target sound and the non-target sound are mixed and extracts the target sound using the MVDR method, as well as a method and program therefor.
[0007]
In order to solve the above problems, according to one aspect of the present invention, a sound collection device includes: a noise/arrival wave decomposition unit that, with N and K each being an integer of 2 or more, n = 1, 2, ..., N, and k = 1, 2, ..., K, calculates a spatial correlation matrix R(f, l) for each frequency using microphone signals Yn(f, l) in the frequency domain of N channels, and obtains, from the spatial correlation matrix R(f, l), estimated values pk(f, l) of the strengths of incoming waves from K directions and estimated values qn(f, l) of the noise power contained in each microphone signal Yn(f, l); a target sound determination unit that obtains an estimated value kt of the arrival direction of the target sound; a correlation matrix synthesis unit that, letting ak(f) be a vector consisting of the output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the microphone array from the k-th direction, obtains an estimated value R^T(f, l) of the correlation matrix of the target sound and an estimated value R^NT(f, l) of the correlation matrix of the non-target sound using a matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] consisting of the K vectors ak(f) and an N × N identity matrix IN, and a matrix Vs(f, l) in which all elements other than the (kt, kt) element of the diagonal matrix V(f, l) whose diagonal components are the intensity estimates pk(f, l) and the noise power estimates qn(f, l) are set to 0; and an array filtering unit that obtains a filter coefficient vector h(f, l) using the estimated values R^T(f, l) and R^NT(f, l) of the correlation matrices, applies the filter coefficient vector h(f, l) to the microphone signals Yn(f, l), and obtains an output signal z(f, l).
[0008]
In order to solve the above problems, according to another aspect of the present invention, a sound collection method includes: a noise/arrival wave decomposition step of, with N and K each being an integer of 2 or more, n = 1, 2, ..., N, and k = 1, 2, ..., K, calculating a spatial correlation matrix R(f, l) for each frequency using microphone signals Yn(f, l) in the frequency domain of N channels, and obtaining, from the spatial correlation matrix R(f, l), estimated values pk(f, l) of the strengths of incoming waves from K directions and estimated values qn(f, l) of the noise power contained in each microphone signal Yn(f, l); a target sound determination step of obtaining an estimated value kt of the arrival direction of the target sound; a correlation matrix synthesis step of, letting ak(f) be a vector consisting of the output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives at the microphone array from the k-th direction, obtaining an estimated value R^T(f, l) of the correlation matrix of the target sound and an estimated value R^NT(f, l) of the correlation matrix of the non-target sound using a matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] consisting of the K vectors ak(f) and an N × N identity matrix IN, and a matrix Vs(f, l) in which all elements other than the (kt, kt) element of the diagonal matrix V(f, l) whose diagonal components are the intensity estimates pk(f, l) and the noise power estimates qn(f, l) are set to 0; and an array filtering step of obtaining a filter coefficient vector h(f, l) using the estimated values R^T(f, l) and R^NT(f, l) of the correlation matrices, applying the filter coefficient vector h(f, l) to the microphone signals Yn(f, l), and obtaining an output signal z(f, l).
[0009]
According to the present invention, the target sound can be extracted with higher accuracy than
in the prior art by using the MVDR method.
[0010]
FIG. 1 is a functional block diagram of a sound collection device according to the prior art.
FIG. 2 is a functional block diagram of the sound collection device according to the first embodiment and the second embodiment.
FIG. 3 is a diagram showing an example of the processing flow of the sound collection device according to the first embodiment and the second embodiment.
[0011]
Hereinafter, embodiments of the present invention will be described.
In the drawings used in the following description, the same reference numerals are given to components having the same functions and to steps performing the same processing, and redundant description is omitted. In the following description, symbols such as ^ used in the text should be written directly above the character that precedes them, but due to limitations of text notation they are written immediately after that character. In the formulas, these symbols are written in their original positions. Moreover, processing performed on each element of a vector or matrix is applied to all elements of that vector or matrix unless otherwise noted.
[0012]
First Embodiment FIG. 2 shows a functional block diagram of the sound collection device 100
according to the first embodiment, and FIG. 3 shows its processing flow.
[0013]
The sound collection device 100 of the present embodiment receives an output signal
(microphone signal) yn (t) of a microphone array consisting of N microphones 91-n.
For example, the microphones 91-n are nondirectional microphone elements. N is any integer of 2 or more, and n = 1, 2, ..., N. The sound collection device 100 of this embodiment estimates the correlation matrix R^NT(f, l) of the non-target sound from the N-channel microphone signals yn(t), extracts the component of the target sound by the MVDR method, and outputs the extracted signal as the output signal z(t).
[0014]
The sound collection device 100 is configured by a computer provided with a CPU, a RAM, and a
ROM storing a program for executing the following processing, and is functionally configured as
follows.
[0015]
The sound collection device 100 includes a noise / arrival wave decomposition unit 101, a target
sound determination unit 103, a correlation matrix synthesis unit 104, an array filtering unit
105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.
[0016]
<Fourier Transform Unit 107> The Fourier transform unit 107 receives the N-channel time-domain microphone signals yn(t) as input, converts them by short-time Fourier transform into frequency-domain microphone signals Yn(f, l) for each frame l (S107), and outputs them.
The conversion results at frequency f and frame l are handled as the vector
[0017]
y(f, l) = [Y1(f, l) Y2(f, l) ... YN(f, l)]<T>
[0018]
The microphone signal y(f, l), consisting of the N-channel frequency-domain microphone signals Yn(f, l), is decomposed by the following equation, y(f, l) = x(f, l) + v(f, l), into a multi-channel signal x(f, l) consisting of the direct wave of the target sound and a multi-channel signal v(f, l) consisting of its reflection and reverberation components and noise.
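For illustration only, a minimal Python sketch of the conversion performed by the Fourier transform unit 107 using scipy's STFT; the frame length and hop size are arbitrary illustrative choices, not values specified by the embodiment.

```python
import numpy as np
from scipy.signal import stft

def to_frequency_domain(y_time, fs, frame_len=512, hop=256):
    """Short-time Fourier transform of the N-channel microphone signals (S107).

    y_time : (N, T) array of time-domain signals y_n(t)
    fs     : sampling frequency in Hz
    returns (freqs, Y) where Y[n, f, l] = Y_n(f, l); the vector y(f, l) is Y[:, f, l]
    """
    freqs, _, Y = stft(y_time, fs=fs, nperseg=frame_len,
                       noverlap=frame_len - hop, axis=-1)
    return freqs, Y
```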
[0019]
<Noise and Arrival Wave Decomposition Unit 101> The noise and arrival wave decomposition unit 101 receives the frequency-domain microphone signal y(f, l) as input and calculates its spatial correlation matrix R(f, l) at frequency f and frame l. For example, it is calculated by the following formula. R(f, l) = E[y(f, l) y(f, l)<H>] (2)
[0020]
Here, E[·] denotes taking the expected value, and y(f, l)<H> is the vector obtained by transposing y(f, l) and taking its complex conjugate. In actual processing, short-time averaging is usually used instead of E[·]. The unit then obtains, from the spatial correlation matrix R(f, l), the estimated values pk(f, l) of the strengths of the arriving waves from the K directions and the estimated values qn(f, l) of the noise power contained in each microphone signal Yn(f, l) (S101), and outputs a diagonal matrix V(f, l) having pk(f, l) and qn(f, l) as its diagonal components. Here, k is an index of the direction of arrival, K directions are assumed as possible arrival directions of plane waves, and k = 1, 2, ..., K. The diagonal matrix V(f, l) is therefore expressed as follows.
[0021]
V(f, l) = diag(p1(f, l), ..., pK(f, l), q1(f, l), ..., qN(f, l))
[0022]
Note that K > N.
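For illustration only, a minimal Python sketch of how the spatial correlation matrix R(f, l) of formula (2) can be estimated in S101, using the short-time (here, recursive) averaging mentioned above in place of E[·]; the forgetting factor alpha is an illustrative assumption.

```python
import numpy as np

def spatial_correlation(Y, alpha=0.9):
    """Estimate R(f, l) = E[y(f, l) y(f, l)^H] by recursive short-time averaging.

    Y     : (N, F, L) array of frequency-domain microphone signals Y_n(f, l)
    alpha : forgetting factor of the recursive average (illustrative value)
    returns R with shape (F, L, N, N)
    """
    N, F, L = Y.shape
    R = np.zeros((F, L, N, N), dtype=complex)
    for l in range(L):
        y = Y[:, :, l].T                                  # (F, N): y(f, l) for every f
        inst = y[:, :, None] * y[:, None, :].conj()       # y(f, l) y(f, l)^H per frequency
        R[:, l] = inst if l == 0 else alpha * R[:, l - 1] + (1.0 - alpha) * inst
    return R
```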
As a method of obtaining the estimated values pk(f, l) of the intensity and the estimated values qn(f, l) of the noise power, for example, the method of Reference 1 can be used. (Reference 1) P. Stoica, P. Babu, and J. Li, "SPICE: A sparse covariance-based estimation method for array processing," IEEE Transactions on Signal Processing, vol. 59, no. 2, pp. 629-638, 2011.
[0023]
In this method, it is assumed in advance that there are K (> N) directions from which plane waves can arrive. Let ak(f) = [ak,1(f) ak,2(f) ... ak,N(f)]<T> be the response (output signals) of the microphones when a plane wave of amplitude 1 arrives at the microphone array from the k-th direction at frequency f; ak,n(f) represents the response (output signal) of the n-th microphone to an incoming plane wave of amplitude 1 from the k-th direction at frequency f. Note that ak(f) is obtained in advance, prior to sound collection; it may be obtained by experiment (actual measurement) or simulation, or a theoretical value obtained by calculation may be used. Using the matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] (3) consisting of the K response vectors ak(f) and the N × N identity matrix IN, Reference 1 decomposes the matrix R(f, l) into the product of the matrix A(f)<H>, the diagonal matrix V(f, l), and the matrix A(f), in the form R(f, l) = A(f)<H> V(f, l) A(f) (4). By this decomposition, the estimated value pk(f, l) of the intensity of the plane wave from the k-th direction and the estimated value qn(f, l) of the noise power of the n-th microphone 91-n, both contained in the diagonal matrix V(f, l), are obtained. In practice, the above decomposition corresponds to finding the diagonal matrix V(f, l) that minimizes || (A(f)<H> V(f, l) A(f))<-1/2> (R(f, l) - A(f)<H> V(f, l) A(f)) R(f, l)<-1/2> ||<2> (5). In equation (5), || x || denotes the Frobenius norm of the matrix x.
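For illustration only, the following Python sketch shows one simplified way to obtain a decomposition of the form of equation (4): it builds an illustrative far-field steering matrix A(f)<H> for a linear array and fits the diagonal of V(f, l) by nonnegative least squares. This is a surrogate for, not a reproduction of, the SPICE method of Reference 1, and the array geometry and candidate directions are assumptions.

```python
import numpy as np
from scipy.optimize import nnls

def steering_matrix(freq, mic_pos, K, c=343.0):
    """Illustrative far-field steering vectors a_k(f) for a linear array.

    Assumption: the embodiment allows a_k(f) to be measured, simulated, or
    computed; this closed-form model is only one possible choice.
    mic_pos : (N,) microphone positions along a line, in metres
    returns A_H with shape (N, K + N): the matrix A(f)^H = [a_1 ... a_K I_N]
    """
    angles = np.linspace(0.0, np.pi, K, endpoint=False)     # K candidate directions
    delays = np.outer(mic_pos, np.cos(angles)) / c          # (N, K) propagation delays
    a = np.exp(-2j * np.pi * freq * delays)                 # columns are a_k(f)
    return np.hstack([a, np.eye(len(mic_pos))])             # append the N x N identity

def decompose_R(R, A_H):
    """Fit R(f, l) ~ A(f)^H V(f, l) A(f) with nonnegative diagonal
    V = diag(p_1..p_K, q_1..q_N), by nonnegative least squares.

    Simplified surrogate for the SPICE criterion (5); not Reference 1 itself.
    """
    N, M = A_H.shape                                        # M = K + N
    # each diagonal entry v_m contributes v_m * c_m c_m^H (c_m = m-th column of A^H)
    basis = np.stack([np.outer(A_H[:, m], A_H[:, m].conj()).ravel()
                      for m in range(M)], axis=1)
    # stack real and imaginary parts so the NNLS problem is real-valued
    G = np.vstack([basis.real, basis.imag])
    b = np.concatenate([R.ravel().real, R.ravel().imag])
    v, _ = nnls(G, b)
    return v[:M - N], v[M - N:]                             # p_k(f, l), q_n(f, l)
```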
[0024]
<Target Sound Determination Unit 103> The target sound determination unit 103 obtains an estimated value kt of the arrival direction of the target sound (S103) and outputs it. For example, the target sound determination unit 103 receives the diagonal matrix V(f, l) as input and, from the estimated values pk(f, l) of the intensity for each arrival direction k contained in the diagonal matrix V(f, l), determines the direction with the largest intensity to be the arrival direction of the target sound (S103), and outputs the determination result (the estimated value of the arrival direction) kt. In this example, the target sound determination unit 103 obtains the estimated value kt of the arrival direction of the target sound using the estimated values pk(f, l) of the intensity in the 100 to 500 Hz band, in which voice power is concentrated. In this band, the strength for each arrival direction k is given by
[0025]
b(k, l) = Σ f=f0..f1 pk(f, l)
[0026]
In this example, f0 corresponds to 100 Hz and f1 corresponds to 500 Hz. The k that maximizes b(k, l) is determined to be the arrival direction kt of the target sound at frame l. The incoming wave from direction kt is regarded as the target sound, and incoming waves from directions other than kt are regarded as the non-target sound.
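For illustration only, a minimal Python sketch of the band-limited direction decision of S103; the 100 to 500 Hz band follows the example in the text, and the frequency axis freqs is assumed to come from the short-time Fourier transform.

```python
import numpy as np

def estimate_target_direction(P, freqs, f_lo=100.0, f_hi=500.0):
    """Pick the arrival direction kt of the target sound.

    P     : (K, F, L) array of intensity estimates p_k(f, l)
    freqs : (F,) frequency axis of the STFT bins
    returns kt with shape (L,): the direction index maximizing b(k, l)
    """
    band = (freqs >= f_lo) & (freqs <= f_hi)
    b = P[:, band, :].sum(axis=1)      # b(k, l): sum of p_k(f, l) over the band
    return np.argmax(b, axis=0)        # kt for each frame l
```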
[0027]
<Correlation Matrix Synthesis Unit 104> The correlation matrix synthesis unit 104 receives the diagonal matrix V(f, l) and the estimated value kt of the arrival direction as input. Using the matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] and a matrix Vs(f, l) in which all elements other than the (kt, kt) element of the diagonal matrix V(f, l) are set to 0 (that is, among the K values pk(f, l) contained in V(f, l), Vs(f, l) keeps only the estimated intensity of the sound coming from the estimated arrival direction of the target sound and sets all other elements to 0), it obtains an estimated value R^T(f, l) of the correlation matrix of the target sound and an estimated value R^NT(f, l) of the correlation matrix of the non-target sound (S104) and outputs them. As described above, ak(f) is obtained in advance, prior to sound collection.
[0028]
For example, the correlation matrix synthesis unit 104 obtains the estimated value R^T(f, l) of the correlation matrix of the target sound and the estimated value R^NT(f, l) of the correlation matrix of the non-target sound by the following equations.
R^T(f, l) = A(f)<H> Vs(f, l) A(f)
R^NT(f, l) = A(f)<H> (V(f, l) - Vs(f, l)) A(f) (7)
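For illustration only, a minimal Python sketch of the synthesis in S104 for one frequency and frame, following the two equations above; the variable names are illustrative.

```python
import numpy as np

def synthesize_correlations(A_H, p, q, kt):
    """First-embodiment correlation matrix synthesis (S104).

    A_H : N x (K+N) matrix A(f)^H = [a_1 ... a_K I_N]
    p   : (K,) intensity estimates p_k(f, l)
    q   : (N,) noise power estimates q_n(f, l)
    kt  : estimated arrival direction index of the target sound
    """
    V = np.diag(np.concatenate([p, q]))      # diagonal matrix V(f, l)
    Vs = np.zeros_like(V)
    Vs[kt, kt] = V[kt, kt]                   # keep only the (kt, kt) element
    A = A_H.conj().T                         # A(f) is the conjugate transpose of A(f)^H
    R_T = A_H @ Vs @ A                       # R^T(f, l)  = A^H Vs A
    R_NT = A_H @ (V - Vs) @ A                # R^NT(f, l) = A^H (V - Vs) A
    return R_T, R_NT
```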
[0029]
<Array Filtering Unit 105> The array filtering unit 105 takes as input the frequency-domain microphone signal y(f, l), the estimated value R^T(f, l) of the correlation matrix of the target sound, and the estimated value R^NT(f, l) of the correlation matrix of the non-target sound.
[0030]
First, a filter coefficient vector (an N-dimensional complex vector) h(f, l) is obtained using the estimated value R^T(f, l) of the correlation matrix of the target sound and the estimated value R^NT(f, l) of the correlation matrix of the non-target sound. For example, the filter coefficient vector is determined using the MVDR method based on the estimated value R^NT(f, l) of the correlation matrix. The MVDR method obtains the filter coefficient vector h(f, l) by solving the following constrained optimization problem.
[0031]
min over h(f, l) of h<H>(f, l) R^NT(f, l) h(f, l), subject to h<H>(f, l) g<(d)>(f) = g1<(d)>(f)
[0032]
Here, g<(d)>(f) is a vector consisting of the frequency transfer characteristics of the direct path from the sound source position of the target sound to each microphone.
g1 <(d)> (f) is the frequency transfer characteristic of the direct path from the sound source
position of the target sound to the microphone 91-1 as a reference. In this example, the
microphone serving as the reference is the microphone 91-1, but any one of the other
microphones 91-2 to 91-N may be used as the reference.
[0033]
This constraint condition can be rewritten using the sound source signal S(f, l) of the target sound and the signal component X1<(d)>(f, l) that reaches the microphone 91-1 directly from the sound source of the target sound. Note that X1<(d)>(f, l) = g1<(d)>(f) S(f, l).
[0034]
Multiplying the above constraint condition equation from the right by S(f, l) X1<(d)*>(f, l) and taking the expected value (the superscript * denotes the complex conjugate), the rewritten constraint condition becomes h<H>(f, l) E[x<(d)>(f, l) X1<(d)*>(f, l)] = E[X1<(d)>(f, l) X1<(d)*>(f, l)] (9), where x<(d)>(f, l) is the vector of signal components that reach each microphone directly from the sound source of the target sound.
[0035]
Here, E[·] on the left side of Expression (9) is the first column vector of the estimated value R^T(f, l) of the correlation matrix of the target sound, and E[·] on the right side is the (1, 1) element of the estimated value R^T(f, l) of the correlation matrix of the target sound. The source signal S(f, l) of the target sound and the frequency transfer characteristics g<(d)>(f) from the source of the target sound to each microphone are unknown. However, the statistical procedure of taking the above expected values makes it possible to obtain the coefficients of the new constraint condition from the estimated value R^T(f, l) of the correlation matrix of the target sound.
[0036]
The array filtering unit 105 applies the obtained filter coefficient vector h(f, l) to the microphone signal y(f, l) (see the following equation) to obtain an output signal z(f, l) (S105), and outputs it. z(f, l) = h<H>(f, l) y(f, l) (15) With such a configuration, it is possible to extract the frequency-f component of the target sound.
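For illustration only, a minimal Python sketch of how the filter of S105 could be computed from the rewritten constraint (9) and applied as in equation (15); the closed-form solution of the constrained minimization and the small diagonal loading term eps are standard MVDR practice stated here as assumptions, not the embodiment's exact computation.

```python
import numpy as np

def mvdr_filter(R_T, R_NT, eps=1e-6):
    """Filter coefficient vector h(f, l) from the rewritten constraint (9).

    Solves  min_h h^H R_NT h  subject to  h^H c = d,
    where c is the first column of R^T(f, l) and d is its (1, 1) element.
    """
    N = R_NT.shape[0]
    c = R_T[:, 0]                    # E[x^(d) X1^(d)*]: first column of R^T
    d = R_T[0, 0]                    # E[X1^(d) X1^(d)*]: (1, 1) element of R^T
    Rinv_c = np.linalg.solve(R_NT + eps * np.eye(N), c)   # diagonal loading (assumption)
    return np.conj(d) * Rinv_c / (c.conj() @ Rinv_c)

def apply_filter(h, y):
    """Equation (15): z(f, l) = h^H(f, l) y(f, l)."""
    return h.conj() @ y
```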
[0037]
<Inverse Fourier Transform Unit 108> The inverse Fourier transform unit 108 receives the frequency-domain output signal z(f, l), performs a short-time inverse Fourier transform on the processing results at all frequencies (S108), and obtains and outputs the time-domain signal z(t).
[0038]
<Effects> With the above configuration, it is possible to estimate each correlation matrix from the
microphone signal in which the target sound and the non-target sound are mixed, and to extract
the target sound using the MVDR method.
[0039]
Modified Example: In the present embodiment, the diagonal matrix V(f, l) is output from the noise/arrival wave decomposition unit 101 to the target sound determination unit 103, but only the estimated values pk(f, l) of the intensity contained in the diagonal matrix V(f, l) may be output instead.
The point is that the target sound determination unit 103 only needs to be able to determine the
arrival direction of the target sound.
[0040]
In this embodiment, the sound collection device 100 includes the Fourier transform unit 107 and the inverse Fourier transform unit 108. However, the Fourier transform unit 107 and the inverse Fourier transform unit 108 may be separate devices, in which case the sound collection device 100 may receive the frequency-domain microphone signal y(f, l) as input or may output the frequency-domain output signal z(f, l).
[0041]
Second Embodiment: The description will focus on the parts that differ from the first embodiment.
[0042]
FIG. 2 shows a functional block diagram of the sound collection device 200 according to the
second embodiment, and FIG. 3 shows its processing flow.
[0043]
The sound collection device 200 includes a noise/arrival wave decomposition unit 101, an intensity correction unit 202 (indicated by a broken line in FIG. 2), a target sound determination unit 103, a correlation matrix synthesis unit 204, an array filtering unit 105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.
[0044]
<Intensity Correction Unit 202> The intensity correction unit 202 receives the diagonal matrix V(f, l) as input, compares the total signal power of the spatial correlation matrix R(f, l) with the total signal power obtained from the matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] and the diagonal matrix V(f, l), obtains a correction coefficient β(f, l) (S202, shown by dashed lines in FIG. 3), and outputs it.
[0045]
For example, the intensity correction unit 202 obtains the correction coefficient β(f, l) by the following equation.
[0046]
β(f, l) = tr(R(f, l)) / tr(A(f)<H> V(f, l) A(f))
[0047]
Here, tr(·) is a function that takes the trace of a matrix.
For example, the spatial correlation matrix R(f, l) may be the one calculated by the noise/arrival wave decomposition unit 101, and the matrix A(f)<H> may be the one obtained in advance, prior to sound collection, by the method described in the first embodiment.
[0048]
<Correlation Matrix Synthesis Unit 204> The correlation matrix synthesis unit 204 receives the spatial correlation matrix R(f, l), the correction coefficient β(f, l), the diagonal matrix V(f, l), and the estimated value kt of the direction of arrival as input. Using the spatial correlation matrix R(f, l), the correction coefficient β(f, l), the matrix A(f)<H> = [a1(f) a2(f) ... aK(f) IN] obtained in advance prior to sound collection, and the matrix Vs(f, l) in which all elements other than the (kt, kt) element of the diagonal matrix V(f, l) are 0, it obtains an estimated value R^NT(f, l) of the correlation matrix of the non-target sound (S204) and outputs it.
For example, the estimated value R^NT(f, l) of the correlation matrix is obtained by the following equation.
R^NT(f, l) = R(f, l) - β(f, l) A(f)<H> Vs(f, l) A(f)
Note that the estimated value R^T(f, l) of the correlation matrix of the target sound can be obtained by the same method as in the first embodiment.
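For illustration only, a minimal Python sketch of the second-embodiment correlation matrix estimation; the trace-ratio form of beta(f, l) is an assumption based on the description in [0044] to [0047], and the final line follows the equation for R^NT(f, l) above.

```python
import numpy as np

def correction_coefficient(R, A_H, V):
    """Correction coefficient beta(f, l) of the intensity correction unit 202.

    Assumption: beta is taken as the ratio of the total signal power tr(R(f, l))
    to the total power tr(A(f)^H V(f, l) A(f)) of the decomposition, which is one
    reading of [0044]-[0047]; the published formula itself is not reproduced here.
    """
    model_power = np.trace(A_H @ V @ A_H.conj().T).real
    return np.trace(R).real / model_power

def synthesize_R_NT_corrected(R, A_H, V, kt, beta):
    """Second-embodiment estimate R^NT(f, l) = R(f, l) - beta A(f)^H Vs(f, l) A(f)."""
    Vs = np.zeros_like(V)
    Vs[kt, kt] = V[kt, kt]                     # keep only the (kt, kt) element of V(f, l)
    return R - beta * (A_H @ Vs @ A_H.conj().T)
```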
[0049]
<Effects> With this configuration, the same effects as those of the first embodiment can be
obtained.
Furthermore, by using the correction coefficient, the correlation matrix of non-target sound can
be better determined.
[0050]
<Other Modifications> The present invention is not limited to the above embodiments and modifications. For example, the various processes described above may be executed not only in chronological order according to the description but also in parallel or individually, depending on the processing capability of the apparatus that executes the processing or as needed. In addition, changes can be made as appropriate without departing from the spirit of the present invention.
[0051]
<Program and Recording Medium> Further, various processing functions in each device
described in the above-described embodiment and modification may be realized by a computer.
In that case, the processing content of the function that each device should have is described by a
program.
By executing this program on a computer, the various processing functions of each of the above-described devices are realized on the computer.
[0052]
The program describing the processing content can be recorded in a computer readable
recording medium. As the computer readable recording medium, any medium such as a magnetic
recording device, an optical disc, a magneto-optical recording medium, a semiconductor memory,
etc. may be used.
[0053]
Further, this program is distributed, for example, by selling, transferring, lending, etc. a portable
recording medium such as a DVD, a CD-ROM or the like in which the program is recorded.
Furthermore, the program may be stored in a storage device of a server computer, and the
program may be distributed by transferring the program from the server computer to another
computer via a network.
[0054]
For example, a computer that executes such a program first temporarily stores, in its own storage unit, the program recorded on a portable recording medium or the program transferred from a server computer. Then, at the time of executing the processing, the computer reads the program stored in its storage unit and executes processing according to the read program. As another form of executing the program, the computer may read the program directly from the portable recording medium and execute processing according to the program, or each time a program is transferred from the server computer to this computer, processing according to the received program may be executed sequentially. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service that realizes the processing functions only through execution instructions and result acquisition, without transferring the program from the server computer to the computer. Note that the program includes information that is provided for processing by a computer and is equivalent to a program (such as data that is not a direct command to the computer but has a property that defines the processing of the computer).
[0055]
In addition, although each device is configured by executing a predetermined program on a
computer, at least a part of the processing content may be realized as hardware.