JP2018142917

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2018142917
Abstract: A plurality of sound sources can be localized simultaneously even when the number of sound sources is unknown. When observation data Y is given, a sound source position estimation unit 25 considers a variable N and a variable ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of a plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. With an objective function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ), the unit estimates the parameters of the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on the variational inference method, and thereby estimates the position of each sound source k. [Selected figure] Figure 3
Sound source localization apparatus, method, and program
[0001]
The present invention relates to a sound source localization apparatus, method, and program,
and more particularly to a sound source localization apparatus, method, and program for
estimating the position of a sound source from an acoustic signal.
[0002]
Source localization has a wide range of applications, such as radar and sonar.
In particular, it is important to be able to localize and track moving wave sources instantaneously with small arrays. Conventional methods for the source localization problem include the Multiple Signal Classification (MUSIC) method, the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method, and methods based on source-constrained partial differential equations.
[0003]
Since the MUSIC method and the GCC-PHAT method assume a plane-wave sound source and use the difference in arrival time between the sensors as the clue for localization, larger array sizes are generally advantageous. In addition, since both are based on statistics such as the autocorrelation and cross-correlation functions of the received signals of the sensor array, a sufficiently long observation time is needed to localize a sound source with high accuracy. For these reasons, these methods are not always suitable for source localization with only a small array and instantaneous observation. On the other hand, the method based on the source-constrained partial differential equation performs sound source localization based on a spatiotemporal partial differential equation of the acoustic signal that holds at each time, and can, in theory, perform source localization using only instantaneous, small-area observation.
[0004]
However, since this method is based on an equation that holds for a single wave source, it has the drawback of being vulnerable when the observed acoustic signal deviates from the partial differential equation, as in the presence of noise or multiple point sources. To extend this framework to the case where random noise and multiple point sound sources exist, Non-Patent Document 1 proposes a probability model of the array observation signals based on a source-constrained partial differential equation and a source localization algorithm based on it.
[0005]
Atsushi Suzuki, Hirokazu Kameoka, "Probability modeling and multiple source localization
algorithm of acoustic signal based on source constrained difference equation," Proceedings of the
Acoustical Society of Japan, pp.615-618, March 2016
[0006]
Although it is necessary to assume the number of sound sources in the method of Non-Patent
Document 1, the number of sound sources is often unknown in a real environment.
Therefore, it is desirable to be able to perform sound source localization according to the actual
number of sound sources while adapting the complexity of the model without assuming the
number of sound sources.
[0007]
The present invention has been made in view of the above circumstances, and it is an object of the present invention to provide a sound source localization apparatus, method, and program capable of simultaneously localizing a plurality of sound sources even when the number of sound sources is unknown.
[0008]
In order to achieve the above object, a sound source localization apparatus according to the present invention estimates the position of each of a plurality of sound sources from an observation signal in which sound source signals from the plurality of sound sources input by a microphone array are mixed. The sound source localization apparatus comprises: a spatial difference calculation unit that calculates, for each of a plurality of directions, a difference between the observation signals input by a pair of microphones aligned in that direction in the microphone array; a time-frequency expansion unit that outputs an observation time-frequency component of each frequency, taking as input the observation signal input by a reference microphone in the microphone array, and outputs an observation time-frequency component of each frequency for each of the plurality of directions, taking as input the difference of the observation signals calculated for each of the plurality of directions by the spatial difference calculation unit; a parameter estimation unit that, when observation data Y consisting of the observation time-frequency component of each frequency of the reference microphone and the observation time-frequency component of each frequency for each of the plurality of directions output by the time-frequency expansion unit is given, considers a variable N and a variable ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise, and, with an objective function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ), estimates each parameter representing the distribution of each of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ so as to minimize the objective function based on the variational inference method; and a sound source position estimation unit configured to estimate the position of each sound source k based on the estimated parameter representing the distribution of the variable N representing the position of each sound source k and the estimated parameter representing the distribution of the variable ρ.
[0009]
A sound source localization method according to the present invention is a sound source localization method in a sound source localization apparatus for estimating the position of each of a plurality of sound sources from an observation signal in which sound source signals from the plurality of sound sources input by a microphone array are mixed. In this method, a spatial difference calculation unit calculates, for each of a plurality of directions, a difference between the observation signals input by a pair of microphones aligned in that direction in the microphone array; a time-frequency expansion unit outputs an observation time-frequency component of each frequency, taking as input the observation signal input by a reference microphone in the microphone array, and outputs an observation time-frequency component of each frequency for each of the plurality of directions, taking as input the difference of the observation signals calculated for each of the plurality of directions by the spatial difference calculation unit; a parameter estimation unit, when observation data Y consisting of the observation time-frequency component of each frequency of the reference microphone and the observation time-frequency component of each frequency for each of the plurality of directions output by the time-frequency expansion unit is given, considers a variable N and a variable ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise, and, with a function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ) as the objective function, estimates each parameter representing the distribution of each of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ so as to minimize the objective function based on the variational inference method; and a sound source position estimation unit estimates the position of each sound source k based on the estimated parameter representing the distribution of the variable N representing the position of the sound source k and the estimated parameter representing the distribution of the variable ρ.
[0010]
A program according to the present invention is a program for causing a computer to function as
each part of the sound source localization apparatus described above.
[0011]
As described above, according to the sound source localization apparatus, method, and program of the present invention, when observation data Y consisting of the observation time-frequency components of each frequency of the reference microphone and the observation time-frequency components of each frequency for each of a plurality of directions is given, a variable N and a variable ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise are considered. With a function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ) as the objective function, each parameter representing the distribution of each of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ is estimated so as to minimize the objective function based on the variational inference method, and the position of each sound source k is estimated. This makes it possible to simultaneously localize a plurality of sound sources even when the number of sound sources is unknown.
[0012]
FIG. 1 is a diagram showing a spherical wave arriving at an observation point r from a point sound source.
FIG. 2 is a diagram showing an example of the arrangement of a microphone array.
FIG. 3 is a schematic diagram showing the configuration of a sound source localization apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart showing the contents of the sound source localization processing routine in the sound source localization apparatus according to the embodiment of the present invention.
FIG. 5 is a diagram showing the sound source positions and the microphone positions in an experiment.
FIG. 6 is a diagram showing experimental results under the conditions of a reverberation time of 0.5 s and an observation length of 2779 ms.
FIG. 7 is a diagram showing experimental results under the conditions of a reverberation time of 0.5 s and an observation length of 1665 ms.
FIG. 8 is a diagram showing experimental results under the conditions of a reverberation time of 0.2 s and an observation length of 2779 ms.
FIG. 9 is a diagram showing experimental results under the conditions of a reverberation time of 0.2 s and an observation length of 1665 ms.
[0013]
Hereinafter, embodiments of the present invention will be described in detail with reference to
the drawings. The technique proposed in the present invention is a signal processing technique
aiming to estimate a wave source position from an acoustic signal.
[0014]
<Summary of the Embodiment of the Present Invention> The embodiment of the present invention is a technology that enables source localization of a plurality of sound sources by small-area and instantaneous observation, retaining the advantages of the above-described conventional method.
[0015]
In the embodiment of the present invention, a probability distribution of the acoustic signal is constructed based on the time-frequency domain representation of the source-constrained partial differential equation and on the assumption of sparsity of all sound sources including noise (the assumption that, in the time-frequency representation of an acoustic signal in which a plurality of sound sources are mixed, only one sound source is dominant at each time-frequency point). A variational inference algorithm then estimates which sound source appears to be dominant at each time-frequency point and performs localization of each sound source.
[0016]
Furthermore, in the embodiment of the present invention, an approach based on the Dirichlet process mixture model makes it possible to perform sound source localization according to the actual number of sound sources, adapting the complexity of the model without assuming the number of sound sources in advance.
[0017]
<Principle of Embodiment of the Present Invention> Next, the principle of estimating the position
of a sound source will be described.
[0018]
<Source-Constrained Partial Differential Equation> As shown in FIG. 1, consider an observation point taken as the reference and a single wave source, each specified by its position vector.
Let the signal of the wave source be g(t) and the sound speed be c. Assuming spherical-wave propagation from the single point wave source, the observed value at the observation point is expressed as
[0019]
[0020]
Here,
[0021]
[0022]
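Since the equation images are not reproduced in this machine translation, the following is a hedged reconstruction of the standard spherical-wave expression the passage describes (the symbols r, r_s, and R are introduced here for illustration and are not taken from the patent):

\[ f(\mathbf{r}, t) = \frac{1}{4\pi R}\, g\!\left(t - \frac{R}{c}\right), \qquad R = \lVert \mathbf{r} - \mathbf{r}_s \rVert, \]

where r is the observation point and r_s is the position of the point wave source.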
Taking the unit vector from the observation point toward the source direction, since
[0023]
[0024]
holds, the spatial derivative of the observed signal is
[0025]
[0026]
Also, the time derivative of the observed signal is
[0027]
[0028]
Therefore, substituting Eq. (1) and Eq. (8) into Eq. (7) eliminates g, yielding
[0029]
[0030]
an equation that contains only the observed signal and its time and space derivatives.
Here, R is the distance from the observation point to the wave source.
This equation is called the sound source constraint equation.
As described above, the sound source constraint equation is a partial differential equation that holds for an arbitrary sound source signal waveform and describes a unique relationship between the position of the sound source and the spatial field.
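The constraint equation itself appears only as an image in this translation. A hedged reconstruction consistent with the surrounding text (unit vector n from the observation point toward the source, source distance R, sound speed c; obtained by eliminating g from the spherical-wave solution, so this is a sketch rather than the patent's verbatim equation) is:

\[ \mathbf{n} \cdot \nabla f \;-\; \frac{1}{c}\frac{\partial f}{\partial t} \;-\; \frac{f}{R} \;=\; 0, \]

which involves only the observed signal f, its time and space derivatives, and the unknown source parameters n and R.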
[0031]
<Probabilistic Modeling of the Acoustic Signal Based on the Source Constraint Equation> In order to approximate the spatial derivatives of the observation signal by spatial differences, the microphone array shown in FIG. 2 is assumed below.
However, any arrangement of the microphones is acceptable as long as the spatial derivatives of the observation signal can be approximated by spatial differences, and the following theory is not limited to the arrangement of FIG. 2.
In the case of the microphone array of FIG. 2, the signal f0,j at the reference point and the spatial difference in each direction can be obtained at each time tj using seven microphones.
Here, j represents the sample time index.
[0032]
At this time, equation (9) can be expressed as
[0033]
[0034]
Here, i = x, y, z, and nx, ny, nz are components of n in the x, y, z directions, respectively.
Transposing the left side of equation (10) to the right side yields
[0035]
[0036]
Here, f0,j, fx,j, fy,j, fz,j are assumed to be signals obtained by windowing with a window function.
Assuming that the influence of the two end points of the excised interval can be ignored, equation (11) is expressed as
[0037]
[0038]
Here, F0,m, Fx,m, Fy,m, Fz,m are the discrete Fourier transforms of f0,j, fx,j, fy,j, fz,j, respectively, and m is the discrete frequency index.
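As a hedged reconstruction of the frequency-domain relation this passage refers to (ω_m denotes the angular frequency of discrete bin m; the notation is introduced here for illustration and is not quoted from equation (12)):

\[ n_x F_{x,m} + n_y F_{y,m} + n_z F_{z,m} - \left( \frac{j\omega_m}{c} + \frac{1}{R} \right) F_{0,m} \approx 0. \]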
[0039]
The right side of equation (12) is not exactly zero in practice because of the error involved in the difference approximation.
Therefore, the right side of equation (11) is replaced with error variables as in
[0040]
[0041]
where the error variables are assumed to be mutually independent, zero-mean normal random variables (random variables following a complex normal distribution):
[0042]
[0043]
Also, let each frequency component of the observation signal at the observation point be a normal random variable with mean 0 and variance σ0,m^2.
This corresponds to assuming
[0044]
[0045]
[0046]
Here, define the following vectors by stacking the above quantities:
[0047]
[0048]
Equation (13) can then be written in the form
[0049]
[0050]
where the remaining quantities are given by
[0051]
[0052]
From Eqs. (14) and (16), the observation follows a complex normal distribution with the mean and variance-covariance matrix
[0053]
[0054]
that is, the complex normal distribution
[0055]
[0056]
Accordingly, it is expressed as
[0057]
[0058]
Therefore, from equation (22), the probability density function of the observation is obtained as
[0059]
[0060]
From the above, the problem of localizing a single sound source given the observation spectrum and its spatial differences reduces to the maximum likelihood estimation problem of solving
[0061]
[0062]
[0063]
<Localization Algorithm for Multiple Sound Sources Utilizing Source Sparsity> In many real-world acoustic signals, such as speech and musical tones, the time-frequency components are sparse.
Therefore, even when a plurality of sound sources are mixed simultaneously, it can often be assumed that only one sound source is dominant at each time-frequency point.
Based on this sparsity assumption and the probabilistic modeling above, the probability distribution of the observed signal can be derived for the case where a plurality of sound sources and noise are present.
[0064]
Let l be the frame time index of the extracted signal, and let k = 0, ..., K index the sources, where k = 0 corresponds to noise and k ≠ 0 to a point sound source.
Also, let each point sound source k have a given position.
Here, assuming sparsity of the time-frequency components of all sound sources including noise, only the zm,l-th source has non-zero power at frequency m and time l, and the power of the other sources is 0.
At this time, the conditional probability density function of the time-frequency components of the observed signal and its spatial differences (hereinafter collectively referred to as the observed signal), given zm,l, is given by
[0065]
[0066]
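The density itself is given only as an image. A form consistent with the text (a zero-mean complex normal density over the stacked observation y_{m,l} = (F_0, F_x, F_y, F_z)^T at frequency m and time l, with the covariance selected by the dominant-source index; the symbols are assumptions introduced for illustration) is:

\[ p\big(\mathbf{y}_{m,l} \mid z_{m,l} = k, \Theta\big) = \mathcal{N}_{\mathbb{C}}\big(\mathbf{y}_{m,l};\, \mathbf{0},\, \boldsymbol{\Sigma}_{m,l,k}\big). \]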
Here,
[0067]
[0068]
This is the variance-covariance matrix of the time-frequency components of the noise, given as the product of a normalized variance-covariance matrix model that depends only on frequency and a noise power that also depends on time.
[0069]
[0070]
The noise covariance matrix is assumed to be represented as above. The method of setting it when diffuse noise is assumed will be described later. Θ denotes the set of all unknown parameters. The probability density function of the observation signal (the likelihood function of Θ) can be written as
[0071]
[0072]
From the above, the problem of estimating the position of each sound source in the presence of multiple sound sources and noise, given the observation signal, reduces to the maximum likelihood estimation problem of solving
[0073]
[0074]
The global solution of this optimization problem cannot be obtained analytically, but stationary points can be searched for by the Expectation-Maximization (EM) algorithm.
[0075]
<Non-Parametric Bayesian Modeling> In real environments, the number of sound sources is often
unknown.
In the above formulation, the number of sound sources K is assumed to be known, but it is preferable to be able to perform sound source localization according to the actual number of sound sources, adapting the complexity of the model without assuming the number of sound sources. Therefore, in the present embodiment, the above generative model is extended to a Dirichlet process mixture model. Let k = 0 denote the noise, let k = 1, 2, ... index the point sources, and let the dominant-source index be a random variable generated according to a countably infinite-dimensional discrete distribution.
[0076]
[0077]
Here, the probability π0 that noise is dominant at point (m, l) (that is, zm,l = 0) is generated according to a beta distribution with given hyperparameters,
[0078]
[0079]
and the probability πk that point sound source k ≠ 0 becomes dominant (zm,l = k) is determined according to the stick-breaking process
[0080]
[0081]
Since the expected value of the weight generated by the above process tends to decrease exponentially as k increases, the probability that a sound source with a large index k becomes active is low.
Therefore, when inferring the parameters from the observation signal, this has the effect of explaining the observation signal with a mixture model that uses the minimum necessary number of sound source indices.
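As a concrete illustration of the stick-breaking construction (the patent presents the corresponding equations only as images; this Python sketch, including the hyperparameter alpha and the truncation level K, is an assumption for illustration, and it omits the separate beta-distributed noise probability π0 described above):

import numpy as np

def stick_breaking(alpha: float, K: int, rng: np.random.Generator) -> np.ndarray:
    """Draw mixture weights pi_1..pi_K by breaking a unit-length stick."""
    v = rng.beta(1.0, alpha, size=K)                       # ratios V_k ~ Beta(1, alpha)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining                                   # pi_k = V_k * prod_{j<k}(1 - V_j)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=1.0, K=20, rng=rng)
# E[pi_k] decays roughly geometrically in k, so sources with large indices are
# a priori unlikely to be active -- the model-complexity adaptation noted above.
print(pi[:5], pi.sum())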
In the above generation model, all unknown variables Θ are as follows.
[0082]
[0083]
<Variational Inference Algorithm> From the above generative modeling, we can write
[0084]
[0085]
Also, assuming the other prior distributions p(V), p(N), p(ρ), p(λ), p(ζ), and p(γ) as
[0086]
[0087]
the joint distribution of the observation signal and the unknown variables Θ can be written explicitly as
[0088]
[0089]
Here, the factors denote the von Mises-Fisher distribution, the real normal distribution, and the gamma distribution, respectively.
Although it is difficult to obtain the posterior distribution of Θ analytically, an approximate distribution can be obtained by iterative calculation based on the variational inference method.
Variational inference takes as its objective function, equation (55), the Kullback-Leibler (KL) divergence between the posterior distribution and a nonnegative function q(Θ) that satisfies the normalization constraint:
[0090]
[0091]
Assuming that q(Θ) can be approximated in the factorized form
[0092]
[0093]
an approximate posterior distribution of Θ can be obtained by iteratively minimizing the objective function of equation (55) with respect to each factor.
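The objective function appears only as an image; the standard KL objective of variational inference, which equation (55) plausibly expresses (a reconstruction, not a quotation), is:

\[ \mathrm{KL}\big(q(\Theta)\,\big\|\,p(\Theta \mid Y)\big) = \int q(\Theta)\,\log\frac{q(\Theta)}{p(\Theta \mid Y)}\,d\Theta = \log p(Y) - \mathbb{E}_{q}\!\left[\log \frac{p(Y,\Theta)}{q(\Theta)}\right], \]

so minimizing the KL divergence over q is equivalent to maximizing the evidence lower bound, since log p(Y) does not depend on q.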
Also, the following truncated approximation is applied:
[0094]
[0095]
This approximation does not mean that the complexity (number of sound sources) of the model is fixed; it means that the function space of q is restricted to a certain range.
Therefore, to approximate as closely as desired, the truncation level can be enlarged to widen the admissible range.
[0096]
By setting the variation of equation (55) with respect to each factor of q to zero, we obtain the following equations, which are called the variational posterior distribution update equations.
[0097]
[0098]
Here, the brackets denote the expectation with respect to the variational distributions of all variables other than X, given by an integral if X is a continuous variable and by a sum if X is a discrete variable. The complementary set denotes all elements of Θ except X.
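The update equations appear only as images; in standard mean-field variational inference the optimal factor for each variable X takes the form (a reconstruction, with Θ∖X denoting all unknown variables other than X):

\[ q^{*}(X) \;\propto\; \exp\!\Big( \mathbb{E}_{q(\Theta \setminus X)}\big[\log p(Y, \Theta)\big] \Big), \]

which is consistent with each update below taking the expectation of log p(Θ, Y) with respect to all the other factors.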
The derivation is described in the next section; each variational posterior distribution update equation has the following form.
[0099]
[0100]
<Derivation of the Variational Update Formulas> <Joint Distribution> log p(Θ, Y) can be written explicitly as follows according to the generative model established above.
[0101]
[0102]
<Variational posterior distribution update formula for the sound source direction N> The term related to N is
[0103]
[0104]
and
[0105]
[0106]
Therefore, the variational posterior distribution update equation of N is
[0107]
[0108]
Therefore, the expected value is as follows.
[0109]
[0110]
<Variational posterior distribution update formula for ρ (the reciprocal of the sound source distance)> The term related to ρ is
[0111]
[0112]
and
[0113]
[0114]
Therefore, the variational posterior distribution update equation of ρ is
[0115]
[0116]
Therefore, the expected value is as follows.
[0117]
[0118]
<Variational posterior distribution update formula for the active sound source index Z> The term related to Z is
[0119]
[0120]
and
[0121]
[0122]
where, when k = 0,
[0123]
[0124]
and when k ≠ 0,
[0125]
[0126]
Therefore, the variational posterior distribution update equation of Z is
[0127]
[0128]
Therefore, the expected value is as follows.
[0129]
[0130]
<Variational posterior distribution update formula for the stick-breaking ratio V> The term related to V is
[0131]
[0132]
and
[0133]
[0134]
Therefore, the variational posterior distribution update equation of V is
[0135]
[0136]
Therefore, the expected value is as follows.
[0137]
[0138]
Here, ψ denotes the digamma function.
[0139]
<Variational posterior distribution update formula for the precision (inverse variance) λ of the error variables> The term related to λ is
[0140]
[0141]
and
[0142]
[0143]
Therefore, the variational posterior distribution update equation of λ is
[0144]
[0145]
Therefore, the expected value is
[0146]
[0147]
<Variational posterior distribution update formula for ζ (the reciprocal of the sound source power)> The term related to ζ is
[0148]
[0149]
and
[0150]
[0151]
Therefore, the variational posterior distribution update equation of ζ is
[0152]
[0153]
Therefore, the expected value is as follows.
[0154]
[0155]
<Variational posterior distribution update formula for γ (the reciprocal of the noise power)> The term related to γ is
[0156]
[0157]
and
[0158]
[0159]
Therefore, the variational posterior distribution update formula of γ is
[0160]
[0161]
Therefore, the expected value is as follows.
[0162]
[0163]
<Noise Variance-Covariance Matrix W> Here, an example of setting the noise covariance matrix Wm when assuming the arrangement of the seven microphones shown in FIG. 2 is described.
The quantity considered below is the Fourier transform of the corresponding noise signal.
[0164]
The relationship between these quantities can be written as
[0165]
[0166]
and thus the variance-covariance matrix of the noise vector is given accordingly.
Therefore, for example, when spatially uncorrelated noise of equal power is assumed, the underlying correlation matrix is an identity matrix, and the variance-covariance matrix may simply be set as
[0167]
[0168]
[0169]
A sound field having a distribution in which the energy density is uniform and the energy flow in
all directions can be regarded as equal probability in a certain area is called a diffuse sound field,
and the sound field of the reverberant environment is well approximated. It is known to
represent.
In the diffuse sound field, the spatial correlation coefficient between two points depends only on
the distance d,
[0170]
[0171]
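The formula is given only as an image; the standard diffuse-field result for the spatial correlation coefficient between two points separated by distance d at angular frequency ω (a reconstruction) is:

\[ \frac{\sin(\omega d / c)}{\omega d / c} \;=\; \operatorname{sinc}\!\left(\frac{\omega d}{c}\right). \]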
Thus, assuming diffuse noise, in the example of the array geometry of FIG. 2, the correlation becomes
[0172]
[0173]
Using this, the variance-covariance matrix can be set accordingly.
In this case, it becomes a diagonal matrix of the form
[0174]
[0175]
[0176]
<Summary of variational inference algorithm> The variational posterior distribution update
equation for each variable is
[0177]
[0178]
The update equation of the parameters of each distribution is given by
[0179]
[0180]
Here, when k = 0,
[0181]
[0182]
and when k ≠ 0,
[0183]
[0184]
Also, the expected values appearing in the update equations are given as follows.
[0185]
[0186]
<System Configuration> Next, an embodiment of the present invention will be described, taking as an example a case where the present invention is applied to a sound source localization apparatus that estimates the positions of a plurality of sound sources from acoustic signals input by a microphone array.
[0187]
As shown in FIG. 3, the sound source localization apparatus 100 according to the embodiment of the present invention is configured by a computer including a CPU, a RAM, and a ROM storing a program for executing a sound source localization processing routine, and is functionally configured as follows.
[0188]
As shown in FIG. 3, the sound source localization apparatus 100 includes an input unit 10, an
arithmetic unit 20, and an output unit 90.
[0189]
The input unit 10 receives time-series data of an acoustic signal (hereinafter referred to as an observation signal) in which sound source signals from a plurality of sound sources are mixed, output from the microphones of the microphone array shown in FIG. 2.
[0190]
The calculation unit 20 includes a spatial difference calculation unit 22, a time frequency
expansion unit 24, and a sound source position estimation unit 25.
[0191]
The spatial difference calculation unit 22 acquires the observation signal f0,j at the reference-point microphone at each time tj from the observation signals output from the microphones of the microphone array, and also calculates the spatial differences fx,j, fy,j, fz,j in the x, y, z directions according to the following equations:
[0192]
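Since the difference equations appear only as images, the following Python sketch shows one plausible implementation, assuming the FIG. 2 layout (a center microphone plus one pair per axis at spacing d) and a central-difference approximation; the channel ordering and the spacing d are assumptions introduced for illustration:

import numpy as np

def spatial_differences(signals: np.ndarray, d: float):
    """Central-difference approximation of the spatial derivatives.

    signals: array of shape (7, T) holding the seven microphone signals,
             assumed ordered as [center, +x, -x, +y, -y, +z, -z].
    d:       distance between each outer microphone and the center [m].
    Returns the reference signal f0 and the spatial differences fx, fy, fz.
    """
    f0 = signals[0]
    fx = (signals[1] - signals[2]) / (2.0 * d)  # approximates df/dx at the center
    fy = (signals[3] - signals[4]) / (2.0 * d)  # approximates df/dy
    fz = (signals[5] - signals[6]) / (2.0 * d)  # approximates df/dz
    return f0, fx, fy, fz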
[0193]
The time-frequency expansion unit 24 calculates the observation time-frequency components F0,m of each frequency m from the observation signal f0,j at each time tj at the reference-point microphone obtained by the spatial difference calculation unit 22.
In addition, the time-frequency expansion unit 24 calculates the observation time-frequency components Fx,m, Fy,m, Fz,m of each frequency m from the spatial differences fx,j, fy,j, fz,j in the x, y, z directions at each time tj obtained by the spatial difference calculation unit 22.
In the present embodiment, time-frequency expansion such as short-time Fourier transform or
wavelet transform is performed.
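As an illustration, the short-time Fourier transform of the reference signal and of each spatial difference could be computed as follows (a minimal sketch using scipy; the 64 ms frame length and 32 ms overlap follow the settings mentioned in the experiment section, and everything else is an assumption):

import numpy as np
from scipy.signal import stft

fs = 32000                  # sampling frequency [Hz], from the experiment section
nperseg = int(0.064 * fs)   # 64 ms frames
noverlap = int(0.032 * fs)  # 32 ms overlap

def tf_expand(x: np.ndarray) -> np.ndarray:
    """Observation time-frequency components X[m, l] of a signal x[j]."""
    _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return X

# With f0, fx, fy, fz as returned by spatial_differences() in the sketch above,
# F0, Fx, Fy, Fz together would form the observation data Y of the model:
# F0, Fx, Fy, Fz = (tf_expand(s) for s in (f0, fx, fy, fz))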
[0194]
The sound source position estimation unit 25 operates on the observation data Y consisting of the observation time-frequency components F0,m, Fx,m, Fy,m, Fz,m of each frequency m acquired by the time-frequency expansion unit 24. When this observation data Y is given, it considers a variable N and a variable ρ representing the positions of the plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise. With a function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ) as the objective function, the unit estimates each parameter representing the distribution of each of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ so as to minimize the objective function based on the variational inference method, and estimates the position of each sound source k.
[0195]
Specifically, the sound source position estimation unit 25 includes a variable update unit 28 and
a convergence determination unit 30.
[0196]
The variable update unit 28 first sets initial values of the parameters (hereinafter referred to as variational parameters) representing the distributions of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ.
[0197]
The variable update unit 28 then updates the variational parameters in accordance with the above equations (132) to (146) based on the observation data Y and the current variational parameters.
[0198]
The convergence determination unit 30 determines whether or not a predetermined convergence determination condition is satisfied; if it is not satisfied, the processing of the variable update unit 28 is repeated.
When the condition is satisfied, the convergence determination unit 30 obtains the direction vector n(k) and the sound source distance R(k) of each sound source k based on the finally obtained variational parameters, and the output unit 90 outputs the result of the estimation of the position of each sound source k.
[0199]
As the convergence determination condition, the condition that the number of iterations has reached a predetermined number may be used.
Alternatively, the condition that the ratio of each parameter before and after one update is regarded as approximately 1 (that is, the parameters hardly change) may be used.
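A minimal sketch of the estimation loop implied by the variable update unit 28 and the convergence determination unit 30 (the updates themselves, equations (132) to (146), are given only as images, so update_variational_params below is a hypothetical placeholder, and the tolerance and iteration budget are assumptions):

import numpy as np

def estimate(Y, init_params, update_variational_params,
             max_iters: int = 200, tol: float = 1e-4):
    """Iterate the variational updates until a convergence condition holds.

    Convergence is declared when the iteration budget is exhausted or when
    the parameters barely change over one update (relative change near 0,
    i.e. the before/after ratio is approximately 1).
    """
    params = init_params
    for _ in range(max_iters):
        new_params = update_variational_params(Y, params)
        change = max(
            float(np.max(np.abs(new - old) / (np.abs(old) + 1e-12)))
            for new, old in zip(new_params, params)
        )
        params = new_params
        if change < tol:
            break
    return params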
[0200]
<Operation of Sound Source Localization Device> Next, the operation of the sound source
localization device 100 according to the present embodiment will be described.
[0201]
When the input unit 10 receives time-series data of observation signals output from the
microphones of the microphone array, the sound source localization apparatus 100 executes a
sound source localization processing routine shown in FIG. 4.
[0202]
First, in step S120, the observation signal f0,j at the reference-point microphone is obtained at each time tj from the time-series data of the observation signals input from the microphones of the microphone array, and the spatial differences fx,j, fy,j, fz,j in the x, y, z directions are calculated.
[0203]
In step S121, the observation time-frequency components F0,m of each frequency m are calculated from the observation signal f0,j at each time tj at the reference-point microphone obtained in step S120.
Also, the observation time-frequency components Fx,m, Fy,m, Fz,m of each frequency m are calculated from the spatial differences fx,j, fy,j, fz,j in the x, y, z directions at each time tj.
[0204]
In step S122, initial values of the variational parameters are set.
[0205]
In step S124, the variational parameters are updated according to equations (132) to (146) based on the observation data Y calculated in step S121 and on the initial or previously updated parameter values.
[0206]
In step S125, it is determined whether or not a predetermined convergence determination
condition is satisfied. If the convergence determination condition is not satisfied, the process
returns to step S124.
On the other hand, if the convergence determination condition is satisfied, the process proceeds
to step S126.
[0207]
In step S126, the direction vector n(k) and the sound source distance R(k) of each sound source k are obtained as the estimation results of the position of each sound source k based on the final variational parameters obtained in step S124, and are output by the output unit 90; the sound source localization processing routine then ends.
[0208]
<Experiment> In order to verify the performance of the proposed method, numerical simulations of multiple sound source localization in a reverberant environment were performed.
A two-dimensional model in the x and y directions was used.
The room size was 6 m × 10 m × 4 m, and seven microphones were arranged at intervals of
0.03 m as shown in Fig. 2 above.
The reflection coefficients of the wall surfaces were 0.7308 and 0.4566 (corresponding to reverberation times of 0.5 s and 0.2 s, respectively, according to Sabine's reverberation formula).
The sampling frequency of the microphones was 32 kHz, the frame length of the short-time Fourier transform was 64 ms (with 32 ms overlap), and the total length of the observation signal was 2779 ms or 1665 ms.
The number of sound sources was three, and the positions of the sound sources were (1, 0, 0) m, (-0.5, 0.87, 0) m, and (-0.5, -0.87, 0) m from the center of the room (see FIG. 5).
In FIG. 5, the crosses indicate the sound source position, and the circles indicate the center
position of the microphone array.
The sound source signals were taken from the speech-rate-varied speech database (SRV-DB).
We evaluated the method based on variational inference (VBEM), which is the method of this embodiment, a method based on the EM algorithm assuming the correct number of sound sources (3) (EM1), and a method based on the EM algorithm assuming an incorrect number of sound sources (6) (EM2), and compared them with the conventional MUSIC method.
In the EM algorithm, stationary noise was assumed.
In each method, the presence and directions of sound sources were estimated based on a detection threshold. An estimate whose angular error with respect to a true sound source direction was within ±τ was counted as correct; a true direction with no such estimate was counted as a deletion error; and a detected direction that did not belong to any true sound source direction was counted as an insertion error. The F-measure was then calculated from these counts.
For each τ, the highest F-measure obtained as the detection threshold is varied is plotted in FIGS. 6 to 9.
FIG. 6 shows the localization accuracy of each method under the condition of reverberation time
0.5 s and observation length 2779 ms.
FIG. 7 shows the localization accuracy of each method under the condition of reverberation time
0.5 s and observation length 1665 ms.
FIG. 8 shows the localization accuracy of each method under the condition of reverberation time
0.2 s and observation length 2779 ms.
FIG. 9 shows the localization accuracy of each method under the condition of reverberation time
0.2 s and observation length 1665 ms.
In most cases, it was confirmed that the proposed method based on variational inference achieves more accurate localization than the other methods.
[0209]
As described above, according to the sound source localization apparatus according to the present embodiment, when observation data Y consisting of the observation time-frequency components of each frequency of the reference microphone and the observation time-frequency components of each frequency for each of a plurality of directions is given, a variable N and a variable ρ representing the positions of a plurality of sound sources k, a variable Z representing an indicator indicating the index of the sound source that becomes dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k becomes dominant, a variable λ representing the variance of the observation time-frequency components of each frequency for each of the plurality of directions, and a variable ζ and a variable γ representing the power of the time-frequency components of each frequency of each sound source k and of the noise are considered. With a function representing the divergence between the posterior distribution p(Θ|Y) of the unknown variables Θ including these variables and a variational function q(Θ) as the objective function, each parameter representing the distribution of each of the variable N, the variable ρ, the variable Z, the variable V, the variable λ, the variable ζ, and the variable γ is estimated so as to minimize the objective function based on the variational inference method, and the position of each sound source k is estimated. Thus, even when the number of sound sources is unknown, it is possible to simultaneously localize a plurality of sound sources.
[0210]
The present invention is not limited to the above-described embodiment, and various
modifications and applications can be made without departing from the scope of the present
invention.
[0211]
For example, although the above-described sound source localization apparatus contains a computer system internally, the "computer system" also includes a homepage providing environment (or display environment) when the WWW system is used.
[0212]
Furthermore, although the present invention has been described with an embodiment in which the program is installed in advance, the program can also be provided by being stored in a computer-readable recording medium.
[0213]
DESCRIPTION OF SYMBOLS: 10 input unit, 20 calculation unit, 22 spatial difference calculation unit, 24 time-frequency expansion unit, 25 sound source position estimation unit, 28 variable update unit, 30 convergence determination unit, 90 output unit, 100 sound source localization apparatus