Patent Translate Powered by EPO and Google
Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2018142917

Abstract: A plurality of sound sources can be localized simultaneously even when the number of sound sources is unknown. Given observation data Y, a sound source position estimation unit 25 treats as unknown variables Θ: a variable N and a variable ρ representing the positions of a plurality of sound sources k; a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; a variable V that determines the probability πk that each sound source k is dominant; a variable λ representing the variance of the observed time-frequency components of each frequency for each of a plurality of directions; and variables ζ and γ representing the powers of the time-frequency components of each frequency of each sound source k and of the noise. Taking as an objective function a divergence representing the difference between the posterior distribution p(Θ|Y) of the unknown variables Θ and a variational function q(Θ), the unit estimates the parameters representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function by the variational inference method, and thereby estimates the position of each sound source k. [Selected figure] Figure 3

Sound source localization apparatus, method, and program

[0001] The present invention relates to a sound source localization apparatus, method, and program, and more particularly to a sound source localization apparatus, method, and program for estimating the position of a sound source from an acoustic signal.

[0002] Source localization has a wide range of applications, such as radar and sonar.
In particular, it is important to be able to localize and track moving wave sources instantaneously with small arrays. Conventional methods for the source localization problem include the Multiple Signal Classification (MUSIC) method, the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method, and methods based on source-constrained partial differential equations.

[0003] Since the MUSIC method and the GCC-PHAT method assume a plane wave for each sound source and use the difference in arrival time between the sensors as the cue for localization, larger array sizes are generally advantageous. Moreover, since both are based on statistics such as the autocorrelation and cross-correlation functions of the sensor-array signals, a sufficiently long observation time must be taken to localize sound sources with high accuracy. For these reasons, these methods are not well suited to source localization with only a small array and instantaneous observation. In contrast, the method based on the source-constrained partial differential equation performs sound source localization using a spatio-temporal partial differential equation of the acoustic signal that holds at each time instant, so in theory localization is possible from instantaneous, small-aperture observation alone.

[0004] However, since this method relies on an equation that holds for a single wave source, it has the drawback of being fragile when the observed acoustic signal deviates from the partial differential equation, as happens with noise or multiple point sources.
To extend this framework to the case where random noise and multiple point sound sources exist, a probabilistic model of the array observation signals based on the source-constrained partial differential equation, and a source localization algorithm based on it, were proposed in Non-Patent Document 1.

[0005] Atsushi Suzuki, Hirokazu Kameoka, "Probabilistic modeling and multiple source localization algorithm of acoustic signals based on a source-constrained difference equation," Proceedings of the Acoustical Society of Japan, pp. 615-618, March 2016.

[0006] The method of Non-Patent Document 1 requires the number of sound sources to be assumed, but in real environments the number of sound sources is often unknown. It is therefore desirable to perform sound source localization according to the actual number of sound sources, adapting the complexity of the model without assuming the number of sound sources in advance.

[0007] The present invention has been made in view of the above circumstances, and its object is to provide a sound source localization apparatus, method, and program capable of localizing a plurality of sound sources simultaneously even when the number of sound sources is unknown.

[0008] In order to achieve the above object, a sound source localization apparatus according to the present invention estimates the position of each of a plurality of sound sources from an observation signal in which sound source signals from the plurality of sound sources, input by a microphone array, are mixed.
The localization apparatus comprises: a spatial difference calculation unit that, for each of a plurality of directions, calculates the difference between the observation signals input by a pair of microphones of the microphone array aligned in that direction; a time-frequency expansion unit that outputs the observed time-frequency component of each frequency, taking as input the observation signal input by a reference microphone of the microphone array, and that outputs the observed time-frequency component of each frequency for each of the plurality of directions, taking as input the differences of the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions; and a parameter estimation unit that operates on observation data Y consisting of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions output by the time-frequency expansion unit. Given the observation data Y, the unknown variables include: a variable N and a variable ρ representing the positions of the plurality of sound sources k; a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; a variable V for determining the probability πk that each sound source k is dominant; and the variance of the observed time-frequency components of each frequency for each of the plurality of directions.
The unknown variables Θ include a variable λ representing this variance and variables ζ and γ representing the powers of the time-frequency components of each frequency of each sound source k and of the noise. Taking as an objective function a divergence representing the difference between the posterior distribution p(Θ|Y) of the unknown variables Θ and a variational function q(Θ), the parameter estimation unit estimates each parameter representing the distribution of each of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on the variational inference method. The apparatus further comprises a sound source position estimation unit configured to estimate the position of each sound source k based on the estimated parameter representing the distribution of the variable N and the estimated parameter representing the distribution of the variable ρ.

[0009] A sound source localization method according to the present invention is a sound source localization method in a sound source localization apparatus for estimating the position of each of a plurality of sound sources from observation signals in which sound source signals from the plurality of sound sources, input by a microphone array, are mixed. The spatial difference calculation unit calculates, for each of a plurality of directions, the difference between the observation signals input by a pair of microphones of the microphone array aligned in that direction. The time-frequency expansion unit outputs the observed time-frequency component of each frequency, taking as input the observation signal input by the reference microphone of the microphone array, and, taking as input the differences of the observation signals calculated by the spatial difference calculation unit for each of the plurality of directions,
outputs the observed time-frequency component of each frequency for each of the plurality of directions. The parameter estimation unit is given observation data Y consisting of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions output by the time-frequency expansion unit. The unknown variables Θ are: a variable N and a variable ρ representing the positions of the plurality of sound sources k; a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; a variable V for determining the probability πk that each sound source k is dominant; a variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions; and variables ζ and γ representing the powers of the time-frequency components of each frequency of each sound source k and of the noise. Taking as an objective function a divergence representing the difference between the posterior distribution p(Θ|Y) of the unknown variables Θ and a variational function q(Θ), each parameter representing the distribution of each of the variables N, ρ, Z, V, λ, ζ, and γ is estimated so as to minimize the objective function based on the variational inference method. The sound source position estimation unit then estimates the position of each sound source k based on the estimated parameter representing the distribution of the variable N representing the position of
the sound source k and the parameter representing the distribution of the variable ρ.

[0010] A program according to the present invention is a program for causing a computer to function as each unit of the sound source localization apparatus described above.

[0011] As described above, according to the sound source localization apparatus, method, and program of the present invention, given observation data Y consisting of the observed time-frequency components of each frequency of the reference microphone and the observed time-frequency components of each frequency for each of the plurality of directions, the unknown variables Θ are: variables N and ρ representing the positions of the plurality of sound sources k; a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; a variable V for determining the probability πk that each sound source k is dominant; a variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions; and variables ζ and γ representing the powers of the time-frequency components of each frequency of each sound source k and of the noise. By taking as an objective function a divergence representing the difference between the posterior distribution p(Θ|Y) and a variational function q(Θ), estimating each parameter representing the distribution of each of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on the variational inference method, and estimating the position of each sound source k, it is possible to localize a plurality of sound sources simultaneously even when the number of sound sources is unknown.

[0012] FIG. 1 is a diagram showing a spherical wave arriving at an observation point r from a point sound source.
FIG. 2 is a diagram showing an example of the arrangement of a microphone array. FIG. 3 is a schematic diagram showing the configuration of a sound source localization apparatus according to an embodiment of the present invention. FIG. 4 is a flowchart showing the content of a sound source localization processing routine in the sound source localization apparatus according to the embodiment of the present invention. FIG. 5 is a diagram showing the sound source positions and microphone positions in an experiment. FIG. 6 is a diagram showing experimental results under the conditions of a reverberation time of 0.5 s and an observation length of 2779 ms. FIG. 7 is a diagram showing experimental results under the conditions of a reverberation time of 0.5 s and an observation length of 1665 ms. FIG. 8 is a diagram showing experimental results under the conditions of a reverberation time of 0.2 s and an observation length of 2779 ms. FIG. 9 is a diagram showing experimental results under the conditions of a reverberation time of 0.2 s and an observation length of 1665 ms.

[0013] Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The technique proposed in the present invention is a signal processing technique aimed at estimating a wave source position from an acoustic signal.

[0014] <Summary of the Embodiment of the Present Invention> The embodiment of the present invention is a technology that enables source localization of a plurality of sound sources by small-aperture, instantaneous observation, retaining the advantages of the above-described conventional methods.
[0015] In the embodiment of the present invention, a probability distribution of the acoustic signal is constructed based on the time-frequency domain representation of the source-constrained partial differential equation and on a sparsity assumption for all sound sources including noise (the assumption that, in the time-frequency representation of an acoustic signal in which a plurality of sound sources are mixed, only one sound source is dominant at each time-frequency point). A variational inference algorithm then estimates which sound source is likely to be dominant at each time-frequency point and performs localization of each sound source.

[0016] Furthermore, in the embodiment of the present invention, a formulation based on the Dirichlet process mixture model makes it possible to perform sound source localization according to the actual number of sound sources, adapting the complexity of the model without assuming the number of sound sources.

[0017] <Principle of the Embodiment of the Present Invention> Next, the principle of estimating the position of a sound source will be described.

[0018] <Source-Constrained Partial Differential Equation> As shown in FIG. 1, let r be the reference position vector of the observation point and r_s be the position vector of the single wave source. Letting the signal of the wave source be g(t) and the sound speed be c, and assuming spherical-wave propagation from the single point source, the observed value f(r, t) at the observation point is expressed as

f(r, t) = g(t − R/c) / (4πR),   (1)

where R = |r − r_s|. Taking n as the unit vector from the observation point toward the source direction, n = (r_s − r)/R, the spatial derivative of f along n is

∂f/∂n = (1/c) g′(t − R/c)/(4πR) + g(t − R/c)/(4πR²),   (7)

and the time derivative of f is

∂f/∂t = g′(t − R/c)/(4πR).   (8)

Therefore, substituting Eq. (1) and Eq. (8) into Eq. (7) eliminates g, yielding

∂f/∂n − (1/c) ∂f/∂t − (1/R) f = 0,   (9)

an equation including only the observed signal and its time and space derivatives.
Here, R is the distance from the observation point to the wave source. Equation (9) is called the sound source constraint equation. As described above, the sound source constraint equation is a partial differential equation that holds for an arbitrary sound source signal waveform and describes a unique relationship between the position of the sound source and the spatial field.

[0031] <Probabilistic Modeling of the Acoustic Signal Based on the Source Constraint Equation> In order to approximate the spatial derivatives of the observation signal by spatial differences, the microphone array shown in FIG. 2 is assumed below. However, any arrangement of microphones suffices as long as the spatial derivatives of the observation signal can be approximated by spatial differences, and the following theory is not limited to the arrangement of FIG. 2. With the microphone array of FIG. 2, the signal f0,j at the reference point and the spatial difference in each direction can be obtained at each time tj using seven microphones. Here j represents the index of sample time.

[0032] At this time, equation (9) can be expressed as

n_x f_{x,j} + n_y f_{y,j} + n_z f_{z,j} − (1/c) ∂f_{0,j}/∂t − (1/R) f_{0,j} = 0,   (10)

where i = x, y, z, and n_x, n_y, n_z are the components of n in the x, y, z directions, respectively. Transposing the left side of equation (10) to the right side,

0 = −n_x f_{x,j} − n_y f_{y,j} − n_z f_{z,j} + (1/c) ∂f_{0,j}/∂t + (1/R) f_{0,j}   (11)

is obtained. Here, f_{0,j}, f_{x,j}, f_{y,j}, f_{z,j} are assumed to be signals obtained by windowing with a window function. Assuming that the influence of the end points of the clipped interval can be ignored, equation (11) is expressed as

0 = −n_x F_{x,m} − n_y F_{y,m} − n_z F_{z,m} + (iω_m/c) F_{0,m} + (1/R) F_{0,m},   (12)

where F_{0,m}, F_{x,m}, F_{y,m}, F_{z,m} are the discrete Fourier transforms of f_{0,j}, f_{x,j}, f_{y,j}, f_{z,j}, m is the discrete frequency index, and ω_m is the angular frequency of bin m.

[0039] In practice, the right side of equation (12) is not necessarily exactly zero due to the error involved in the difference approximation.
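The constraint above can be checked numerically. The following is a minimal, hedged sketch that evaluates the residual of the source constraint equation (9), n·∇f − (1/c)∂f/∂t − f/R, for a simulated spherical wave, using central spatial differences with a 0.03 m spacing in the style of FIG. 2; the source position, waveform g(t), and time instant are illustrative assumptions, not values from the patent.

```python
import numpy as np

c = 343.0          # sound speed [m/s]
d = 0.03           # microphone spacing [m] (illustrative)
rs = np.array([1.0, 0.3, -0.2])   # hypothetical source position [m]

def g(t):                          # arbitrary source waveform
    return np.sin(2 * np.pi * 500.0 * t)

def f(pos, t):                     # spherical wave, Eq. (1)
    dist = np.linalg.norm(rs - pos)
    return g(t - dist / c) / (4 * np.pi * dist)

t = 0.01
origin = np.zeros(3)
R = np.linalg.norm(rs)             # distance to the source
n = rs / R                         # unit vector toward the source

# Central differences in x, y, z approximate the spatial derivatives.
grad = np.array([(f(origin + d * e, t) - f(origin - d * e, t)) / (2 * d)
                 for e in np.eye(3)])
dt = 1e-7
ft = (f(origin, t + dt) - f(origin, t - dt)) / (2 * dt)

# Residual of the source constraint equation (9).
residual = n @ grad - ft / c - f(origin, t) / R
```

The residual is small relative to the magnitudes of the individual terms; the remaining error comes from the finite-difference approximation, which is exactly the error the probabilistic model of the next section absorbs.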
Therefore, the right side of equation (11) is replaced with error variables as in [0040][0041], which are assumed to be mutually independent normal random variables with zero mean (random variables following a complex normal distribution) [0042][0043]. Also, each frequency component of the observation signal at the observation point is assumed to be a normal random variable with mean 0 and variance σ²_{0,m}; this corresponds to assuming [0044][0045].

[0046] Here, let y be the vector obtained by arranging these components [0047][0048]. Equation (13) can then be written in the form [0049][0050], where the matrix is given by [0051][0052]. From Eqs. (14) and (16), y obeys a complex normal distribution [0055][0056] with the mean and variance-covariance matrix [0053][0054]. The covariance is therefore expressed as [0057][0058], and from equation (22) the probability density function of y is obtained as [0059][0060]. From the above, the problem of localizing a single sound source given the observed spectrum and its spatial differences reduces to the maximum likelihood estimation problem [0061][0062].

[0063] <Localization Algorithm for Multiple Sound Sources Utilizing the Sparsity of Sound Sources> In many real-world acoustic signals such as speech and musical tones, the time-frequency components are sparse. Therefore, even when a plurality of sound sources are mixed simultaneously, it can often be assumed that only one sound source is dominant at each time-frequency point. Based on this sparsity assumption for the time-frequency components of the sound sources and the probabilistic modeling above, the probability distribution of the observed signal in the presence of a plurality of sound sources and noise can be derived.

[0064] Let l be the time index of the extracted signal frame, and k = 0, ..., K, where k = 0 corresponds to noise and k ≠ 0 corresponds to a point sound source.
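The likelihood above is built from zero-mean complex normal densities. As a hedged sketch (the patent's concrete covariance structure is not reproduced in the translation), the log-density of a zero-mean circular complex Gaussian can be evaluated as follows:

```python
import numpy as np

def complex_gaussian_logpdf(y, Sigma):
    """log N_c(y; 0, Sigma) = -d*ln(pi) - ln|Sigma| - y^H Sigma^{-1} y."""
    d = y.shape[0]
    _, logdet = np.linalg.slogdet(Sigma)
    quad = np.real(np.conj(y) @ np.linalg.solve(Sigma, y))
    return -d * np.log(np.pi) - logdet - quad
```

Maximizing such a log-likelihood over the candidate direction n and distance R (which enter through the covariance) is the single-source localization problem the text reduces to.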
Also, let the position of point sound source k be given. Assuming the sparsity of the time-frequency components of all sound sources including noise, only the z_{m,l}-th source at frequency m and time l has non-zero power, and the power of the other sources is 0. At this time, the conditional probability density function of the time-frequency components of the observed signal and its spatial differences (hereinafter, the observed signal), given z_{m,l}, is given by [0065][0066], where [0067][0068]. The variance-covariance matrix of the time-frequency components of the noise is the product of a normalized variance-covariance matrix model that depends only on frequency and a noise power that also depends on time, and is represented by [0069][0070]. A method for setting it when diffuse noise is assumed will be described later. Θ represents all unknown parameters. The probability density function (likelihood function) of the observed signal can be written as [0071][0072]. From the above, the problem of estimating the position of each sound source in the presence of multiple sound sources and noise, given the observed signal, reduces to the maximum likelihood estimation problem [0073][0074]. The global solution of this optimization problem cannot be obtained analytically, but stationary points can be searched for by the Expectation-Maximization (EM) algorithm.

[0075] <Non-Parametric Bayesian Modeling> In real environments, the number of sound sources is often unknown. The above formulation assumes that the number of sound sources K is known, but it is preferable to be able to perform sound source localization according to the actual number of sound sources, adapting the complexity of the model without assuming the number of sound sources. Therefore, in the present embodiment, the above generative model is extended to a Dirichlet process mixture model.
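Under the sparsity assumption, the E-step of such an EM algorithm computes a posterior responsibility over the dominant-source index z_{m,l} at each time-frequency point. A hedged, generic sketch (the per-source log-likelihood values here are placeholders, not the patent's complex-normal likelihoods):

```python
import numpy as np

def responsibilities(log_lik, log_pi):
    """P(z = k | y) proportional to pi_k * p(y | z = k), in the log domain."""
    a = log_lik + log_pi                    # shape (K+1,): noise + sources
    a = a - a.max()                         # subtract max to avoid overflow
    w = np.exp(a)
    return w / w.sum()
```

In the M-step, the mixture weights and source parameters are then re-estimated from these responsibilities.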
Let k = 0 denote the noise, let k = 1, 2, ... be the indices of the point sources, and let z_{m,l} be a random variable generated according to a countably infinite discrete distribution [0076][0077]. Here, the probability π0 that noise is dominant at point (m, l) (z_{m,l} = 0) is generated according to a beta distribution with given hyperparameters [0078][0079], and the probability πk that point source k ≠ 0 is dominant (z_{m,l} = k) is determined according to the stick-breaking process [0080][0081]. Since the expected value of πk generated by this process tends to decrease exponentially as k increases, the probability that the sound source corresponding to a large k becomes active is lower. Therefore, when inferring the parameters from the observed signal, this has the effect of explaining the observed signal with a mixture model using the minimum required number of sound source indices. In the above generative model, all unknown variables Θ are as follows: [0082][0083].

<Variational inference algorithm> By the above generative modeling, [0084][0085] can be written. Assuming the other prior distributions p(V), p(N), p(ρ), p(λ), p(ζ), and p(γ) as in [0086][0087], the joint distribution of the observed signal and the unknown variables Θ [0088][0089] can be written down explicitly. Here, the distributions appearing are the von Mises-Fisher distribution, the real normal distribution, and the gamma distribution. Although it is difficult to obtain the posterior distribution of Θ analytically, an approximate distribution can be obtained by iterative calculation based on the variational inference method. Variational inference takes as its objective function, equation (55), the Kullback-Leibler (KL) divergence [0090][0091] between a variational function q(Θ) satisfying nonnegativity and normalization and the posterior distribution p(Θ|Y).
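The stick-breaking construction behind the Dirichlet process mixture can be sketched as follows, with v_k ~ Beta(1, α) and π_k = v_k · Π_{j<k}(1 − v_j); the concentration parameter α and the truncation level are illustrative choices, not the patent's hyperparameter settings:

```python
import numpy as np

def stick_breaking(alpha, truncation, rng):
    # Break a unit stick: v_k is the fraction of the remaining stick taken
    # by component k; pi_k is the resulting mixture weight.
    v = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return v * remaining   # weights pi_1..pi_K (sum < 1; infinite tail omitted)
```

Since E[π_k] = α^(k−1)/(1+α)^k decays geometrically in k, sources with large indices rarely become active, which is exactly the model-complexity-adapting effect described above.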
Assuming that q(Θ) can be approximately factorized as [0092][0093], an approximate posterior distribution can be obtained by iteratively minimizing the objective function of equation (55) with respect to each factor. In addition, the following truncated approximation [0094][0095] is applied. This approximation does not mean that the complexity (number of sound sources) of the model is fixed; it means that the function space of q is limited to a certain range. Therefore, if one wishes to approximate as closely as possible, the truncation level can be enlarged to increase the range that q can take.

[0096] By setting the variation of equation (55) with respect to each q to 0, the following update equations are obtained. These are called the variational posterior distribution update equations. [0097][0098] Here, the expectation is taken with respect to q; it is given by an integral if X is a continuous variable and by a sum if it is a discrete variable. The remaining notation denotes the set of all elements of Θ except X. The derivation will be described in the next section; each variational posterior distribution update equation has the form [0099][0100].

<Derivation of the Variational Update Formulas> <Joint Distribution> log p(Θ, Y) can be written explicitly as follows according to the generative model established above: [0101][0102].

<Variational posterior distribution update equation for the sound source direction N> The term related to N is [0103][0104], and [0105][0106]; therefore, the variational posterior distribution update equation for N is [0107][0108]. The expected values are then as follows: [0109][0110].

<Variational posterior distribution update equation for ρ (the reciprocal of the sound source distance)> The term related to ρ is [0111][0112], and [0113][0114]; therefore, the variational posterior distribution update equation for ρ is [0115][0116]. The expected values are then as follows.
[0117][0118] <Variational posterior distribution update equation for the active sound source index Z> The term related to Z is [0119][0120], and [0121][0122] follows, where, when k = 0, [0123][0124], and when k ≠ 0, [0125][0126]. Therefore, the variational posterior distribution update equation for Z is [0127][0128], and the expected values are as follows: [0129][0130].

<Variational posterior distribution update equation for the stick-breaking ratio V> The term related to V is [0131][0132], and [0133][0134]; therefore, the variational posterior distribution update equation for V is [0135][0136], and the expected values are as follows: [0137][0138], where ψ denotes the digamma function.

[0139] <Variational posterior distribution update equation for the precision (inverse of the variance) λ of the error variables> The term related to λ is [0140][0141], and [0142][0143]; therefore, the variational posterior distribution update equation for λ is [0144][0145], and the expected value is [0146][0147].

<Variational posterior distribution update equation for ζ (the reciprocal of the source power)> The term related to ζ is [0148][0149], and [0150][0151]; therefore, the variational posterior distribution update equation for ζ is [0152][0153], and the expected value is [0154][0155].

<Variational posterior distribution update equation for γ (the reciprocal of the noise power)> The term related to γ is [0156][0157], and [0158][0159]; therefore, the variational posterior distribution update equation for γ is [0160][0161], and the expected values are as follows: [0162][0163].

<Noise Variance-Covariance Matrix W> Here, a setting example of the noise covariance matrix Wm is described, assuming the arrangement of seven microphones shown in FIG. 2, with the Fourier transforms of the corresponding signals considered.
[0164] The relationship between the signals and their transforms can be written as [0165][0166], from which the variance-covariance matrix follows. Therefore, for example, when spatially uncorrelated noise of equal power is assumed, the underlying covariance is an identity matrix, and the variance-covariance matrix is simply set as [0167][0168].

[0169] A sound field in which the energy density is uniform and the energy flow in every direction can be regarded as equally probable within a certain region is called a diffuse sound field, and it is known to approximate the sound field of a reverberant environment well. In a diffuse sound field, the spatial correlation coefficient between two points depends only on the distance d, and is given by [0170][0171]. Thus, assuming diffuse noise, in the array geometry of FIG. 2, [0172][0173] holds. Using this, the variance-covariance matrix can be substituted accordingly; in this case it becomes a diagonal matrix as in [0174][0175].

[0176] <Summary of the variational inference algorithm> The variational posterior distribution update equations for the variables are [0177][0178], and the update equations for the parameters of each distribution are given by [0179][0180]. Here, when k = 0, [0181][0182], and when k ≠ 0, [0183][0184]. The expected values appearing in the update equations are given as follows: [0185][0186].

<System Configuration> Next, an embodiment of the present invention will be described, taking as an example a case where the present invention is applied to a sound source localization apparatus that estimates the positions of a plurality of sound sources from acoustic signals input by a microphone array.

[0187] As shown in FIG. 3, the sound source localization apparatus 100 according to the embodiment of the present invention is configured by a computer including a CPU, a RAM, and a ROM storing a program for executing a sound source localization processing routine, and is functionally configured as follows.
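The diffuse-field spatial correlation mentioned above is, in the standard ideal-diffuse-field model, sin(ωd/c)/(ωd/c); the patent's exact expression for the corresponding equations is not reproduced in the translation, so the following is a hedged sketch of that standard form:

```python
import numpy as np

def diffuse_coherence(d, freq_hz, c=343.0):
    """Spatial correlation between two points at distance d [m] in an ideal
    diffuse sound field: sin(omega*d/c) / (omega*d/c)."""
    x = 2.0 * np.pi * freq_hz * d / c
    return np.sinc(x / np.pi)   # np.sinc(t) = sin(pi*t)/(pi*t)
```

At the 0.03 m spacing used in the experiments, the coherence is close to 1 at low frequencies (the microphones are much closer than a wavelength) and decays toward 0 as frequency rises, which is what makes the frequency-dependent normalized covariance model above meaningful.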
[0188] As shown in FIG. 3, the sound source localization apparatus 100 includes an input unit 10, an arithmetic unit 20, and an output unit 90.

[0189] The input unit 10 receives time-series data of acoustic signals (hereinafter, observation signals) in which sound source signals from a plurality of sound sources are mixed, output from the microphones of a microphone array such as that shown in FIG. 2.

[0190] The arithmetic unit 20 includes a spatial difference calculation unit 22, a time-frequency expansion unit 24, and a sound source position estimation unit 25.

[0191] The spatial difference calculation unit 22 acquires the observation signals f0,j at the reference-point microphone at each time tj from the observation signals output from the microphones of the microphone array, and also calculates the x, y, z spatial differences fx,j, fy,j, fz,j according to the following equations: [0192][0193].

The time-frequency expansion unit 24 calculates the observed time-frequency components F0,m at each frequency m from the observation signals f0,j at each time tj at the reference-point microphone obtained by the spatial difference calculation unit 22. In addition, the time-frequency expansion unit 24 calculates the observed time-frequency components Fx,m, Fy,m, Fz,m of each frequency m from the spatial differences fx,j, fy,j, fz,j in each direction x, y, z at each time tj obtained by the spatial difference calculation unit 22. In the present embodiment, a time-frequency expansion such as the short-time Fourier transform or the wavelet transform is used.

[0194] The sound source position estimation unit 25 operates on the observation data Y consisting of the observed time-frequency components Fx,m, Fy,m, Fz,m, F0,m of each frequency m acquired by the time-frequency expansion unit 24.
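The time-frequency expansion step can be sketched as a plain short-time Fourier transform: frame the signal, apply a window, and take the DFT of each frame. Frame length and hop below are illustrative values, not the patent's settings (the experiments use a 64 ms frame with 32 ms overlap at 32 kHz):

```python
import numpy as np

def stft(x, frame_len=2048, hop=1024):
    """Short-time Fourier transform of a 1-D signal.

    Returns an array of shape (n_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)
```

Applying the same transform to f0,j and to each spatial difference fx,j, fy,j, fz,j yields F0,m and Fx,m, Fy,m, Fz,m frame by frame.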
Given observation data Y consisting of the observed time-frequency components Fx,m, Fy,m, Fz,m, F0,m of each frequency m, the unknown variables Θ are: a variable N and a variable ρ representing the positions of the plurality of sound sources k; a variable Z representing an indicator of the index of the sound source that is dominant at each time for each frequency; a variable V for determining the probability πk that each sound source k is dominant; a variable λ representing the variance of the observed time-frequency components of each frequency for each of the plurality of directions; and variables ζ and γ representing the powers of the time-frequency components of each sound source k and of the noise. Taking as an objective function a function representing the divergence between the posterior distribution p(Θ|Y) and a variational function q(Θ), the sound source position estimation unit 25 estimates each parameter representing the distribution of each of the variables N, ρ, Z, V, λ, ζ, and γ so as to minimize the objective function based on the variational inference method, and thereby estimates the position of each sound source k.

[0195] Specifically, the sound source position estimation unit 25 includes a variable updating unit 28 and a convergence determination unit 30.

[0196] The variable updating unit 28 first sets initial values of the parameters (hereinafter, variational parameters) representing the distributions of the variables N, ρ, Z, V, λ, ζ, and γ.

[0197] The variable updating unit 28 then updates the variational parameters in accordance with the above equations (132) to (146), based on the observation data Y and the current variational parameters.

[0198] The convergence determination unit 30 determines whether or not a predetermined convergence determination condition is satisfied, and when the convergence determination condition is not satisfied, the processing of the variable updating unit 28 is repeated.
When the convergence determination condition is satisfied, the convergence determination unit 30 obtains the direction vector n <(k)> and the source distance R <(k)> of each sound source k based on the final variational parameters, and the output unit 90 outputs them as the estimation result of the position of each sound source k. [0199] As the convergence determination condition, reaching a predetermined number of iterations may be used. Alternatively, the condition that the parameters have almost stopped changing (the ratio of each parameter before and after one update is close to 1) may be used. [0200] <Operation of Sound Source Localization Apparatus> Next, the operation of the sound source localization apparatus 100 according to the present embodiment will be described. [0201] When the input unit 10 receives the time-series data of the observation signals output from the microphones of the microphone array, the sound source localization apparatus 100 executes the sound source localization processing routine shown in FIG. 4. [0202] First, in step S120, the observation signal f0,j at the reference-point microphone at each time tj is obtained from the time-series data of the observation signals input from the microphones of the microphone array, and the spatial differences fx,j, fy,j, fz,j in the x, y, and z directions are calculated. [0203] In step S121, the observation time-frequency components F0,m at each frequency m are calculated from the observation signals f0,j at each time tj at the reference-point microphone obtained in step S120. Likewise, the observation time-frequency components Fx,m, Fy,m, Fz,m of each frequency m are calculated from the spatial differences fx,j, fy,j, fz,j in the x, y, and z directions at each time tj. [0204] In step S122, initial values of the variational parameters are set. [0205] In step S124, the variational parameters are updated according to equations (132) to (146) based on the observation data Y calculated in step S121 and the initial or previously updated parameter values.
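The iterative estimation of steps S122 to S125 follows a generic variational-inference loop: initialize the variational parameters, apply the update rules, and stop when the parameters stabilize. A minimal sketch follows; since equations (132) to (146) are not reproduced in this translation, `update_variational_params` is a hypothetical placeholder, and the parameter names are illustrative only.

```python
import numpy as np

def update_variational_params(Y, params):
    """Hypothetical placeholder for equations (132)-(146); a real
    implementation would update the distributions of N, rho, Z, V,
    lambda, zeta, and gamma here."""
    return params

def estimate_sources(Y, num_candidate_sources=6, max_iter=200, tol=1e-6):
    """Sketch of the variational-inference loop (steps S122-S125).

    Y: observation data, e.g. the time-frequency components
    F0,m, Fx,m, Fy,m, Fz,m stacked as an array of shape (4, M).
    """
    rng = np.random.default_rng(0)
    # Step S122: set initial values of the variational parameters.
    params = {
        "rho": rng.standard_normal((num_candidate_sources, 3)),
        "pi": np.full(num_candidate_sources, 1.0 / num_candidate_sources),
    }
    for _ in range(max_iter):
        old = {k: v.copy() for k, v in params.items()}
        # Step S124: update the variational parameters.
        params = update_variational_params(Y, params)
        # Step S125: stop once the parameters have almost stopped changing.
        change = max(np.max(np.abs(v - old[k])) for k, v in params.items())
        if change < tol:
            break
    # Step S126: direction vector n and source distance R per source.
    R = np.linalg.norm(params["rho"], axis=1)
    n = params["rho"] / R[:, None]
    return n, R
```

Under this sketch, the direction vectors are unit-normalized and the distances are the norms of the estimated position variables; the number of candidate sources may exceed the true number, since the variational formulation is meant to suppress spurious sources automatically.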
[0206] In step S125, it is determined whether a predetermined convergence determination condition is satisfied. If the condition is not satisfied, the process returns to step S124; if it is satisfied, the process proceeds to step S126. [0207] In step S126, the direction vector n <(k)> and the source distance R <(k)> of each sound source k are obtained as the estimation result of the position of each sound source k based on the final variational parameters obtained in step S124 and are output by the output unit 90, whereupon the sound source localization processing routine ends. [0208] <Experiment> To verify the performance of the proposed method, a numerical simulation of multiple-sound-source localization in a reverberant environment was performed. A two-dimensional model in the x and y directions was used. The room size was 6 m × 10 m × 4 m, and seven microphones were arranged at intervals of 0.03 m as shown in FIG. 2 above. The reflection coefficients of the wall surfaces were 0.7308 and 0.4566 (corresponding, by Sabine's reverberation formula, to reverberation times of 0.5 s and 0.2 s, respectively). The sampling frequency of the microphones was 32 kHz, the frame length of the short-time Fourier transform was 64 ms (with 32 ms overlap), and the total lengths of the observation signals were 2779 ms and 1665 ms. The number of sound sources was three, and the sound source positions were (1, 0, 0) m, (-0.5, 0.87, 0) m, and (-0.5, -0.87, 0) m from the center of the room (see FIG. 5). In FIG. 5, the crosses indicate the sound source positions, and the circles indicate the center position of the microphone array. The sound source signals were taken from the speech-rate-variation speech database (SRV-DB).
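With the settings above (32 kHz sampling, 64 ms frames with 32 ms overlap), the time-frequency expansion of step S121 could be realized with a short-time Fourier transform. The following is a minimal sketch using `scipy.signal.stft`; the sinusoidal input is a stand-in for a real observation signal, and the description also permits other expansions such as wavelet transforms.

```python
import numpy as np
from scipy.signal import stft

fs = 32000                     # sampling frequency [Hz]
nperseg = int(0.064 * fs)      # 64 ms frame -> 2048 samples
noverlap = int(0.032 * fs)     # 32 ms overlap -> 1024 samples

# Stand-in for the observation signal f0,j at the reference microphone;
# in the apparatus this would come from the input unit 10.
t = np.arange(int(2.779 * fs)) / fs       # 2779 ms observation length
f0 = np.sin(2 * np.pi * 440 * t)

# Step S121: observation time-frequency components F0,m per frame.
freqs, frames, F0 = stft(f0, fs=fs, nperseg=nperseg, noverlap=noverlap)
print(F0.shape)   # (frequency bins, time frames)
```

The spatial differences fx,j, fy,j, fz,j would be transformed in exactly the same way to obtain Fx,m, Fy,m, Fz,m.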
As the proposed method, we evaluated the variational-inference-based method of this embodiment (VBEM), a method based on the EM algorithm assuming the correct number of sound sources, three (EM1), and a method based on the EM algorithm assuming an incorrect number of sound sources, six (EM2), and compared them with the conventional MUSIC method. In the EM-algorithm methods, stationary noise was assumed. In every method, the presence and direction of each sound source are estimated using a detection threshold: an estimated direction whose angular error from a true sound source direction is within ±τ is counted as correct, a true direction with no matching estimate is counted as a deletion error, and a detected direction that does not match any true sound source direction is counted as an insertion error; the F-score was then calculated from these counts. For each τ, the highest F-score obtained as the detection threshold is varied is plotted in FIGS. 6 to 9. FIG. 6 shows the localization accuracy of each method under the condition of a reverberation time of 0.5 s and an observation length of 2779 ms. FIG. 7 shows the same for a reverberation time of 0.5 s and an observation length of 1665 ms. FIG. 8 shows the same for a reverberation time of 0.2 s and an observation length of 2779 ms. FIG. 9 shows the same for a reverberation time of 0.2 s and an observation length of 1665 ms. In most cases, the proposed variational-inference-based method was confirmed to achieve more accurate localization than the other methods. [0209] As described above, according to the sound source localization apparatus of the present embodiment, given observation data Y consisting of the observation time-frequency components of each frequency at the reference microphone and the observation time-frequency components of each frequency for each of a plurality of directions,
the parameters representing the distributions of a variable N and a variable ρ representing the positions of the plurality of sound sources k, a variable Z representing an indicator of the sound source that is dominant at each time for each frequency, a variable V for determining the probability πk that each sound source k is dominant, a variable λ representing the variance of the observation time-frequency component of each frequency for each of the plurality of directions, and variables ζ and γ representing the powers of the time-frequency components of each sound source k and of the noise at each frequency are estimated, based on the variational inference method, so as to minimize an objective function given by a divergence representing the difference between the posterior distribution p(Θ|Y) of the unknown variables Θ, which include these variables, and a variational function q(Θ), and the position of each sound source k is thereby estimated. Consequently, a plurality of sound sources can be localized simultaneously even when the number of sound sources is unknown. [0210] The present invention is not limited to the above-described embodiment, and various modifications and applications are possible without departing from the scope of the present invention. [0211] For example, although the above-described sound source localization apparatus contains a computer system, the "computer system" also includes a homepage providing environment (or display environment) when the WWW system is used. [0212] Furthermore, although the present invention has been described as an embodiment in which the program is installed in advance, the program may also be provided stored in a computer-readable recording medium.
[0213] DESCRIPTION OF SYMBOLS 10 input unit, 20 arithmetic unit, 22 spatial difference calculation unit, 24 time-frequency expansion unit, 25 sound source position estimation unit, 28 variable update unit, 30 convergence determination unit, 90 output unit, 100 sound source localization apparatus
