Patent Translate Powered by EPO and Google

Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2007235646

In conventional independent component analysis, performance degrades when the number of sound sources exceeds the number of microphones. The conventional l1 norm minimization method assumes that there is no noise other than the sound sources, so its performance degrades in environments containing noise other than speech, such as echo or reverberation. In the present invention, the power of the noise component contained in the separated sound is added as a cost value to the l1 norm, which is the cost function used by the l1 norm minimization method when separating sounds. Furthermore, whereas the l1 norm minimization method defines its cost function on the premise that speech is uncorrelated in the time direction, the present invention defines its cost function on the assumption that speech is correlated in the time direction, and provides a mechanism that favors solutions that are consistent in the time direction.

[Selected figure] Figure 2

Sound source separation device, method and program

[0001] The present invention relates to a sound source separation apparatus and method, and to a program for causing a computer to execute the method, for separating sound for each sound source using two or more microphones when a plurality of sound sources are located at different positions in space.

04-05-2019 1

[0002] A method based on independent component analysis exists as a technique for separating sound for each sound source (see, for example, Non-Patent Document 1).
Independent component analysis is a sound source separation technique that exploits the fact that the source signals of the sound sources are statistically independent of one another. Independent component analysis uses, for each sound source, a linear filter whose dimension equals the number of microphones. When the number of sound sources is equal to or less than the number of microphones, the original signals can be completely restored, so sound source separation based on independent component analysis is effective in that case.

[0003] As a sound source separation technique for the case where the number of sound sources exceeds the number of microphones, there is the l1 norm minimization method, which exploits the fact that the probability distribution of the power spectrum of speech is not Gaussian but close to a Laplace distribution (see, for example, Non-Patent Document 2).

[0004] A. Hyvaerinen, J. Karhunen, and E. Oja, "Independent Component Analysis," John Wiley & Sons, 2001.

[0005] P. Bofill and M. Zibulevsky, "Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform," Proc. ICA 2000, pp. 87-92, 2000/06. Noboru Murata, Introductory Independent Component Analysis, Tokyo Denki University Press, pp. 215-216, 2004/07.

[0006] In independent component analysis, performance degrades when the number of sound sources exceeds the number of microphones. Since the dimension of the filter coefficients used in independent component analysis equals the number of microphones, the number of filtering constraints that can be satisfied is at most the number of microphones.
If the number of sound sources is less than or equal to the number of microphones, a filter that emphasizes a specific sound source while suppressing all the others can be generated, because the number of constraints is at most the number of microphones and all constraints can be satisfied. However, when the number of sound sources exceeds the number of microphones, the number of constraints also exceeds the number of microphones, so no filter satisfying the constraints can be generated, and a sufficiently separated signal cannot be obtained with the resulting filter. The l1 norm minimization method assumes that there is no noise other than the sound sources, so its performance degrades in environments containing noise other than speech, such as echo or reverberation.

[0007] The outline of a representative invention disclosed in the present application is as follows. A sound source separation device comprising: A/D conversion means for converting analog signals from a microphone array comprising at least two microphone elements into digital signals; band division means for dividing the digital signals into bands; error-minimum-solution calculation means which, for each band and for each group of vectors (of dimension at least the number of microphone elements) in which some sources take the value 0 and which share the same zero-valued elements, outputs the solution minimizing the error between the input signal and the estimated signal computed from the vector and pre-registered steering vectors; an optimal model calculation unit which, for each band, selects from the error minimum solutions obtained for each number of zero-valued sources the solution minimizing the weighted sum of the error and the lp norm value; and signal synthesis means for converting the selected solution into a time-domain signal; or a program for executing the same.
[0008] According to the present invention, even in an environment where the number of sound sources exceeds the number of microphones and some background noise and reverberation are present, it is possible to separate the sound of each sound source with a high S/N ratio. As a result, speech that is easy to hear can be obtained, for example in hands-free calls.

[0009] The hardware configuration of this embodiment is shown in FIG. 1. The central processing unit 1 carries out all the calculations in this embodiment. The storage device 2 is a working memory, configured for example of RAM, on which all variables used during calculation are allocated. All data and programs used during calculation are stored in the storage device 3, configured for example of ROM. The microphone array 4 consists of at least two microphone elements; each microphone element measures an analog sound pressure value. The number of microphone elements is M. The A/D conversion device 5 converts (samples) analog signals into digital signals and can synchronously sample M or more channels. The analog sound pressure value of each microphone element captured by the microphone array 4 is sent to the A/D conversion device 5. The number of sounds to be separated is set in advance, stored in the storage device 2 or 3, and denoted N. Since the processing load increases with N, a value that can be handled by the processing capacity of the central processing unit 1 is set.

[0010] A block diagram of the software of this embodiment is shown in FIG. 2. In the present invention, the power of the noise component contained in the separated sound is added as a cost value to the l1 norm, which is the cost function used by the l1 norm minimization method when separating sounds.
The optimal model selection unit 205 outputs the solution minimizing the weighted sum of the noise signal power and the l1 norm value. Furthermore, whereas the l1 norm minimization method defines its cost function on the premise that speech is uncorrelated in the time direction, the present invention defines its cost function on the assumption that speech is correlated in the time direction, and provides a mechanism that favors solutions that are consistent in the time direction.

[0011] Each means is implemented on the central processing unit 1. The A/D conversion means 201 converts the analog sound pressure value of each channel into digital data. Conversion to digital data in the A/D conversion device 5 is performed at a preset sampling rate. For example, when the sampling rate is 11025 Hz, the signal is converted into digital data 11025 times per second at equal intervals. Let the converted digital data be x(t, j), where t is discretized time and j is the index of the microphone element. The time at which the A/D conversion device 5 starts conversion is t = 0, and t is incremented by one at each sample. For example, the 100th sample of the 0th microphone element is written x(100, 0). The content of x(t, j) is written at each sample to an area allocated in the RAM 2; alternatively, the sampled data may be buffered inside the A/D conversion device 5 and transferred to the allocated area of the RAM 2 each time a certain amount has accumulated. The area in the RAM 2 to which x(t, j) is written is likewise denoted x(t, j).

[0012] The band division means 202 applies a Fourier transform or wavelet analysis to the data from t = τ * frame_shift to t = τ * frame_shift + frame_size, converting it into a band-divided signal. This is done for each microphone element j = 1 ... M.
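As a concrete illustration of the band division means 202, the following is a minimal sketch of a short-time Fourier transform over a multichannel signal. The Hann window, the 440/880 Hz test tones, and all function names are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def band_divide(x, frame_shift, frame_size):
    """Split a multichannel signal x (shape: samples x M) into
    band-divided frames X(tau, f, j) via a windowed FFT.
    Returns an array of shape (num_frames, num_bands, M)."""
    num_samples, M = x.shape
    window = np.hanning(frame_size)
    num_frames = (num_samples - frame_size) // frame_shift + 1
    num_bands = frame_size // 2 + 1  # one-sided spectrum of a real signal
    X = np.zeros((num_frames, num_bands, M), dtype=complex)
    for tau in range(num_frames):
        start = tau * frame_shift
        frame = x[start:start + frame_size] * window[:, None]
        X[tau] = np.fft.rfft(frame, axis=0)
    return X

# two-channel test signal, one second at 11025 Hz (the patent's example rate)
t = np.arange(11025) / 11025.0
x = np.stack([np.sin(2 * np.pi * 440 * t),
              np.sin(2 * np.pi * 880 * t)], axis=1)
X = band_divide(x, frame_shift=128, frame_size=512)
```

Each row X[tau] corresponds to one band-divided vector per frequency bin, with one complex element per microphone, matching the vector form of (Equation 1).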
The converted band-divided signal is expressed as (Equation 1), a vector whose elements are the signals of the individual microphone elements.

[0013]

[0014] Here, f is an index denoting the band division number.

[0015] Sounds such as human voice and music rarely take large amplitude values; they are sparse signals that take the value 0 most of the time. Therefore, a speech signal is better approximated by a Laplace distribution, which has a high probability of taking the value 0, than by a Gaussian distribution. When the speech signal is approximated by a Laplace distribution, its log likelihood can be regarded as the negative of the l1 norm value. A noise signal mixing echoes, reverberation, and background noise can be approximated by a Gaussian distribution, so the log likelihood of the noise signal contained in the input signal can be regarded as the negative of the squared error between the input signal and the speech signal. From the viewpoint of MAP estimation, which seeks the most probable solution, the solution maximizing the sum of the log likelihood of the noise signal and the log likelihood of the speech signal is sought; equivalently, the signal minimizing the weighted sum of the squared error with respect to the input signal and the l1 norm value can be regarded as that solution. However, such a solution is difficult to find exactly, so some approximation is necessary. For example, the l1 norm minimization method assumes no error with respect to the input signal and takes as its solution the signal that minimizes the l1 norm value subject to exact agreement with the input signal. However, in an environment with echoes, reverberation, and background noise, the assumption of no error with respect to the input signal does not hold, so this is a rough approximation that degrades separation performance.
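The MAP argument in [0015] can be sketched as follows; this is a hedged reconstruction under the stated Gaussian-noise and Laplace-speech assumptions, with σ and b as assumed scale parameters that do not appear in the patent itself.

```latex
\log p(s \mid x) = \log p(x \mid s) + \log p(s) + \mathrm{const}
                 = -\frac{1}{2\sigma^{2}}\,\lVert x - A s \rVert_{2}^{2}
                   - \frac{1}{b}\,\lVert s \rVert_{1} + \mathrm{const},
```

so maximizing the posterior is equivalent to minimizing the weighted sum of the squared error and the l1 norm value,

```latex
\hat{s} = \arg\min_{s}\; \lVert x - A s \rVert_{2}^{2}
          + \lambda\,\lVert s \rVert_{1},
\qquad \lambda = \frac{2\sigma^{2}}{b}.
```

The conventional l1 norm minimization method corresponds to forcing the error term to zero exactly, which is the rough approximation criticized above.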
Therefore, in the present invention, assuming that an error with respect to the input signal exists, a signal that minimizes the weighted sum of the error with respect to the input signal and the l1 norm value is obtained approximately. As described above, sounds such as human voice and music are sparse signals that rarely take large amplitude values; that is, the signal can be regarded as having many elements with the value 0. It is therefore assumed that, at each time and frequency, only fewer sound sources than the number of microphones have nonzero amplitude values. Furthermore, the l1 norm value decreases as the number of zero-valued elements increases and increases as it decreases, so it can be regarded as a measure of sparseness (Non-Patent Document 3). Accordingly, the l1 norm value is approximated as constant among solutions with an equal number of zero-valued sources. Under this approximation, when the number of sound sources is N, the solution candidate among the N-dimensional complex vectors with a given zero pattern is the one with the smallest error with respect to the input signal.

[0016] First, the error-minimum-solution calculation means 203 calculates, according to

[0017]

[0018]

an error minimum solution for each L-order sparse set. An L-order sparse set is the set of N-dimensional complex vectors in which only L elements take nonzero values. The calculated error minimum solution is the maximum likelihood solution of the source signals within that L-order sparse set. The error minimum solution is an N-dimensional complex vector, each element of which is an estimate of the source signal of the corresponding sound source. A(f) is an M-row, N-column complex matrix whose columns describe how sound propagates (the steering vectors) from each sound source position to the microphone elements.
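The error minimum solution for each sparse set can be sketched as a least-squares fit restricted to each support of L active sources. This is a minimal illustration under assumed values (the mixing matrix, the test vector, and all names are hypothetical, not from the patent).

```python
import numpy as np
from itertools import combinations

def min_error_solutions(x, A, L):
    """For each support S of L active sources, solve the least-squares
    problem min ||x - A[:, S] s_S||^2 and embed the result in an
    N-dimensional vector with zeros outside S.
    Returns a list of (squared_error, solution_vector) pairs."""
    M, N = A.shape
    out = []
    for S in combinations(range(N), L):
        AS = A[:, S]                               # M x L submatrix
        sS, *_ = np.linalg.lstsq(AS, x, rcond=None)
        err = float(np.linalg.norm(x - AS @ sS) ** 2)
        s = np.zeros(N, dtype=complex)
        s[list(S)] = sS
        out.append((err, s))
    return out

A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.8]], dtype=complex)     # M = 2 mics, N = 3 sources
x = A @ np.array([2.0, 0.0, 1.0], dtype=complex)   # true solution is 2-sparse
sols = min_error_solutions(x, A, L=2)
best = min(sols, key=lambda p: p[0])
```

Note that with L = M every support yields an exact fit (zero error), so several minimum-error solutions arise; this matches the patent's remark below that a plurality of error minimum solutions is obtained when L = M.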
For example, the first column of A(f) is the steering vector from the first sound source to the microphone array. A(f) is calculated and output by the direction search unit 209. The means 203 then calculates the error minimum solution for each L from L = 1 to M. When L = M, a plurality of error minimum solutions is obtained; in that case, all of them are output as the L = M error minimum solutions. Here, the error minimum solution is obtained for each number of zero-valued sources, but solutions may also be obtained for each N-dimensional complex vector whose zero-valued elements coincide, not merely whose number of zero-valued sources coincides. However, since the approximation that the l1 norm value is constant holds as long as the number of zero-valued sources is equal, even when the zero-valued elements differ, it suffices to obtain the error minimum solution for each number of zero-valued sources.

[0019] It is also possible to apply (Equation 3) instead of (Equation 2) above.

[0020]

[0021] Ω_{L,i} is the set of N-dimensional complex vectors within the L-order sparse set whose zero-valued elements coincide. The power of speech has a positive correlation in the time direction. Therefore, a sound source that takes a large value at some τ is likely to take a large value at τ ± k as well. Developing this idea, a solution whose error term has a smaller moving average in the τ direction can be regarded as closer to the true solution. That is, by taking the moving average of the error term as a new error term for each model Ω_{L,i}, a solution closer to the true solution can be obtained. γ(m) is the weight of the moving average. This configuration facilitates selection of solutions that are consistent in the time direction.
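The moving-average error term described above can be sketched as a weighted smoothing of the per-frame error sequence of one model Ω_{L,i}. The γ weights, the reflection padding at the sequence edges, and the test values are illustrative assumptions, not the patent's exact definition.

```python
import numpy as np

def moving_average_error(err, gamma):
    """Replace the per-frame error err(tau) with its weighted moving
    average sum_m gamma(m) * err(tau + m).  gamma is a symmetric weight
    vector of odd length (center entry = weight for m = 0).
    Sequence edges are handled by reflecting the error sequence."""
    k = len(gamma) // 2
    padded = np.pad(err, k, mode="reflect")
    # pre-flip the kernel so np.convolve performs a correlation
    return np.convolve(padded, gamma[::-1], mode="valid")

err = np.array([0.0, 0.0, 9.0, 0.0, 0.0])   # one isolated error spike
gamma = np.array([0.25, 0.5, 0.25])
smoothed = moving_average_error(err, gamma)
```

The spike is spread over neighboring frames, so a model that fits badly in only one isolated frame is no longer preferred over one that fits consistently across time.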
When a moving average is used to obtain the error minimum solution, the error minimum solution must be calculated for each N-dimensional complex vector whose zero-valued elements coincide, not merely whose number of zero-valued sources coincides. This is because, even when the numbers of zero-valued sources are equal, vectors with different zero-valued elements cannot be approximated as having a positive correlation in the time direction.

[0022] The lp norm calculation means 204 uses the error minimum solution calculated for each L-order sparse set,

[0023]

[0024]

and calculates the lp norm value by (Equation 4).

[0025]

[0026]

[0027]

[0028]

Here, (Equation 5) is the i-th element of (Equation 6), and p is a parameter set in advance between 0 and 1. The lp norm value is a measure of the sparseness of (Equation 6) (see Non-Patent Document 3): the smaller the value, the more elements of (Equation 6) lie near 0. Since speech is sparse, (Equation 6) can be regarded as closer to a true solution the smaller the value of (Equation 4) is. That is, (Equation 4) can be used as a selection criterion when selecting the true solution. As with the calculation of the error minimum solution, the calculated lp norm value of (Equation 4) can also

[0029]

[0030]

be replaced by its moving average. Since the power of speech has a positive correlation in the time direction, a solution closer to the true solution is obtained by this replacement. The power of speech changes only slowly in the time direction, so a sound source that takes a large amplitude value in a certain frame can be considered to take a large amplitude value in the preceding and succeeding frames as well. The optimal model selection unit 205 finds the optimal solution among the error minimum solutions obtained for each L-order sparse set based on

[0031]

[0032]

[0033]

(Equation 8) and (Equation
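The lp norm value with 0 < p < 1 can be sketched as follows; the value p = 0.5 and the two test vectors are illustrative assumptions. The example shows why it measures sparseness: the two vectors have the same l1 norm, but the sparser one has the smaller lp norm value.

```python
import numpy as np

def lp_norm_value(s, p=0.5):
    """Sparseness measure sum_i |s_i|^p with 0 < p < 1
    (the smaller the value, the sparser the vector)."""
    return float(np.sum(np.abs(s) ** p))

sparse = np.array([3.0, 0.0, 0.0])   # energy concentrated in one source
spread = np.array([1.0, 1.0, 1.0])   # same l1 norm, energy spread out
```

Here lp_norm_value(sparse) = 3**0.5 while lp_norm_value(spread) = 3, so the criterion prefers the concentrated (sparse) solution even though the l1 norms are equal, consistent with the constant-l1 approximation used above.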
9); these select the solution that minimizes the weighted average of the error term and the lp norm term. This solution is also the maximum posterior probability solution. As with the error minimum solution and the lp norm value, the terms in the equations for finding the optimal solution, (Equation 8) and (Equation 9), can

[0034]

[0035]

be replaced by their moving average values.

[0036] In the processing corresponding to 205 in the conventional method, there is an approach that does not select among solutions up to L = M but always uses the L = 1 solution as the optimal solution; this approach has the problem of generating musical noise. The L = 1 solution is a solution in which, for each f and τ, all sources but one take the value 0. Sometimes the true solution is indeed close to 0 for all but one sound source; in that case the L = 1 solution is optimal, but this does not always hold. If L = 1 is always assumed, then whenever two or more sound sources take large values the correct solution cannot be calculated, and musical noise occurs. Since 205 has a mechanism for determining which sparse set, L = 1 to M, is optimal in order to find the optimal solution among the error minimum solutions obtained for each L-order sparse set, the solution can be calculated even when two or more sound sources take values significantly larger than 0, and the occurrence of musical noise can be suppressed.

[0037] The signal synthesis means 206 applies an inverse Fourier transform or inverse wavelet transform to the optimal solution calculated for each band

[0038]

[0039]

to return it to a time-domain signal

[0040]

[0041]

By doing this, an estimated time-domain signal of each sound source is obtained. The sound source localization unit 207 calculates the sound source direction based on

[0042]

[0043]

Ω is the search range of the sound source, set in advance in the ROM 3.
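The model selection step of unit 205 can be sketched as picking, among the minimum-error candidates of the different sparse sets, the one minimizing the weighted sum of error and lp norm value. The weight, p, and the two hypothetical candidates are illustrative assumptions, not values from the patent.

```python
import numpy as np

def select_optimal(candidates, weight=1.0, p=0.5):
    """Among candidate (error, s) pairs -- the minimum-error solutions
    obtained for each sparse set -- pick the one minimizing
    error + weight * sum_i |s_i|^p (weighted sum of error and lp norm)."""
    def cost(pair):
        err, s = pair
        return err + weight * float(np.sum(np.abs(s) ** p))
    return min(candidates, key=cost)

# two hypothetical candidates: a 1-sparse fit with some residual error,
# and a 2-sparse fit with zero error but a larger lp norm value
cands = [
    (0.5, np.array([2.0, 0.0, 0.0])),
    (0.0, np.array([1.5, 0.0, 1.2])),
]
best_err, best_s = select_optimal(cands, weight=1.0, p=0.5)
```

With these numbers the 1-sparse candidate wins (cost ≈ 1.91 vs ≈ 2.32): unlike always forcing L = 1 or always taking the zero-error L = M fit, the selection trades the two terms off, which is how the text above explains the suppression of musical noise.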
[0044]

[0045]

This is the steering vector from the sound source direction θ to the microphone array, normalized to magnitude 1. Assuming that the source signal is s(f, τ), the sound arriving from the sound source direction θ is observed as

[0046]

[0047]

It is assumed that the steering vectors for all sound source directions included in Ω are stored in advance in the ROM 3. The direction power calculation unit 208 performs the calculation of

[0048]

[0049]

δ is a function that is 1 only when the equation given as its argument holds, and 0 otherwise. The direction search unit 209 peak-searches P(θ) to calculate the sound source directions, and outputs the M-row, N-column steering vector matrix A(f) whose columns are the steering vectors of the calculated sound source directions. In the peak search, P(θ) may be sorted in descending order and the top N directions taken as the sound source directions, or the top N among the directions where P(θ) exceeds its values in the neighboring directions (i.e., the local maxima) may be taken as the sound source directions. The means 203 then uses this as A(f) in (Equation 2) to find the error minimum solution. Thus, even when the sound source directions are unknown in advance, the direction search unit 209 can estimate them automatically by searching for A(f), and sound source separation can be performed.

[0050] The processing flow of this embodiment is shown in FIG. 3. The input sound is received as a sound pressure value at each microphone element. The sound pressure value of each microphone element is converted into digital data, and band division processing of frame_size is performed while shifting the data by frame_shift (S1). The sound source directions are estimated using only τ = 1 ... K of the obtained band-divided signals, and the steering vector matrix A(f) is calculated (S2). A(f) is then used to search for the true solution of the τ = 1 ... band-divided signals.
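The direction estimation of units 207 to 209 can be sketched as follows for a uniform linear array. The array geometry, the speed of sound, and the delay-and-sum power formula are illustrative assumptions standing in for the patent's (Equation 14), not its exact definition.

```python
import numpy as np

def steering_vector(theta, freq, mic_pos, c=340.0):
    """Unit-norm steering vector for direction theta (radians) at one
    frequency, for microphones at positions mic_pos (meters)."""
    delays = mic_pos * np.sin(theta) / c
    a = np.exp(-2j * np.pi * freq * delays)
    return a / np.linalg.norm(a)

def direction_power(X, freq, mic_pos, thetas):
    """P(theta) = sum over frames of |a(theta)^H x|^2 for one band."""
    return np.array([
        sum(abs(np.vdot(steering_vector(th, freq, mic_pos), x)) ** 2
            for x in X)
        for th in thetas
    ])

mic_pos = np.array([0.0, 0.05, 0.10])       # 3 elements, 5 cm spacing
freq = 1000.0
true_theta = np.deg2rad(30.0)
a_true = steering_vector(true_theta, freq, mic_pos)
X = [a_true * amp for amp in (1.0, 0.8, 1.2)]   # frames from one source
thetas = np.deg2rad(np.arange(-90, 91, 5))       # search range Omega
P = direction_power(X, freq, mic_pos, thetas)
est = np.rad2deg(thetas[np.argmax(P)])
```

Peak-searching P(θ) over the grid recovers the source direction; stacking the steering vectors of the top N peaks as columns gives the M-row, N-column matrix A(f) used by the means 203.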
Then, the obtained optimum solutions are synthesized to obtain an estimated signal for each sound source (S3). The estimated signal for each sound source synthesized in (S3) is the output signal. This output signal is a signal in which the sound has been separated for each sound source, so the utterance content of each sound source is easy to hear.

[0051] FIG. 1 is a diagram showing the hardware configuration of the present invention. FIG. 2 is a block diagram of the software of the present invention. FIG. 3 is a processing flow diagram of the present invention.

[0052] DESCRIPTION OF SYMBOLS: 1 ... central processing unit; 2 ... storage device configured of RAM or the like; 3 ... storage device configured of ROM or the like; 4 ... microphone array consisting of at least two microphone elements; 5 ... A/D conversion device for converting analog sound pressure values into digital data; 201 ... A/D conversion means; 202 ... band division means for converting digital sound pressure data into band-divided signals; 203 ... error-minimum-solution calculation means; 204 ... lp norm calculation means; 205 ... optimal model selection unit; 206 ... signal synthesis means; 207 ... sound source localization unit; 208 ... direction power calculation unit; 209 ... direction search unit; S1 ... input sound reception and band division processing; S2 ... steering vector matrix calculation processing; S3 ... signal synthesis processing.
