JP2007235646
Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2007235646
In conventional independent component analysis, there is the problem that performance degrades when the number of sound sources exceeds the number of microphones. The conventional l1 norm minimization method assumes that there is no noise other than the sound sources, so there is the problem that its performance degrades in environments containing non-speech noise such as echo or reverberation. In the present invention, the power of the noise component contained in the separated sound is added as a cost value to the l1 norm, which is the cost function used by the l1 norm minimization method when separating sounds. Furthermore, whereas the l1 norm minimization method defines its cost function on the premise that speech has no correlation in the time direction, the present invention defines the cost function on the assumption that speech is correlated in the time direction, and provides a mechanism that favors the selection of solutions exhibiting such temporal correlation. [Selected figure] Figure 2
Sound source separation device, method and program
[0001]
The present invention relates to a sound source separation apparatus, method, and program for
causing a computer to execute the method of separating sound for each sound source by using
two or more microphones when a plurality of sound sources are arranged at different positions
in space. About.
[0002]
04-05-2019
1
A method based on independent component analysis has been used as a technique for separating sound for each sound source (see, for example, Non-Patent Document 1). Independent component analysis is a sound source separation technique that exploits the fact that the source signals of the sound sources are mutually independent. In independent component analysis, a linear filter whose dimension equals the number of microphones is generated for each sound source. If the number of sound sources is equal to or less than the number of microphones, the original signals can be completely restored; sound source separation based on independent component analysis is therefore an effective technique in that case.
[0003]
As a sound source separation technique for the case where the number of sound sources exceeds the number of microphones, there is the l1 norm minimization method, which exploits the fact that the probability distribution of the power spectrum of speech is not Gaussian but close to a Laplace distribution (see, for example, Non-Patent Document 2).
[0004]
A. Hyvaerinen, J. Karhunen, and E. Oja, "Independent component analysis," John Wiley & Sons,
2001.
[0005]
P. Bofill and M. Zibulevsky, "Blind separation of more sources than mixtures using sparsity of their short-time Fourier transform," Proc. ICA 2000, pp. 87-92, 2000/06.
Noboru Murata, Introductory Independent Component Analysis, Tokyo Denki University Press, pp. 215-216, 2004/07.
[0006]
In independent component analysis, there is the problem that performance degrades when the number of sound sources exceeds the number of microphones. Since the dimension of the filter coefficients used in independent component analysis equals the number of microphones, the number of filtering constraints must be equal to or less than the number of microphones. If the number of sound sources is equal to or less than the number of microphones, then even when a specific sound source is emphasized and all other sound sources are suppressed, the number of constraints is at most the number of microphones, so a filter satisfying the constraints can be generated. However, when the number of sound sources exceeds the number of microphones, the number of constraints also exceeds the number of microphones, so no filter satisfying the constraints can be generated, and a sufficiently separated signal cannot be obtained with the resulting filter. The l1 norm minimization method assumes that there is no noise other than the sound sources, so there is the problem that its performance degrades in environments containing non-speech noise such as echo or reverberation.
[0007]
The outline of a representative invention disclosed in the present application is as follows: a sound source separation device comprising A/D conversion means for converting the analog signals from a microphone array of at least two microphone elements into digital signals; band division means for dividing the digital signals into bands; error minimum solution calculation means that, for each band, considers vectors in which some of the sound sources take the value 0 and, for each set of vectors having the same zero-valued elements, outputs the solution minimizing the error between the input signal and the estimated signal calculated from the vector and steering vectors registered in advance; optimal model calculation means that, for each band, selects among the error minimum solutions obtained for each number of zero-valued sound sources the solution minimizing the weighted sum of the error and the lp norm value; and signal synthesis means for converting the selected solution into a time-domain signal; or a program for executing the above.
[0008]
According to the present invention, even in an environment where the number of sound sources exceeds the number of microphones and background noise and reverberation are present, it is possible to separate the sound of each sound source with a high S/N ratio. As a result, calls with easily audible sound become possible, for example in hands-free calling.
[0009]
The hardware configuration of this embodiment is shown in FIG. 1. The central processing unit 1 carries out all the calculations in this embodiment. The storage device 2 is a working memory configured of, for example, RAM; all variables used during calculation are allocated in the storage device 2. All data and programs used during calculation are stored in the storage device 3, configured of, for example, ROM. The microphone array 4 is composed of at least two microphone elements, each of which measures an analog sound pressure value. The number of microphone elements is M. The A/D conversion device 5 converts (samples) the analog signals into digital signals and can synchronously sample M or more channels. The analog sound pressure value of each microphone element captured by the microphone array 4 is sent to the A/D conversion device 5. The number of sounds to be separated, denoted N, is set in advance and stored in the storage device 2 or 3. Since the processing amount increases with N, a value is chosen that can be handled by the processing capacity of the central processing unit 1.
[0010]
A block diagram of the software of this embodiment is shown in FIG. 2. In the present invention, the power of the noise component contained in the separated sound is added as a cost value to the l1 norm, which is the cost function used by the l1 norm minimization method when separating sounds. The optimal model selection unit 205 outputs the solution that minimizes the weighted sum of the noise signal power and the l1 norm value. Furthermore, whereas the l1 norm minimization method defines its cost function on the premise that speech has no correlation in the time direction, the present invention defines the cost function on the assumption that speech is correlated in the time direction, and provides a mechanism that favors the selection of solutions exhibiting such temporal correlation.
[0011]
Each means is implemented in the central processing unit 1. The A/D conversion means 201 converts the analog sound pressure values into digital data for each channel. Conversion to digital data in the A/D conversion device 5 is performed at a sampling rate set in advance. For example, when the sampling rate is 11025 Hz, the data are converted into digital data 11025 times per second at equal intervals. Let the converted digital data be x(t, j), where t is the discretized time: the point at which the A/D conversion device 5 starts conversion is t = 0, and t is incremented by one at each sampling. j is the index of the microphone element; for example, the 100th sample of the 0th microphone element is written x(100, 0). The contents of x(t, j) are written to an area reserved in the RAM 2 at each sampling. Alternatively, the sampled data may be stored temporarily in a buffer in the A/D conversion device 5 and transferred to the reserved area of the RAM 2 each time a certain amount of data has accumulated. The area in the RAM 2 to which the contents of x(t, j) are written is itself denoted x(t, j).
[0012]
The band division means 202 applies a Fourier transform or wavelet analysis to the data from t = τ * frame_shift to t = τ * frame_shift + frame_size, converting them into a band-division signal. This is performed for each microphone element j = 1 … M. The converted band-division signal is expressed as (Equation 1), a vector whose elements are the signals of the individual microphone elements.
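As an illustration, the band division by short-time Fourier transform described above can be sketched as follows; the frame_size and frame_shift defaults and the Hann window are assumed example settings, not values fixed by the patent:

```python
import numpy as np

def band_division(x, frame_size=512, frame_shift=256):
    """Band division means (202) sketch: split a multichannel signal
    x[t, j] into band-division signals X[f, tau, j] with a windowed
    short-time Fourier transform."""
    T, M = x.shape
    n_frames = 1 + (T - frame_size) // frame_shift
    window = np.hanning(frame_size)
    # One frame per shift; each frame is windowed and transformed per channel.
    X = np.empty((frame_size // 2 + 1, n_frames, M), dtype=complex)
    for tau in range(n_frames):
        start = tau * frame_shift
        frame = x[start:start + frame_size] * window[:, None]
        X[:, tau, :] = np.fft.rfft(frame, axis=0)
    return X
```

Here f indexes the frequency band and τ the frame, matching the notation of (Equation 1).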
[0013]
[0014]
Here, f is an index denoting the band division number.
[0015]
Sounds such as human voice and music rarely take large amplitude values; they are sparse signals that take the value 0 in many places.
The speech signal can therefore be better approximated by a Laplace distribution, which has a high probability of taking the value 0, than by a Gaussian distribution.
When the speech signal is approximated by a Laplace distribution, its log likelihood can be regarded as the l1 norm value with its sign inverted. Noise signals in which echoes, reverberation, and background noise are mixed can be approximated by a Gaussian distribution. The log likelihood of the noise signal contained in the input signal can therefore be regarded as the squared error between the input signal and the speech signal with its sign inverted. From the viewpoint of MAP estimation, which seeks the most probable solution, the solution maximizing the sum of the log likelihood of the noise signal and the log likelihood of the speech signal is the maximum likelihood solution; that is, the signal minimizing the weighted sum of the squared error and the l1 norm value can be regarded as the maximum likelihood solution. Finding such a solution exactly is difficult, however, so some approximation is needed. For example, the l1 norm minimization method assumes that there is no error with respect to the input signal and takes as its solution the signal minimizing the weighted sum of the l1 norm values. In an environment with echoes, reverberation, and background noise, however, it cannot be assumed that there is no error with respect to the input signal; this approximation is therefore rough and leads to degraded separation performance. In the present invention, the error with respect to the input signal is assumed to exist, and the signal minimizing the weighted sum of that error and the l1 norm value is obtained approximately. As described above, sounds such as human voice and music are sparse signals that rarely take large amplitude values; that is, they can be regarded as having many zero-valued elements. It is therefore assumed that, at each time and frequency, only a number of sound sources smaller than the number of microphones have non-zero amplitude. In addition, the l1 norm value becomes smaller as the number of zero-valued elements increases and larger as it decreases, so it can be regarded as a measure of sparseness (see Non-Patent Document 3). Therefore, when the numbers of zero-valued sound sources are equal, the l1 norm values are approximated as a constant. Under this approximation, with N sound sources, the solution candidates among the N-dimensional complex vectors taking the value 0 are the solutions with the smallest error with respect to the input signal.
[0016]
First, in the error minimum solution calculation means 203, an error minimum solution is calculated for each L-order sparse set according to
[0017]
[0018]
(Equation 2).
An L-order sparse set is the set of N-dimensional complex vectors in which only L elements take non-zero values and the remaining elements take the value 0.
The calculated error minimum solution is the maximum likelihood solution for the source signals within the L-order sparse set. The error minimum solution is an N-dimensional complex vector whose elements are estimates of the original signal of each sound source. A(f) is an M-row, N-column complex matrix whose columns hold the steering vectors describing how sound propagates from each sound source position to the microphone elements; for example, the first column of A(f) is the steering vector from the first sound source to the microphone array. A(f) is calculated and output by the direction search unit 209. The means 203 then calculates the error minimum solution for each L from L = 1 to M. When L = M, a plurality of error minimum solutions is calculated; in this case, all of them are output as the L = M error minimum solutions. Here the error minimum solution is determined for each number of zero-valued sound sources in the N-dimensional complex vector, but solutions may also be obtained for each N-dimensional complex vector having the same zero-valued elements, not only the same number of them. However, even when the zero-valued elements differ, the l1 norm value can be approximated as constant as long as the numbers of zero-valued sound sources are equal, so it is considered sufficient to obtain the error minimum solution for each number of zero-valued sound sources.
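A minimal numerical sketch of the error minimum solution calculation (means 203) follows, assuming (Equation 2) is the least-squares criterion min ||x(f, τ) − A(f)s||² over each candidate support, and that L counts the non-zero (active) sources, consistent with the L = 1 case discussed later; all names are illustrative:

```python
import numpy as np
from itertools import combinations

def error_minimum_solutions(x_f, A_f, L):
    """For each candidate set of L active (non-zero) sources out of N,
    zero the remaining sources and find the vector s minimizing
    ||x - A s||^2. Returns a list of (solution, squared_error) pairs."""
    M, N = A_f.shape
    solutions = []
    for support in combinations(range(N), L):
        s = np.zeros(N, dtype=complex)
        # Least-squares estimate of the active sources only.
        s_sub, *_ = np.linalg.lstsq(A_f[:, list(support)], x_f, rcond=None)
        s[list(support)] = s_sub
        err = np.linalg.norm(x_f - A_f @ s) ** 2
        solutions.append((s, err))
    return solutions
```

When the true mixture is generated by a single source, the support containing that source yields (near-)zero error, so the error minimum solution recovers the source signal.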
[0019]
It is also possible to apply (Equation 3) instead of (Equation 2) above.
[0020]
[0021]
Ω L,i is the set of N-dimensional complex vectors within the L-order sparse set whose zero-valued elements are the same.
The power of speech has a positive correlation in the time direction.
Therefore, a sound source that takes a large value at some τ is likely to take a large value at τ ± k as well. Developing this idea, the solution whose moving average of the error term in the τ direction is smaller can be regarded as closer to the true solution. That is, a solution closer to the true solution can be obtained by using the moving average of the error term as a new error term for each model Ω L,i. γ(m) is the weight of the moving average. This configuration facilitates the selection of solutions that are correlated in the time direction. When a moving average is used to obtain the error minimum solution, the error minimum solution must be calculated for each N-dimensional complex vector having the same zero-valued elements, not merely the same number of them. This is because, even when the numbers of zero-valued sound sources are equal, vectors with different zero-valued elements cannot be approximated as equivalent once the positive correlation in the time direction is taken into account.
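The moving-average error term of (Equation 3) can be sketched as follows, assuming γ is supplied as a symmetric weight array of odd length centered on the current frame (an illustrative representation, not the patent's exact form):

```python
import numpy as np

def moving_average_error(err, gamma):
    """Replace the per-frame error term err[tau] by its weighted moving
    average over neighbouring frames; gamma[k] weights offset k - K,
    where K is the half-width. Edge frames reuse the boundary value."""
    K = (len(gamma) - 1) // 2
    padded = np.pad(err, K, mode="edge")
    return np.array([padded[t:t + len(gamma)] @ gamma
                     for t in range(len(err))])
```

With γ concentrated at the center the error term is unchanged; spreading γ over neighbors smooths the error in the τ direction, which is what favors temporally correlated solutions.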
[0022]
In the lp norm calculation means 204, the error minimum solution calculated for each L-order sparse set is used to calculate the lp norm value according to
[0023]
[0024]
(Equation 4).
[0025]
[0026]
is
[0027]
[0028]
the i-th element of (Equation 6).
p is a parameter set in advance between 0 and 1.
The lp norm value is a measure of the sparseness of (Equation 6) (see Non-Patent Document 3); the smaller the value, the more elements of (Equation 6) lie near 0.
Since speech is sparse, (Equation 6) can be considered closer to the true solution as the value of (Equation 4) becomes smaller.
That is, (Equation 4) can be used as a criterion for selecting the true solution.
As with the calculation of the error minimum solution, the calculated lp norm value of (Equation 4) can also be replaced by a moving average.
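A sketch of the lp norm value of (Equation 4); the default p = 0.5 is an assumed example value within the stated range 0 < p < 1:

```python
import numpy as np

def lp_norm_value(s, p=0.5):
    """lp 'norm' value sum_i |s_i|^p with 0 < p < 1; smaller values
    indicate a sparser solution vector s."""
    return float(np.sum(np.abs(s) ** p))
```

Note that for p < 1 this is not a true norm (the triangle inequality fails), which is exactly why it rewards concentrating energy in few elements.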
[0029]
[0030]
It is also possible to replace it with its moving average.
Since the power of speech has a positive correlation in the time direction, replacing it with the moving average yields a solution closer to the true solution.
The power of speech changes only slowly in the time direction; therefore, a sound source that takes a large amplitude value in a certain frame can be considered to take large amplitude values in the preceding and succeeding frames as well. In the optimal model selection unit 205, the optimal solution among the error minimum solutions obtained for each L-order sparse set is found
[0031]
[0032]
[0033]
based on (Equation 8) and (Equation 9), which output the solution minimizing the weighted sum of the error term and the lp norm term.
This solution is also the maximum a posteriori solution. As with the error minimum solution and the lp norm value, the terms of the equations for finding the optimal solution, (Equation 8) and (Equation 9), can be replaced by moving-average values.
[0034]
[0035]
It is possible to replace them with these moving-average values.
[0036]
In the conventional processing corresponding to 205, L = 2 is fixed.
There is also a method that does not select among the solutions up to L = M but always uses the L = 1 solution as the optimal solution; this method, however, has the problem of generating musical noise.
The L = 1 solution is a solution in which, for every f and τ, all sound sources but one take the value 0. Sometimes the values of all sound sources but one are indeed close to 0; when this holds, the L = 1 solution is the optimal solution, but it does not always hold. If L = 1 is always assumed, the solution cannot be calculated when two or more sound sources take large values, and musical noise occurs. Because 205 finds the optimal solution among the error minimum solutions obtained for each L-order sparse set, it has, so to speak, a mechanism for determining which sparse set among L = 1 to M is optimal; even when two or more sound sources take values larger than 0, the solution can be calculated and the occurrence of musical noise can be suppressed.
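The selection in the optimal model selection unit 205 can be sketched as follows, assuming (Equation 8) and (Equation 9) reduce to minimizing the weighted sum err + λ · lp-norm over all candidate solutions gathered for L = 1 to M; λ and p are assumed parameters:

```python
import numpy as np

def select_optimal_solution(candidates, lam=1.0, p=0.5):
    """Among error-minimum solutions from all sparse sets, pick the one
    minimizing the weighted sum of squared error and lp norm value.
    candidates: list of (solution_vector, squared_error) pairs."""
    def cost(item):
        s, err = item
        return err + lam * np.sum(np.abs(s) ** p)
    return min(candidates, key=cost)[0]
```

A sparse candidate with moderate error can thus beat a denser candidate with slightly lower error, which is the trade-off the weighted sum encodes.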
[0037]
In the signal synthesis means 206, the optimal solution calculated for each band,
[0038]
[0039]
is converted by an inverse Fourier transform or inverse wavelet transform into the time-domain signal
[0040]
[0041]
thereby obtaining an estimated time-domain signal for each sound source.
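A sketch of the signal synthesis means 206 by inverse FFT and overlap-add, with frame parameters assumed to match those of the band division example:

```python
import numpy as np

def synthesize(S, frame_size=512, frame_shift=256):
    """Convert one source's band-division signal S[f, tau] back into a
    time-domain signal: inverse real FFT per frame, then overlap-add."""
    n_bins, n_frames = S.shape
    out = np.zeros(frame_shift * (n_frames - 1) + frame_size)
    for tau in range(n_frames):
        frame = np.fft.irfft(S[:, tau], n=frame_size)
        start = tau * frame_shift
        out[start:start + frame_size] += frame
    return out
```

In practice the analysis window must be compensated for exact reconstruction; this sketch omits that step for brevity.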
In the sound source localization unit 207, the sound source direction is calculated based on
[0042]
[0043]
Ω is a search range of the sound source, which is set in advance in the ROM 3.
[0044]
[0045]
is the steering vector from the sound source direction θ to the microphone array, with its magnitude normalized to 1.
Assuming the original signal is s(f, τ), the sound arriving from the sound source direction θ is observed as
[0046]
[0047]
(Equation 13).
It is assumed that (Equation 13) for all the sound sources included in Ω is stored in advance in the ROM 3.
The direction power calculation unit 208 calculates P(θ) with
[0048]
[0049]
δ is a function that takes the value 1 only when the equation given as its argument holds, and 0 otherwise.
The direction search unit 209 performs a peak search on P(θ) to calculate the sound source directions, and outputs the M-row, N-column steering vector matrix A(f) whose columns are the steering vectors of the calculated sound source directions. In the peak search, P(θ) may be sorted in descending order and the top N taken as the sound source directions, or the top N among the directions where P(θ) exceeds its neighboring directions (i.e., where it takes a local maximum) may be taken as the sound source directions. The means 203 then uses this matrix as A(f) in (Equation 2) to find the error minimum solution. Even if the sound source directions are unknown in advance, they can be estimated automatically through the search for A(f) in the direction search unit 209, and sound source separation can be performed.
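The local-maximum variant of the peak search in the direction search unit 209 might look like the following sketch; representing P(θ) as an array over a discretized direction grid is an assumption:

```python
import numpy as np

def peak_search_directions(P, thetas, N):
    """Pick the N local maxima of the direction-power histogram P with
    the largest values and return their directions from the grid thetas."""
    # Interior points that strictly exceed both neighbours are peaks.
    peaks = [i for i in range(1, len(P) - 1)
             if P[i] > P[i - 1] and P[i] > P[i + 1]]
    peaks.sort(key=lambda i: P[i], reverse=True)
    return [thetas[i] for i in peaks[:N]]
```

The simpler alternative described above, taking the top N values of P(θ) directly, can pick adjacent bins of one broad peak; restricting to local maxima avoids that.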
[0050]
The processing flow of this embodiment is shown in FIG. 3. The input sound is received as sound pressure values at the microphone elements. The sound pressure value of each microphone element is converted into digital data, and band division processing with frame_size is performed while shifting the data by frame_shift (S1). The sound source directions are estimated using only τ = 1 … K of the obtained band-division signals, and the steering vector matrix A(f) is calculated (S2). Using A(f), the true solutions of the band-division signals for τ = 1 … are searched for, and the obtained optimal solutions are synthesized to obtain an estimated signal for each sound source (S3). The estimated signal for each sound source synthesized in (S3) is the output signal. This output signal contains the sound separated for each sound source, so the utterance content of each sound source is easy to hear.
[0051]
FIG. 1 shows the hardware configuration of the present invention. FIG. 2 is a block diagram of the software of the present invention. FIG. 3 is the processing flow diagram of the present invention.
Explanation of reference signs
[0052]
DESCRIPTION OF SYMBOLS: 1 ... central processing unit; 2 ... storage device configured of RAM etc.; 3 ... storage device configured of ROM etc.; 4 ... microphone array consisting of at least two microphone elements; 5 ... A/D conversion device for converting analog sound pressure values into digital data; 201 ... A/D conversion means; 202 ... band division means for converting digital sound pressure data into band-division signals; 203 ... error minimum solution calculation means; 204 ... lp norm calculation means; 205 ... optimal model selection unit; 206 ... signal synthesis means; 207 ... sound source localization unit; 208 ... direction power calculation unit; 209 ... direction search unit; S1 ... input sound reception and band division processing; S2 ... steering vector matrix calculation processing; S3 ... signal synthesis processing.