JP2014059180

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2014059180
Abstract: To provide a sound source direction estimation device, a sound source direction
estimation method, and a sound source direction estimation program capable of improving the
processing efficiency of sound source direction estimation. A transfer function storage unit 112
stores transfer functions from a sound source for each direction of the sound source. A
number-of-layers calculation unit 107 calculates the number of layers to be searched and the
search interval for each layer based on a desired search range and a desired spatial resolution
for searching the direction of the sound source. A sound source localization unit (peak search
unit 110, STF processing unit 111) searches the search range using the transfer functions at each
search interval, estimates the direction of the sound source based on the search result, and
updates the search range and the search interval, based on the estimated direction, until the
number of layers calculated by the calculation unit is reached. [Selected figure] Figure 3
Sound source direction estimation device, sound source direction estimation method, and sound
source direction estimation program
[0001]
The present invention relates to a sound source direction estimation device, a sound source
direction estimation method, and a sound source direction estimation program.
[0002]
It has been proposed that speech recognition be performed on the acoustic signal emitted from a
sound source.
In such speech recognition, a noise signal is separated from the acoustic signal, or the noise
signal is suppressed, to extract the acoustic signal of the recognition target. Then, for example,
voice recognition is performed on the extracted acoustic signal. In such a system, in order to
extract the voice to be recognized, the direction in which the acoustic signal is emitted is either
known in advance or must be estimated.
[0003]
For example, in Patent Document 1, the type of the sound source is identified based on the
acoustic features of the input acoustic signal, and the sound source direction is estimated for
acoustic signals of the identified type. Moreover, in Patent Document 1, the sound source
direction is estimated (hereinafter referred to as sound source localization) using the GEVD
(generalized eigenvalue decomposition)-MUSIC method or the GSVD (generalized singular value
decomposition)-MUSIC method. When the GEVD-MUSIC method or the GSVD-MUSIC method is
used in this way, the calculation efficiency of sound source localization in the sound source
estimation device is increased.
[0004]
Japanese Unexamined Patent Application Publication No. 2012-42465
[0005]
However, in such a sound source estimation device, the transfer functions associated with each
sound source direction within the search range are obtained in advance by measurement or
calculation and stored in the device.
In such a sound source estimation device, the spatial spectrum is calculated using the transfer
functions stored in the storage unit, and the sound source direction is determined based on the
calculated spatial spectrum. Therefore, in order to increase the estimation accuracy of the sound
source direction, transfer functions for a large number of sound source directions are required.
As a result, the conventional sound source estimation apparatus has the problem that the
calculation amount is large and the calculation efficiency is poor when the estimation accuracy
of the sound source direction is to be increased.
[0006]
The present invention has been made in view of the above points, and an object of the present
invention is to provide a sound source direction estimation device, a sound source direction
estimation method, and a sound source direction estimation program capable of improving the
processing efficiency of sound source direction estimation.
[0007]
(1) In order to achieve the above object, a sound source direction estimation device according to
one aspect of the present invention includes: a transfer function storage unit that stores a
transfer function from the sound source for each direction of the sound source; a calculation
unit that calculates the number of layers to be searched and the search interval for each layer
based on a desired search range and a desired spatial resolution for searching for the direction
of the sound source; and a sound source localization unit that searches the search range using
the transfer functions at each search interval, estimates the direction of the sound source based
on the search result, and, based on the estimated direction, updates the search range and the
search interval until the number of layers calculated by the calculation unit is reached, thereby
estimating the direction of the sound source.
[0008]
(2) Further, in the sound source direction estimation device according to an aspect of the
present invention, the sound source localization unit may search the n-th layer (n is an integer
of 1 or more) of a predetermined search range at the calculated search interval, update at least
one of the search intervals in the search range as the search range of the (n+1)-th layer based on
the search result, update the search interval of the (n+1)-th layer based on the updated search
range and the desired spatial resolution, and repeatedly update and estimate the direction of the
sound source using the updated search range and search interval of the (n+1)-th layer and the
transfer function corresponding to the direction, until the number of layers (n+1) reaches the
number of layers calculated by the calculation unit.
[0009]
(3) Further, in the sound source direction estimation device according to an aspect of the
present invention, the calculation unit may calculate the number of layers and the search
interval so that the number of searches in each layer is equal across all the layers.
[0010]
(4) Further, in the sound source direction estimation device according to an aspect of the
present invention, the calculation unit may calculate the number of layers and the search
interval such that the total number of searches over all the layers is minimized.
[0011]
(5) Further, in the sound source direction estimation device according to an aspect of the
present invention, the sound source localization unit may determine whether the transfer
function corresponding to the azimuth of the search interval is stored in the transfer function
storage unit; read the transfer function corresponding to the direction of the search interval
from the transfer function storage unit when it is determined to be stored; calculate an
interpolated transfer function by interpolating the stored transfer functions when it is
determined not to be stored; and estimate the direction of the sound source using the read
transfer function or the calculated interpolated transfer function.
[0012]
(6) Further, in the sound source direction estimation device according to one aspect of the
present invention, the calculation unit may calculate the number of layers and the search
interval so as to minimize the calculation cost, which is the sum of the search cost for the
number of searches in the search range and the interpolation cost for the interpolation.
[0013]
(7) In order to achieve the above object, a sound source direction estimation method according
to an aspect of the present invention is a sound source direction estimation method in a sound
source direction estimation device, including: a calculation procedure in which a calculation
unit calculates the number of layers to be searched and the search interval for each layer based
on a desired search range and a desired spatial resolution for searching for the direction of a
sound source; a procedure in which a sound source localization unit searches the search range
at each search interval using the transfer functions from the sound source stored in the transfer
function storage unit for each direction of the sound source, and estimates the direction of the
sound source based on the search result; and a procedure in which the sound source localization
unit updates the search range and the search interval, based on the estimated direction of the
sound source, until the number of layers calculated in the calculation procedure is reached,
thereby estimating the direction of the sound source.
[0014]
(8) In order to achieve the above object, a sound source direction estimation program according
to an aspect of the present invention causes a computer of a sound source direction estimation
device to execute: a calculation procedure of calculating the number of layers to be searched
and the search interval for each layer based on a desired search range and a desired spatial
resolution for searching for the direction of a sound source; a procedure of searching the search
range at each search interval using the transfer functions from the sound source stored in the
transfer function storage unit for each direction of the sound source, and estimating the
direction of the sound source based on the search result; and a procedure of updating the search
range and the search interval, based on the estimated direction of the sound source, until the
number of layers calculated in the calculation procedure is reached, thereby estimating the
direction of the sound source.
[0015]
According to the aspects (1), (2), (7), and (8) of the present invention, the calculation unit
calculates the number of layers to be searched and the search interval for each layer based on
the desired search range and the desired spatial resolution for searching for the direction of the
sound source, so the processing time required to estimate the direction of the sound source can
be shortened.
According to the aspects (3) and (4) of the present invention, since the calculation unit can
calculate the number of layers to be searched and the search interval for each layer with a small
amount of calculation, the processing time required to estimate the direction of the sound
source can be shortened.
According to the aspect (5) of the present invention, when the transfer function corresponding
to the direction to be searched is not stored in the transfer function storage unit, the sound
source localization unit estimates the direction using an interpolated transfer function obtained
by interpolating the stored transfer functions, and when it is stored, using the read transfer
function; therefore, the direction of the sound source can be estimated accurately.
According to the aspect (6) of the present invention, since the calculation unit calculates the
number of layers and the search interval so as to minimize the calculation cost, which is the sum
of the search cost for the number of searches and the interpolation cost for the interpolation,
the processing time required to estimate the direction of the sound source can be shortened.
[0016]
FIG. 1 is a diagram explaining the environment in sound source direction estimation.
FIG. 2 is a diagram showing the outline of processing of the sound processing apparatus
according to the first embodiment.
FIG. 3 is a block diagram of the sound source localization unit according to the first
embodiment.
FIG. 4 is a block diagram of the sound source separation unit according to the first embodiment.
FIG. 5 is a flowchart showing the procedure of the hierarchical search processing according to
the first embodiment.
FIG. 6 is a diagram explaining the procedure of the hierarchical search processing according to
the first embodiment.
FIG. 7 is a diagram explaining the calculation procedure of the layer values according to the
first embodiment.
FIG. 8 is a diagram explaining the calculation procedure of the number of layers and the
intervals according to the second embodiment.
FIG. 9 is a diagram explaining the search cost and interpolation cost of the second layer
according to the second embodiment.
FIG. 10 is a diagram explaining the evaluation conditions.
FIG. 11 is a diagram explaining the search points in the evaluation.
FIG. 12 is a diagram showing an example of the results of evaluating the error of the transfer
function using PEE, SD, and SDR while changing the azimuth ψ2, and the linearity of the
interpolation coefficient.
FIG. 13 is a diagram showing an example of the average error of sound source
direction-of-arrival estimation with and without interpolation.
FIG. 14 is a diagram showing an example of the results of evaluating the calculation cost.
FIG. 15 is a diagram showing an example of the results of speech recognition on the acoustic
signals separated for each sound source.
[0017]
Hereinafter, embodiments of the present invention will be described in detail. The present
invention is not limited to these embodiments, and various modifications are possible within the
scope of its technical idea.
[0018]
First Embodiment
First, an outline of the present embodiment will be described. FIG. 1 is a diagram explaining the
environment in the direction estimation of a sound source (hereinafter referred to as sound
source direction estimation). In FIG. 1, the left-right direction of the page is the Y direction, and
the direction perpendicular to the Y direction is the X direction. Reference numerals 2a and 2b
denote sound sources. The sound source direction estimation device (sound source direction
estimation unit) 11 of the present embodiment performs sound source localization and sound
source separation in order to recognize such a plurality of sound sources. In the example shown
in FIG. 1, the sound source 2a is in the direction of the angle a1 counterclockwise with respect
to the X axis, and the sound source 2b is in the direction of the angle a2 clockwise with respect
to the X axis.
[0019]
FIG. 2 is a diagram showing the outline of processing of the sound processing apparatus 1
according to the present embodiment. As shown in FIG. 2, the sound processing apparatus 1
includes a sound source direction estimation unit 11, an acoustic feature extraction unit
(Acoustic Feature Extraction) 14, a speech recognition unit (Automatic Speech Recognition) 15,
and a recognition result output unit 16. The sound source direction estimation unit 11 includes
a sound source localization unit (Sound Source Localization) 12 and a sound source separation
unit (Sound Source Separation) 13.
[0020]
The sound source localization unit 12 has an acoustic signal input unit and performs, for
example, a Fourier transform on the acoustic signals collected by a plurality of microphones.
The sound source localization unit 12 estimates the sound source directions of the plurality of
Fourier-transformed acoustic signals (hereinafter referred to as sound source localization). The
sound source localization unit 12 outputs information indicating the result of the sound source
localization to the sound source separation unit 13. The sound source separation unit 13
performs sound source separation of the target sound and the noise on the information
indicating the result of the sound source localization input from the sound source localization
unit 12. The sound source separation unit 13 outputs a signal corresponding to each separated
sound source to the acoustic feature quantity extraction unit 14. The target sound is, for
example, speech emitted by a plurality of speakers. Noise is sound other than the target sound,
for example, wind noise or sound emitted by other devices placed in the room where the sound
was collected.
[0021]
The acoustic feature quantity extraction unit 14 extracts the acoustic feature quantity of the
signal corresponding to each sound source input from the sound source separation unit 13, and
outputs information indicating the extracted acoustic feature quantity to the speech recognition
unit 15. When the sound sources include speech uttered by a human being, the speech
recognition unit 15 performs speech recognition based on the acoustic feature quantity input
from the acoustic feature quantity extraction unit 14 and outputs the recognition result to the
recognition result output unit 16. The recognition result output unit 16 is, for example, a display
device or an acoustic signal output device. The recognition result output unit 16 displays
information based on the recognition result input from the speech recognition unit 15 on, for
example, a display unit. The sound processing apparatus 1 or the sound source direction
estimation unit 11 may be incorporated in, for example, a robot, a car, an aircraft (including a
helicopter), or a portable terminal. The portable terminal is, for example, a portable telephone
terminal, a portable information terminal, or a portable game terminal.
[0022]
In the present embodiment, in order to improve the estimation accuracy of the sound source
direction while reducing the calculation cost, the sound source direction is estimated
hierarchically while improving the spatial resolution. To estimate the sound source direction
hierarchically, the sound source direction estimation unit 11 (FIG. 2) first divides the entire
predetermined search range into coarse search intervals, searches at the coarse intervals, and
estimates the sound source direction. Next, the sound source direction estimation unit 11
selects the search interval corresponding to the estimated direction and sets the selected
interval as a new search range. The sound source direction estimation unit 11 then divides the
new search range into finer search intervals and estimates the sound source direction by
searching at the fine intervals. Thus, in the present embodiment, each successive search interval
is narrowed relative to the current one. As a result, in the present embodiment, the processing
time required to estimate the sound source direction can be shortened.
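The coarse-to-fine procedure described above can be summarized in the following sketch
(illustrative Python only, not code from the patent; the function spatial_power, which would
evaluate the averaged spatial spectrum for a candidate azimuth, is a hypothetical stand-in for the
processing of FIG. 3):

def hierarchical_search(spatial_power, lo=0.0, hi=180.0, n_points=5, n_layers=3):
    # Coarse-to-fine azimuth search: at each layer, evaluate a few points,
    # then keep only the two intervals adjacent to the best point.
    for _ in range(n_layers):
        step = (hi - lo) / (n_points - 1)
        points = [lo + k * step for k in range(n_points)]
        best = max(range(n_points), key=lambda k: spatial_power(points[k]))
        lo = max(points[0], points[best] - step)
        hi = min(points[-1], points[best] + step)
    return 0.5 * (lo + hi)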
[0023]
FIG. 3 is a block diagram of the sound source localization unit 12 according to the present
embodiment. As shown in FIG. 3, the sound source localization unit 12 includes a voice input
unit 101, a short-time Fourier transform unit 102, a first correlation matrix calculation unit 103,
a noise database 104, a second correlation matrix calculation unit 105, a matrix calculation unit
106, a number-of-layers calculation unit 107, a first spatial spectrum calculation unit 108, a
second spatial spectrum calculation unit 109, a peak search unit 110, a spatial transfer function
(STF) processing unit 111, a transfer function storage unit 112, and an output unit 113.
[0024]
The voice input unit 101 includes M sound collection units (for example, microphones), where M
is an integer of 2 or more, each disposed at a different position. The voice input unit 101 is, for
example, a microphone array provided with M microphones. The voice input unit 101 outputs
the sound wave received by each sound collection unit to the short-time Fourier transform unit
102 as a one-channel acoustic signal. The voice input unit 101 may convert the acoustic signal
from an analog signal to a digital signal, and output the signal converted into the digital signal
to the short-time Fourier transform unit 102.
[0025]
The short-time Fourier transform unit 102 performs a short-time Fourier transform (STFT) on
each frame in the time domain of the acoustic signal of each channel input from the voice input
unit 101 to generate an input signal in the frequency domain. The short-time Fourier transform
is a transform that performs a Fourier transform on the signal multiplied by a window function
while shifting the window function. A frame is a time interval of a predetermined length (frame
length), or the signal contained in that time interval. The frame length is, for example, 10
[msec]. The short-time Fourier transform unit 102 generates, for each frequency ω and frame
time f, an M-dimensional input vector X(ω, f) from the input signals transformed for each frame,
and outputs the generated input vector X(ω, f) to the first correlation matrix calculation unit
103.
[0026]
The first correlation matrix calculation unit 103 uses the input vector X(ω, f) input from the
short-time Fourier transform unit 102 to calculate the spatial correlation matrix R(ω, f) by the
following equation (1). The first correlation matrix calculation unit 103 outputs the calculated
spatial correlation matrix R(ω, f) to the matrix calculation unit 106. The spatial correlation
matrix R(ω, f) is a square matrix of M rows and M columns.
[0027]
R(ω, f) = (1/TR) Σ[τ=0 to TR−1] X(ω, f−τ) X*(ω, f−τ) … (1)
[0028]
In equation (1), f represents the current frame time, and TR is the length (number of frames) of
the section used to calculate the spatial correlation matrix R(ω, f).
The length of this section is called the window length. τ is a variable indicating a frame time
(not limited to the current frame time), and takes values in the range 0 to TR−1. The asterisk *
denotes the complex conjugate transpose operator of a vector or matrix. In equation (1), the
spatial correlation matrix R(ω, f) is smoothed over TR frames to improve the robustness against
noise. That is, equation (1) gives the correlation between channel n (n is an integer of 1 or
more) and channel m (m is an integer of 1 or more different from n): the product of the input
signal vector and its complex conjugate transpose is averaged over the TR frames preceding the
current frame time.
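For illustration, the smoothing of equation (1) can be written as follows (a sketch assuming X
holds the STFT input vectors at one frequency bin; the names are illustrative):

import numpy as np

def spatial_correlation(X, f, T_R):
    # X: array of shape (frames, M), the input vectors X(omega, f) at one
    # frequency bin. Returns the M x M matrix R(omega, f) averaged over the
    # T_R frames ending at frame f (requires f >= T_R - 1).
    M = X.shape[1]
    R = np.zeros((M, M), dtype=complex)
    for tau in range(T_R):
        v = X[f - tau]
        R += np.outer(v, v.conj())   # X(omega, f - tau) X*(omega, f - tau)
    return R / T_R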
[0029]
The noise database 104 stores in advance a noise source matrix N(ω, f) (hereinafter referred to
as a noise matrix) for each frequency ω and frame time f. The noise is sound other than the
target sound, for example, wind noise or sound emitted from other devices placed in the room
where the sound is collected.
[0030]
The second correlation matrix calculation unit 105 reads the noise matrix N(ω, f) stored in the
noise database 104 and uses the read N(ω, f) to calculate a noise correlation matrix K(ω, f)
(hereinafter referred to as the noise correlation matrix) for each frequency ω and frame time f
by the following equation (2). The noise correlation matrix K(ω, f) is the correlation matrix
calculated from the signal to be suppressed (whitened) at the time of localization; in this
embodiment it is assumed to be a unit matrix for simplicity. The second correlation matrix
calculation unit 105 outputs the calculated noise correlation matrix K(ω, f) to the matrix
calculation unit 106.
[0031]
K(ω, f) = (1/Tk) Σ[τk=0 to Tk−1] N(ω, f−τk) N*(ω, f−τk) … (2)
[0032]
In equation (2), Tk is the length (number of frames) of the section used to calculate the noise
correlation matrix K(ω, f).
τk is a variable indicating a frame time (not limited to the current frame time), and takes values
in the range 0 to Tk−1. N* denotes the complex conjugate transpose of the matrix N. As
equation (2) shows, the noise correlation matrix K(ω, f) gives the correlation between the noise
signal of channel n and the noise signal of channel m: the product of the noise matrix N and its
complex conjugate transpose is averaged over the Tk frames preceding the current frame time.
[0033]
The matrix calculation unit 106 calculates eigenvectors for each frequency ω and frame time f
from the spatial correlation matrix R(ω, f) input from the first correlation matrix calculation unit
103 and the noise correlation matrix K(ω, f) input from the second correlation matrix
calculation unit 105. The matrix calculation unit 106 outputs the calculated eigenvectors to the
number-of-layers calculation unit 107. Specifically, the matrix calculation unit 106 multiplies
the spatial correlation matrix R(ω, f) from the left by the inverse matrix K^(−1)(ω, f) of the noise
correlation matrix K(ω, f), and computes for K^(−1)(ω, f)R(ω, f) the GSVD (generalized singular
value decomposition) represented by the following equation (3) (GSVD-MUSIC method). The
matrix calculation unit 106 calculates the vector El(ω, f) and the eigenvalue matrix Λ(ω, f)
satisfying the relationship of equation (3) by the GSVD-MUSIC method. By this processing, the
matrix calculation unit 106 decomposes the observed signal into a subspace of the target sound
and a noise subspace. In the present embodiment, noise can be whitened by equation (3).
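Equation (3) corresponds to a generalized eigenvalue problem, R e = λ K e, which is equivalent
to diagonalizing K^(−1) R. As an illustration only (not the patented implementation), the
decomposition can be obtained with SciPy's generalized eigensolver:

import numpy as np
from scipy.linalg import eig

def gsvd_music_basis(R, K):
    # Solve R e = lambda K e; the eigenvectors with the largest eigenvalues
    # span the (whitened) target-sound subspace, the rest the noise subspace.
    w, E = eig(R, K)
    order = np.argsort(-w.real)
    return w[order], E[:, order]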
[0034]
K^(−1)(ω, f) R(ω, f) = El(ω, f) Λ(ω, f) Er*(ω, f) … (3)
[0035]
In equation (3), El(ω, f) contains the left singular vectors, and Er*(ω, f) is the complex conjugate
transpose of the matrix of right singular vectors.
The matrix El(ω, f) has the eigenvectors e1(ω, f), …, eM(ω, f) as its columns. The eigenvectors
e1(ω, f), …, eM(ω, f) correspond respectively to the eigenvalues λ1(ω, f), …, λM(ω, f). Here, M is
the number of microphones. The eigenvalue matrix Λ(ω, f) is given by the following equation
(4). In equation (4), diag represents a diagonal matrix.
[0036]
Λ(ω, f) = diag(λ1(ω, f), …, λM(ω, f)) … (4)
[0037]
The number-of-layers calculation unit 107 calculates the number of layers to be searched by the
first spatial spectrum calculation unit 108 through the STF processing unit 111 and the search
interval of each layer, and outputs the calculated number of layers and search interval, together
with the eigenvectors input from the matrix calculation unit 106, to the first spatial spectrum
calculation unit 108.
This search interval corresponds to the spatial resolution.
[0038]
The first spatial spectrum calculation unit 108 uses the number of layers and search interval
input from the number-of-layers calculation unit 107, the eigenvectors, and the transfer
function A or interpolated transfer function Â input from the STF processing unit 111 to
calculate, for each frequency ω and sound source direction ψ, the spatial spectrum P(ω, ψ, f)
before integration over the frequency ω by the following equation (5). When the interpolated
transfer function Â is input from the STF processing unit 111, the first spatial spectrum
calculation unit 108 performs the calculation using Â in place of the transfer function A in
equation (5). The first spatial spectrum calculation unit 108 outputs the calculated spatial
spectrum P(ω, ψ, f) and the number of layers input from the number-of-layers calculation unit
107 to the second spatial spectrum calculation unit 109.
[0039]
P(ω, ψ, f) = |A*(ω, ψ) A(ω, ψ)| / Σ[m=Ls+1 to M] |A*(ω, ψ) em(ω, f)| … (5)
[0040]
In equation (5), |…| denotes an absolute value.
Equation (5) represents the ratio of the overall transfer function component to the component
due to noise. In equation (5), em(ω, f) denotes the left singular vectors El(ω, f) (= e1(ω, f), …,
eM(ω, f)) of equation (3). A* is the complex conjugate transpose of the transfer function A. Ls is
the number of sound sources, and is an integer of 0 or more. A(ω, ψ) is a measured, known
transfer function stored in advance in the transfer function storage unit 112, and is given by the
following equation (6).
[0041]
A(ω, ψi) = [A1(ω, ψi), A2(ω, ψi), …, AM(ω, ψi)]^T … (6)
[0042]
Note that ψi is a sound source direction measured in advance, that is, a direction in which the
transfer function A was measured.
i is an integer of 1 or more, M is the number of microphones, and T denotes transposition.
[0043]
The second spatial spectrum calculation unit 109 averages the spatial spectrum P(ω, ψ, f) input
from the first spatial spectrum calculation unit 108 in the ω direction using the following
equation (7) to calculate the averaged spatial spectrum P(ψ, f). The second spatial spectrum
calculation unit 109 outputs the calculated averaged spatial spectrum P(ψ, f), together with the
number of layers and the search interval input from the first spatial spectrum calculation unit
108, to the peak search unit 110.
[0044]
P(ψ, f) = (1/(kh − kl + 1)) Σ[k=kl to kh] P(ω[k], ψ, f) … (7)
[0045]
In equation (7), ω[k] represents the frequency corresponding to the k-th frequency bin.
A frequency bin is a discretized frequency. kh and kl are the indices of the frequency bins
corresponding to the maximum frequency (upper limit frequency) and the minimum frequency
(lower limit frequency) of the frequency domain. In equation (7), kh − kl + 1 is the number of
spatial spectra P(ω, ψ, f) that are subject to the addition (Σ). The reason 1 is added is that, since
the frequencies ω are discretized, the spatial spectra P(ω[k], ψ, f) at both ends of the frequency
band, i.e., at the upper limit frequency and at the lower limit frequency, are both included in the
addition.
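For illustration, equations (5) and (7) can be combined as in the following sketch (an illustrative
reading of the MUSIC-style spectrum described above, with assumed array shapes; not code
from the patent):

import numpy as np

def averaged_spatial_spectrum(A, E_noise):
    # A: (bins, M) transfer function A(omega[k], psi) for one direction psi.
    # E_noise: (bins, M, M - Ls) noise-subspace eigenvectors e_{Ls+1..M}.
    # Returns the frequency-averaged spectrum P(psi, f) of equation (7).
    P = []
    for k in range(A.shape[0]):
        a = A[k]
        num = np.abs(np.vdot(a, a))                  # |A* A|
        den = np.sum(np.abs(a.conj() @ E_noise[k]))  # sum over m of |A* e_m|
        P.append(num / den)                          # equation (5)
    return np.mean(P)                                # equation (7)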
[0046]
The peak search unit 110 receives the averaged spatial spectrum P(ψ, f), the number of layers,
and the search interval from the second spatial spectrum calculation unit 109. The peak search
unit 110 detects the azimuths ψ[l] (l is a value in the range of 1 to Ls) at which the input
averaged spatial spectrum P(ψ, f) takes peak values. The peak search unit 110 determines, from
the number of layers input from the second spatial spectrum calculation unit 109, whether the
estimation processing of the first spatial spectrum calculation unit 108 through the STF
processing unit 111 has been completed. When the peak search unit 110 determines that the
estimation processing has been completed for the number of layers, it outputs the detected peak
azimuth ψ[l] to the output unit 113 as the estimated azimuth ψ. When the peak search unit 110
determines that the estimation processing has not been completed for the number of layers, it
outputs the detected peak azimuth ψ[l], the number of layers, and the search interval to the STF
processing unit 111.
[0047]
The STF processing unit 111 uses the peak azimuth ψ[l], the number of layers, and the search
interval input from the peak search unit 110 to either read the transfer function A from the
transfer function storage unit 112 or calculate the interpolated transfer function Â. Specifically,
it determines whether the transfer function A corresponding to the direction to be searched is
stored in the transfer function storage unit 112. When it determines that the transfer function A
corresponding to the direction to be searched is stored in the transfer function storage unit 112,
the STF processing unit 111 reads the corresponding transfer function A from the transfer
function storage unit 112. When it determines that the transfer function A corresponding to the
direction to be searched is not stored in the transfer function storage unit 112, the STF
processing unit 111 calculates the corresponding interpolated transfer function Â. The STF
processing unit 111 outputs the read transfer function A or the calculated interpolated transfer
function Â to the first spatial spectrum calculation unit 108. When the search of a
predetermined search range (also referred to as one layer) is completed, the STF processing
unit 111 calculates a new search range and search interval, as described later, based on the
direction in which the peak value was detected. The STF processing unit 111 outputs the
calculated new search range and search interval to the first spatial spectrum calculation unit
108.
[0048]
The output unit 113 outputs, for example, the estimated azimuth ψ input from the peak search
unit 110 to the sound source separation unit 13 (see FIG. 2). When only the sound source
direction estimation unit 11 is mounted on, for example, a robot, the output unit 113 may be a
display device (not shown). In this case, the output unit 113 may display the estimated azimuth
ψ on the display unit as text or graphically.
[0049]
As described above, the sound source direction estimation device of the present embodiment
includes: a transfer function storage unit (112) that stores the transfer function from the sound
source for each direction of the sound source; a calculation unit (number-of-layers calculation
unit 107) that calculates the number of layers to be searched and the search interval for each
layer based on a desired search range and a desired spatial resolution for searching for the
direction; and a sound source localization unit (peak search unit 110 and STF processing unit
111) that searches the search range using the transfer functions at each search interval,
estimates the direction of the sound source based on the search result, and updates the search
range and the search interval, based on the estimated direction, until the number of layers
calculated by the calculation unit is reached. With such a configuration, according to the
present embodiment, the number-of-layers calculation unit 107 first calculates the number of
layers to be searched and the search interval. Next, the sound source localization unit 12 (FIG.
3) divides the entire predetermined search range into coarse search intervals and searches at
those intervals to estimate the sound source direction. The sound source localization unit 12
then selects the search interval corresponding to the estimated direction and updates the search
range with the selected interval as the new search range. Finally, the sound source localization
unit 12 divides the new search range into finer search intervals, updates the search interval, and
estimates the sound source direction by searching at the finer intervals. As a result, in the
present embodiment, the processing time required to estimate the sound source direction can
be shortened.
[0050]
FIG. 4 is a block diagram of the sound source separation unit 13 according to the present
embodiment. As shown in FIG. 4, the sound source separation unit 13 includes a first cost
calculation unit 121, a second cost calculation unit 122, and a sound source separation
processing unit 123. The first cost calculation unit 121 calculates a cost function JGC(W) using
GSS (geometrically constrained sound source separation), a sound source separation method
based on constraint conditions, and outputs the cost function JGC(W) to the sound source
separation processing unit 123. The first cost calculation unit 121 realizes the geometric
constraint by specifying transfer functions as the matrix D of the cost function JGC(W)
expressed by the following equation (8). The cost function JGC expresses the degree of
geometric constraint, and is used to calculate the separation matrix W.
[0051]
[0052]
In equation (8), EGC is given by the following equation (9).
[0053]
[0054]
In equation (9), W is the separation matrix and D is the transfer function matrix.
In the present embodiment, by using for the transfer function matrix D the interpolated transfer
function Â interpolated by the STF processing unit 111 or the read transfer function A, the
geometric constraint can be applied in the correct sound source direction.
[0055]
The second cost calculation unit 122 calculates the cost function JHDSS(W) using a method of
sound source separation based on high-order decorrelation (HDSS), an extension of independent
component analysis, and outputs the cost function JHDSS(W) to the sound source separation
processing unit 123.
[0056]
The sound source separation processing unit 123 uses the cost function JGC(W) input from the
first cost calculation unit 121 and the cost function JHDSS(W) input from the second cost
calculation unit 122 to calculate the cost function JGHDSS(W) by the following equation (10).
That is, the present embodiment uses a method in which the GC method and the HDSS method
are integrated.
In the present invention, the method integrated in this way is referred to as the GHDSS
(Geometric High-order Decorrelation-based Source Separation) method.
[0057]
[0058]
In equation (10), α is a scalar value of 0 or more and 1 or less.
In equation (10), the cost function JHDSS(W) is expressed as the following equation (11).
[0059]
[0060]
In equation (11), E[·] denotes the expected value, and the bold letter E denotes the matrix E.
Eφ is a cost function used in place of the correlation matrix E in DSS (Decorrelation-based
Source Separation).
In equation (11), EHDSS is given by the following equation (12).
[0061]
[0062]
In equation (12), the bold letter y represents a vector.
The matrix E is defined as E = yy^H − I, where I is the identity matrix and the superscript H
denotes the complex conjugate transpose. φ(y) is a nonlinear function given by the following
equation (13).
[0063]
[0064]
In equation (13), p(yi) is the joint probability density function (pdf) of y.
Although φ(yi) can be defined in various ways, a hyperbolic-tangent-based function such as the
following equation (14) may be used as an example of φ(yi).
[0065]
[0066]
In equation (14), η is a scaling parameter.
[0067]
The sound source separation processing unit 123 adaptively calculates the separation matrix W
so as to minimize the cost functions JGC and JHDSS according to the following equation (15).
Based on the separation matrix W estimated in this way, the sound source separation processing
unit 123 separates the multichannel acoustic signal input to the voice input unit 101 (see FIG. 3)
into components for each sound source.
The sound source separation processing unit 123 outputs the separated components of each
sound source to, for example, the acoustic feature quantity extraction unit 14.
[0068]
[0069]
In equation (15), t is the update step (frame) index, μHDSS is the step size used when updating
the error matrix, and μGC is the step size used when updating the geometric error matrix.
J′HDSS(Wt) is the HDSS error matrix, obtained by differentiating JHDSS with respect to each
element of the input; J′GC(Wt) is the geometric error matrix, obtained by differentiating JGC
with respect to each element of the input. The step size μGC is given by the following equation
(16) using EGC and the geometric error matrix J′GC(Wt). The step size μHDSS is given by the
following equation (17) using EHDSS and the error matrix J′HDSS(Wt).
[0070]
[0071]
[0072]
As described above, the sound source separation unit 13 of the present embodiment performs
sound source separation by sequentially calculating the separation matrix W using the GHDSS
method.
In the present embodiment, an example in which the sound source separation unit 13 performs
sound source separation using the GHDSS method has been described, but sound source
separation may also be performed using a known BSS (blind source separation) method or the
like.
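A minimal sketch of an update of the form described for equation (15) is given below (the
gradient terms are passed in as functions because their closed forms depend on equations (8) to
(14), whose images are not reproduced here; the names and step-size handling are
assumptions):

def update_separation_matrix(W, grad_hdss, grad_gc, mu_hdss, mu_gc):
    # One adaptive step: W <- W - mu_HDSS * J'_HDSS(W) - mu_GC * J'_GC(W).
    return W - mu_hdss * grad_hdss(W) - mu_gc * grad_gc(W)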
[0073]
Next, the hierarchical search processing performed by the number-of-layers calculation unit
107 through the STF processing unit 111 will be described.
FIG. 5 is a flowchart showing the procedure of the hierarchical search processing according to
the present embodiment. FIG. 6 is a diagram explaining the procedure of the hierarchical search
processing according to the present embodiment. As shown in FIG. 6, the search range is from 0
to d0. A specific search range is, for example, 0 degrees to 180 degrees. The search interval of
the first layer is d1, and the search interval of the second layer is d2. The spatial resolution
desired by the user of the sound source direction estimation device 11 is the search interval dS
of the S-th layer. Although FIG. 6 shows the case where each layer is divided into four search
intervals in order to simplify the description, the number of search intervals is not limited to
this. The symbols p1 to p5 indicate measurement points at which the transfer function was
measured in advance. As a specific example, the measurement point p1 is 0 degrees, p2 is 30
degrees, p3 is 90 degrees, p4 is 130 degrees, and p5 is 180 degrees. The symbols q11 to q15
denote the search points in the first layer. As a specific example, the search point q11 is 0
degrees, q12 is 45 degrees, q13 is 90 degrees, q14 is 135 degrees, and q15 is 180 degrees. Note
that the transfer functions A(ω, ψp1) to A(ω, ψp4) corresponding to the azimuths p1, p2, p3, and
p4 are stored in the transfer function storage unit 112 in advance.
[0074]
(Step S1) The user of the sound source direction estimation device 11 selects a desired spatial
resolution. The number-of-layers calculation unit 107 calculates the number of layers S and the
search interval δi using the following equations (18) and (19), based on the desired spatial
resolution selected by the user and the search range. It outputs the calculated number of layers
S, the search interval δi, and the eigenvectors input from the matrix calculation unit 106 to the
first spatial spectrum calculation unit 108. The calculation of the number of layers S and the
search interval δi will be described later. After step S1 ends, the process proceeds to step S2.
[0075]
[0076]
[0077]
In equation (19), ds is the search interval of the s-th layer, where s is an integer of 1 or more
and S or less.
[0078]
(Step S2) The STF processing unit 111 determines whether the transfer function A(ω, ψ)
corresponding to the first search point (q11 in FIG. 6) is stored in the transfer function storage
unit 112.
If the STF processing unit 111 determines that the transfer function A(ω, ψ) corresponding to
the first search point is stored in the transfer function storage unit 112 (step S2; Yes), the
process proceeds to step S3. If it determines that the transfer function A(ω, ψ) corresponding to
the search point is not stored in the transfer function storage unit 112 (step S2; No), the process
proceeds to step S4.
[0079]
(Step S3) When it is determined that the transfer function A(ω, ψ) corresponding to the first
search point is stored in the transfer function storage unit 112, the STF processing unit 111
reads the transfer function A(ω, ψ) corresponding to the first search point (q11 in FIG. 6) from
the transfer function storage unit 112, and outputs the read transfer function A(ω, ψ) to the first
spatial spectrum calculation unit 108. After step S3 ends, the process proceeds to step S5.
[0080]
(Step S4) When it is determined that the transfer function A(ω, ψ) corresponding to the first
search point is not stored in the transfer function storage unit 112, the STF processing unit 111
calculates the interpolated transfer function Â of the first search point by interpolation using
the transfer functions corresponding to the two measurement points adjacent to the first search
point. As an example, in the case of the search point q12, the interpolated transfer function Â of
the search point q12 is calculated using the transfer functions A of the two adjacent
measurement points p2 and p3. The STF processing unit 111 may use, for example, the FDLI
method, the TDLI method, or the FTDLI method for the interpolation. Each interpolation
method will be described later.
[0081]
(Step S5) The first spatial spectrum calculation unit 108 calculates the spatial spectrum
P(ω, ψ, f) using the transfer function A or the interpolated transfer function Â input from the
STF processing unit 111. Next, the second spatial spectrum calculation unit 109 calculates the
averaged spatial spectrum P(ψ, f) using the spatial spectrum P(ω, ψ, f) calculated by the first
spatial spectrum calculation unit 108, and outputs information indicating the calculated
averaged spatial spectrum P(ψ, f) and the number of the search point to the peak search unit
110. The search point number is a number assigned to each of the search points q11 to q15.
After step S5 ends, the process proceeds to step S6.
[0082]
(Step S6) The peak search unit 110 determines whether the search of all the search points in the
search range is completed. If the peak search unit 110 determines that the search of all the
search points in the search range is completed (step S6; Yes), the process proceeds to step S8; if
it determines that the search of all the search points in the search range is not completed (step
S6; No), the process proceeds to step S7.
[0083]
(Step S7) If it is determined that the search of all the search points in the search range is not
completed, the peak search unit 110 outputs information indicating an instruction to search the
next search point to the STF processing unit 111. For example, when the processing of the
search point q11 is completed in step S5, the next search point is q12. The search may proceed
from 0 toward d0 or from d0 toward 0.
[0084]
(Step S8) If it is determined that the search of all the search points in the search range is
completed, the peak search unit 110 extracts the azimuth ψ[i] giving the maximum value from
among all the averaged spatial spectra P(ψ, f) calculated in the search range. After step S8 ends,
the process proceeds to step S9.
[0085]
(Step S9) The peak search unit 110 determines whether the search has been completed for all
the layers calculated in step S1. If the peak search unit 110 determines that the search has been
completed for all the layers (step S9; Yes), the process proceeds to step S10; if it determines
that the search has not been completed for all the layers (step S9; No), the process proceeds to
step S11.
[0086]
(Step S10) The peak search unit 110 outputs the extracted azimuth ψ[i] to the output unit 113
as the estimated azimuth ψ^. The output unit 113 then outputs, for example, the estimated
azimuth ψ^ input from the peak search unit 110 to the acoustic feature quantity extraction unit
14 (see FIG. 2), and the sound source direction estimation process ends.
[0087]
(Step S11) When the STF processing unit 111 determines that the search has not been
completed for all the layers calculated in step S1, it selects the two search points adjacent to the
azimuth ψ[i] having the peak value, namely (ψ[i] − δi) and (ψ[i] + δi), as the section to be
searched next. Hereinafter, ψ[i] − δi is denoted ψ[i−], and ψ[i] + δi is denoted ψ[i+]. As an
example, in FIG. 6, when the search point at which the peak value is detected is q13, the STF
processing unit 111 selects the search points q12 and q14 as the two adjacent search points. In
this case, the search range of the second layer is from the search point q12 to the search point
q14, and the width of the search range is 2d1. After step S11 ends, the process proceeds to step
S12.
[0088]
(Step S12) The STF processing unit 111 calculates the search interval d used in the search of
the second layer. The calculation of the search interval d in each layer will be described later.
After step S12 ends, the process returns to step S2. The first spatial spectrum calculation unit
108 through the STF processing unit 111 repeat steps S2 to S9 in the second layer, calculate
the averaged spatial spectrum P(ψ, f) using the transfer function (or interpolated transfer
function) of each interval, and estimate the azimuth ψ having the peak value. After the search of
the second layer is completed, the STF processing unit 111 selects the two search points
adjacent to the search point having the peak value and calculates the search interval d to be
used in the next layer. Thereafter, the first spatial spectrum calculation unit 108 through the
STF processing unit 111 perform the sound source direction estimation by repeating steps S2
to S12 for all the layers calculated in step S1.
[0089]
Next, the interpolation performed by the STF processing unit 111 will be described. The STF
processing unit 111 generates the interpolated transfer function Â by interpolating the transfer
function A using, for example, any one of the (1) FDLI method, (2) TDLI method, and (3) FTDLI
method described below. The transfer functions A at two measured points are represented by
the following equations (20) and (21). These transfer functions A are stored in advance in the
transfer function storage unit 112. The transfer function A(ω, ψ1) is the transfer function in the
direction ψ1, and the transfer function A(ω, ψ2) is the transfer function in the direction ψ2.
[0090]
[0091]
[0092]
(1) FDLI (Frequency Domain Linear or Bi-Linear Interpolation) Method: In the FDLI method
(linear interpolation in the frequency domain), the STF processing unit 111 performs linear
interpolation between two measurement points using the following equation (22) to calculate
the interpolated transfer function Â.
[0093]
Â(ω, ψ^) = DA A(ω, ψ1) + (1 − DA) A(ω, ψ2) … (22)
[0094]
In equation (22), DA is an interpolation coefficient taking a value of 0 or more and 1 or less.
The FDLI method is characterized in that the phase can be interpolated linearly.
[0095]
(2) TDLI (Time Domain Linear Interpolation) Method: In the TDLI method (linear interpolation
in the time domain), the interpolation is expressed as the following equation (23).
[0096]
[0097]
In equation (23), dψ^ is given by the following equation (24).
[0098]
In equations (23) and (24), kψ1 and kψ2 are geometrically determined coefficients, dψ1 and
dψ2 are geometrically determined time delays, and am(t, ψ) is the time-domain representation
of Am(ω, ψ).
If equation (23) is regarded as amplitude interpolation and equation (24) as phase interpolation,
the TDLI method in the frequency domain is expressed as the following equation (25).
[0099]
[0100]
As described above, in the TDLI method, the STF processing unit 111 performs linear
interpolation between two measurement points using equation (25) to calculate the
interpolated transfer function Â.
The TDLI method is characterized in that the amplitude can be interpolated linearly.
[0101]
(3) FTDLI (Frequency-Time Domain Linear or Bi-Linear Interpolation) Method: The FTDLI
method is a linear interpolation method combining the FDLI method and the TDLI method
described above.
The FTDLI method obtains the phase from linear interpolation in the frequency domain, and
obtains the amplitude from linear interpolation in the time domain.
In the FTDLI method, the STF processing unit 111 performs linear interpolation between two
measurement points using the following equation (26) to calculate the interpolated transfer
function Â.
[0102]
[0103]
Next, the calculation procedure of the interpolated transfer function Â in the FTDLI method will
be described.
(I) First, interpolated transfer functions are calculated using equations (22) and (25).
In the following description, the obtained interpolated transfer functions are expressed as the
following equations (27) and (28).
[0104]
[0105]
[0106]
(II) Next, equations (27) and (28) are decomposed into phase and amplitude as equations (29)
and (30), respectively.
[0107]
[0108]
[0109]
From equations (29) and (30), the interpolated transfer function Â(ω, ψ^) is expressed by
equation (26) described above.
The FTDLI method is characterized by its ability to interpolate both phase and amplitude
linearly.
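As an illustration of the FDLI/FTDLI idea (a sketch only: the exact coefficients of equations (22)
to (30) are defined by the formula images, so a generic linear weight w in [0, 1] is assumed
here):

import numpy as np

def fdli(A1, A2, w):
    # Frequency-domain linear interpolation between the measured transfer
    # functions A1 = A(omega, psi1) and A2 = A(omega, psi2).
    return (1.0 - w) * A1 + w * A2

def ftdli(A1, A2, w):
    # FTDLI-style combination: phase from the frequency-domain interpolation,
    # amplitude interpolated linearly between the two measurements.
    phase = np.angle(fdli(A1, A2, w))
    amplitude = (1.0 - w) * np.abs(A1) + w * np.abs(A2)
    return amplitude * np.exp(1j * phase)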
[0110]
Next, the procedure for calculating the optimal number of layers and the search intervals,
performed by the number-of-layers calculation unit 107 in step S1 described above, will be
described.
FIG. 7 is a diagram explaining the calculation procedure of the layer values according to the
present embodiment.
In the following description, the left end of the search range is set to 0 in each layer for
convenience.
Also, in each layer, it is assumed for convenience that the peak value lies in some search interval
between adjacent search points, for example in an interval d1 in the first layer.
[0111]
In the following description, the known values are the search range d0 of the first layer and the
search interval dS of the S-th layer.
The values to be obtained are the number of layers S that minimizes the number of searches
and the search interval ds of each layer.
First, the interval between the directions ψ1 and ψ2 of the two transfer functions is set to d0.
That is, when the range from 0 to d0 between the direction ψ1 and the direction ψ2 of the two
transfer functions is searched at the interval dS, the number of searches is d0/dS.
In order to minimize the number of searches and reduce the calculation cost, searches are
performed hierarchically in the present embodiment as follows. As shown in FIG. 7, the number
of layers is S, and the interval of the s-th layer is ds, where s is 1 or more and S or less. That is,
the interval of the first layer is d1, and the interval of the second layer is d2 (where d2 is smaller
than d1). In the layering, the search of the s-th layer is first performed, and one interval ds
containing the peak value is selected by the search. Next, the selected interval ds is set as the
(s+1)-th layer, which is searched at the interval d(s+1). In the search of the (s+1)-th layer, one
interval d(s+1) containing the peak value is selected, and the selected interval d(s+1) is treated
as the (s+2)-th layer. Such processing is called layering. In the layering, the interval of an upper
layer is coarse, and the interval of a lower layer becomes finer. In the S-th layer, since the
interval is dS and the width of the search range is d(S−1), the number of searches is d(S−1)/dS.
The interval ds is hereinafter also referred to as the granularity.
[0112]
(When the Number of Layers is Two) As shown in FIG. 7, the total number of searches F(d1)
when searching up to the second layer (S = 2) is expressed by the following equation (31).
[0113]
F(d1) = d0/d1 + d1/dS … (31)
[0114]
In equation (31), the only variable is d1.
Therefore, partially differentiating equation (31) with respect to d1 to obtain the minimum
value of equation (31) yields equation (32).
[0115]
∂F(d1)/∂d1 = −d0/d1² + 1/dS … (32)
[0116]
Therefore, the interval d1 at which equation (32) becomes 0 is √(d0·dS).
For this value of d1, the number of searches in the first layer is given by the following equation
(33), and the number of searches in the second layer by the following equation (34).
[0117]
d0/d1 = d0/√(d0·dS) = √(d0/dS) … (33)
[0118]
d1/dS = √(d0·dS)/dS = √(d0/dS) … (34)
[0119]
As equations (33) and (34) show, the total number of searches can be minimized by equalizing
the number of searches in each layer.
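As a worked example of the two-layer case (numbers chosen only for illustration): for a search
range d0 = 180 degrees and a desired resolution dS = 5 degrees, a flat search requires d0/dS =
36 searches, whereas the optimal first-layer interval is d1 = √(d0·dS) = √900 = 30 degrees,
giving d0/d1 = 6 searches in the first layer and d1/dS = 6 in the second, i.e., 12 searches in
total.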
[0120]
(When the Number of Layers is s) Next, the conditions for minimizing the total number of
searches when the number of layers is s will be described.
Even when the number of layers is s, the total number of searches can be minimized by
equalizing the number of searches in each layer.
In the following, this is shown by contradiction.
When the number of layers is s, the total number of searches is d0/d1 + d1/d2 + … + d(s−1)/ds.
Assume that there is a layering that minimizes the total number of searches and in which
d(i−1)/di ≠ di/d(i+1) for some i, where i is an integer of 1 or more and S or less. For the two
layers i−1 to i+1, as described in the two-layer example, (d(i−1)/di) + (di/d(i+1)) is minimized
when d(i−1)/di = di/d(i+1); hence the assumption leads to a contradiction. That is, in the
optimal layering there is no layer with d(i−1)/di ≠ di/d(i+1). As a result, regardless of the
number of layers, the total number of searches is minimized when the number of searches in
each layer is equal. Therefore, when the number of layers is s, the condition for minimizing the
total number of searches is d0/d1 = d1/d2 = … = d(s−1)/ds. Transforming this conditional
expression yields the following equation (35).
[0121]
d(i−1)/di = (d0/ds)^(1/s) ... (35)
[0122]
From equation (35), the granularity d(s−1) at which the total number of searches is minimized is given by the following equation (36).
[0123]
d(s−1) = ds·(d0/ds)^(1/s) ... (36)
[0124]
(When the Number of Layers is S) Next, when the number of layers is S, calculating the total number of searches G(S) under the condition d0/d1 = d1/d2 = ... = d(S−1)/dS gives the following formula (37).
[0125]
G(S) = S·(d0/dS)^(1/S) ... (37)
[0126]
Next, in order to obtain the S that minimizes equation (37), partially differentiating equation (37) with respect to S yields the following equation (38).
[0127]
∂G(S)/∂S = (d0/dS)^(1/S)·{1 − (1/S)·log(d0/dS)} ... (38)
[0128]
The total number of searches is minimized when equation (38) equals zero.
Accordingly, the total number of searches is minimized at S = log(d0/dS).
The granularity ds of each layer can then be calculated by the following equation (39), obtained by rearranging the conditional expression in the same manner as equations (35) and (36) and substituting S = log(d0/dS) into the rearranged form.
[0129]
ds = d0·(dS/d0)^(s/S) ... (39)
[0130]
As described above, the number-of-layers calculation unit 107 (see FIG. 3) calculates the number of layers S at which the total number of searches is minimized using S = log(d0/dS), and calculates the interval (granularity) ds of each layer at which the total number of searches is minimized using equation (39).
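The computation described above can be sketched as follows; rounding S = log(d0/dS) (natural logarithm) to a whole number, and treating equation (39) as a geometric progression of intervals between d0 and dS, are our own assumptions for illustration.

```python
import math

def optimal_layers(d0, dS):
    """Sketch of the number-of-layers computation described above.

    S = log(d0 / dS) as in the text; rounding S to an integer >= 1 is
    an assumption here, since a layer count must be whole.  Intervals
    follow a geometric progression from d0 down to dS (equation (39)).
    """
    S = max(1, round(math.log(d0 / dS)))
    ratio = (dS / d0) ** (1.0 / S)               # shrink factor per layer
    intervals = [d0 * ratio ** s for s in range(1, S + 1)]
    return S, intervals

# Example: a 360-degree range searched down to a 1-degree granularity.
S, ds = optimal_layers(360.0, 1.0)
print(S, [round(d, 2) for d in ds])              # 6 layers; last interval is 1.0
```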
[0131]
As described above, in the present embodiment, a sound source search over a broad range is first performed at a coarse resolution, and a narrower range is then searched at finer intervals based on the search result.
According to the present embodiment, the number of searches can be reduced by hierarchically
performing such a search, so that the calculation cost can be reduced while maintaining the
estimation performance of the sound source localization.
[0132]
Second Embodiment In the first embodiment, an example has been described in which the number-of-layers calculation unit 107 calculates the number of layers S at which the total number of searches is minimized and the granularity ds.
In this embodiment, an example is described in which interpolation is performed between the directions ψ1 and ψ2 of the transfer functions, and the search also takes the interpolated directions into consideration.
[0133]
FIG. 8 is a diagram for explaining the calculation procedure of the number of layers and the interval according to the present embodiment.
The difference from FIG. 7 is that the search cost is taken into account.
Let ct be the calculation cost of searching one search point.
Further, some of the search points require interpolation, and the calculation cost of interpolating one point is denoted cI.
In FIG. 8, a search point is a point at every interval d1 in the first layer. The calculation cost due to layering is the sum of the search cost and the cost of interpolation (hereinafter referred to as the interpolation cost). The search cost is the number of searches multiplied by the calculation cost ct required to search one search point.
[0134]
In the following description, the known values are the search range d0 of the first layer and the search interval dS of the S-th layer. The values to be obtained are the number of layers S at which the calculation cost is minimized and the search interval ds of each layer. As shown in FIG. 8, the search cost of the first layer is (d0/d1)ct, the search cost of the second layer is (d1/d2)ct, and the search cost of the S-th layer is (dS−1/dS)ct. In the present embodiment, the number-of-layers calculation unit 107 calculates the number of layers S that minimizes the calculation cost due to layering, and the corresponding granularity ds.
[0135]
FIG. 9 is a diagram for explaining the search cost and the interpolation cost of the s-th layer
according to the present embodiment. First, the number of searches in the s-th layer is ds−1 /
ds, and the search cost is (ds−1 / ds) ct. The search cost (ds-1 / ds) ct is a fixed value.
[0136]
Next, the interpolation cost of the s-th layer will be described. In the s-th layer, the interval (granularity) is ds, and the range to be searched is ds−1. Interpolation is assumed to be required at Is points among the search points. In FIG. 8, search points indicated by black circles are points that need no interpolation, and search points indicated by white circles are points that need interpolation. The interpolation cost in this case is Is·cI. If no search point needs interpolation, the interpolation cost is zero; if all search points need interpolation, it is (ds−1/ds)cI. Therefore, the interpolation cost Is·cI of the s-th layer ranges from 0 to (ds−1/ds)cI, and it can be written as (ds−1/ds)cs with a constant cs that is 0 or more and cI or less.
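For example (values our own): if a layer has ds−1/ds = 10 search points and Is = 4 of them require interpolation, the interpolation cost is 4cI, i.e. (ds−1/ds)cs with cs = 0.4cI.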
[0137]
Therefore, since the calculation cost of the s-th layer is the sum of the search cost and the
interpolation cost, it is expressed by the following equation (40).
[0138]
(ds−1/ds)·ct + (ds−1/ds)·cs = (ds−1/ds)·(ct + cs) ... (40)
[0139]
For this reason, the calculation cost G~(S) over all the layers is expressed by the following equation (41).
[0140]
G~(S) = Σ[i=1..S] (d(i−1)/di)·(ct + ci) ... (41)
[0141]
In equation (41), i is 1 or more and S or less.
The case of ci = 0 corresponds to considering the search cost only.
Further, ct is normalized to 1 so that the total number of searches G(S) in equation (37) of the first embodiment and G~(S) with ci = 0 become equal.
With ct normalized, ct + ci on the right side of equation (41) is written as 1 + ci. This (1 + ci) is newly defined as a variable Ci, where Ci is 1 or more. Replacing ct + ci with the variable Ci, equation (41) becomes the following equation (42).
[0142]
G~(S) = Σ[i=1..S] (d(i−1)/di)·Ci ... (42)
[0143]
(When the Number of Layers is Two) First, the case where the number of layers is two will be
described.
The calculation cost F(d1) in this case is (d0/d1)C1 + (d1/dS)CS. In F(d1), only d1 is a variable. To find the minimum of F(d1), partially differentiating F(d1) with respect to d1 yields the following equation (43).
[0144]
∂F(d1)/∂d1 = −C1·d0/d1^2 + CS/dS ... (43)
[0145]
The d1 at which equation (43) becomes 0 is √(C1·d0·dS/CS).
The number of searches in the first layer at this time is given by equation (44) below, and the number of searches in the second layer by equation (45) below.
[0146]
(d0/d1)·C1 = √(C1·CS·d0/dS) ... (44)
[0147]
(d1/dS)·CS = √(C1·CS·d0/dS) ... (45)
[0148]
From equations (44) and (45), the calculation cost from d0 to dS can be minimized by equalizing the per-layer costs weighted by C1 and CS. Similarly, when the number of layers is s, the condition for minimizing the calculation cost is that the weighted per-layer costs (d(i−1)/di)·Ci are equal, where i is 1 or more and s or less.
[0149]
(When the Number of Layers is S) Next, when the number of layers is S, the calculation cost G~(S) of equation (42) is determined under the condition (d0/d1)C1 = (d1/d2)C2 = ... = (d(S−1)/dS)CS. Rearranging this conditional expression yields the following expression (46).
[0150]
(d(i−1)/di)·Ci = {(d0/dS)·C1·C2·...·CS}^(1/S) ... (46)
[0151]
From equation (46), the granularity d(S−1) at which the calculation cost is minimized is given by the following equation (47).
[0152]
d(S−1) = (dS/CS)·{(d0/dS)·C1·C2·...·CS}^(1/S) ... (47)
[0153]
From equation (47), the calculation cost G~(S) becomes the following equation (48).
[0154]
G~(S) = S·{(d0/dS)·C1·C2·...·CS}^(1/S) ... (48)
[0155]
Next, in order to obtain the S that minimizes equation (48), partially differentiating equation (48) with respect to S yields the following equation (49).
[0156]
[0157]
The S at which equation (49) becomes 0 is given by equation (50).
[0158]
[0159]
Therefore, when the number of layers is S, the interval (granularity) ds at which the calculation cost is minimized is given by the following equation (51).
[0160]
[0161]
As described above, the number-of-layers calculation unit 107 (see FIG. 3) calculates the number of layers S at which the calculation cost is minimized using equation (50), and calculates the interval (granularity) ds at which the calculation cost is minimized using equation (51).
[0162]
[Experimental Results] Next, the sound source direction estimation device 11 according to the first embodiment or the second embodiment was implemented on the open-source robot audition software HARK (HRI-JP Audition for Robots with Kyoto University) and evaluated; the results are described below.
In the following evaluation, the sound source direction estimation device 11 of the first embodiment was mounted on a humanoid robot and evaluated.
FIG. 10 is a diagram for explaining the evaluation conditions.
As shown in FIG. 10, the evaluation conditions are as follows: the sampling frequency is 16 kHz, the number of FFT points is 512, the shift length is 160, and the eight microphones are installed in a circular array. The number of speakers is 4, the distance to each speaker is 1.5 m, and the size of the evaluation room is 7 × 4 m. The reverberation time of the room is 0.2 [sec].
[0163]
FIG. 11 is a diagram for explaining a search point in evaluation.
In FIG. 11, the left-right direction of the drawing is the Y direction, and the direction
perpendicular to the Y direction is the X direction.
Reference numeral 301 denotes the robot to which the sound source direction estimation device 11 is attached, and reference numerals 311 to 314 denote speakers, i.e., sound sources.
The sound source 311 is at a position of −60 degrees clockwise with respect to the X axis, the sound source 312 is at −20 degrees clockwise with respect to the X axis, the sound source 313 is on the X axis, and the sound source 314 is at 60 [deg] counterclockwise with respect to the X axis.
[0164]
As for the search points, the azimuth ψ1 was fixed at 0 degrees, and the azimuth ψ2 was evaluated at 30, 60, 90, and 120 degrees.
The interpolation points between the azimuths are at 1-degree steps, and the error ē[ψ1, ψ2] with respect to the 1-degree transfer functions obtained by measurement was calculated using the following equation (52).
[0165]
[0166]
In equation (52), f(ω[k], ψ^[i]) is the interpolation error of the frequency bin corresponding to the frequency ω[k] at the interpolation point ψ^[i].
In the evaluation, kl and kh in equation (52) were selected so that the frequency ω, which lies in the low-frequency region used for sound source localization, is in the range of 500 [Hz] to 2800 [Hz].
In equation (52), iψ is the number of 1-degree interpolation points in the range larger than ψ1 and smaller than ψ2.
In the following evaluation, the phase estimation error (PEE), spectral distortion (SD), and signal-to-distortion ratio (SDR) were used as error indices for f(ω[k], ψ^[i]).
PEE is expressed by the following equation (53) and is an index of phase error.
[0167]
[0168]
SD is expressed by the following equation (54) and represents the amplitude of the error.
[0169]
[0170]
SDR is expressed by the following equation (55) and indicates the error of the transfer function
itself.
[0171]
[0172]
Hereinafter, ē1, ē2, and ē3 denote ē[ψ1, ψ2] when using PEE, SD, and SDR, respectively.
Further, the linearity of the interpolation coefficient DA was evaluated using the following
equation (56).
[0173]
[0174]
In equation (56), the interpolation coefficient DA[i] is the value of DA at which the interpolation error is minimized.
In equation (56), the interpolation coefficient DA being close to linear with respect to the interpolation point means that DA can be determined simply as (ψ^ − ψ2)/(ψ1 − ψ2), which makes it practical.
Hereinafter, d̄1, d̄2, and d̄3 denote d̄[ψ1, ψ2] when using PEE, SD, and SDR, respectively.
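For instance (values our own), with ψ1 = 0 degrees, ψ2 = 30 degrees, and an interpolation point ψ^ = 10 degrees, the linear coefficient is DA = (10 − 30)/(0 − 30) = 2/3.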
[0175]
FIG. 12 is a diagram showing an example of the evaluation results for the error of the transfer function and the linearity of the interpolation coefficient when using PEE, SD, and SDR while the azimuth ψ2 is changed.
FIG. 12 shows results evaluated with each index for the FDLI method, the TDLI method, and the FTDLI method.
In FIGS. 12(a) to 12(f), the horizontal axis is the relative angle and the vertical axis is the average error.
FIG. 12(a) shows the evaluation values of ē1, FIG. 12(b) those of ē2, and FIG. 12(c) those of ē3.
FIG. 12(d) shows the evaluation values of d̄1, FIG. 12(e) those of d̄2, and FIG. 12(f) those of d̄3.
[0176]
As shown in FIGS. 12(a) to 12(c), when the FTDLI method of the present invention is used, the error is smaller for all of ē1, ē2, and ē3 than when the FDLI method or the TDLI method is used. Further, as shown in FIGS. 12(d) to 12(f), regarding the linearity of DA as well, the error is smaller for all of d̄1, d̄2, and d̄3 with the FTDLI method than with the FDLI method or the TDLI method. This means that, also in terms of the linearity of DA, the FTDLI method is closest to linear.
[0177]
FIG. 13 is a diagram showing an example of the average error of the direction-of-arrival estimation of the sound source with and without interpolation. FIG. 13 shows an example of the results of evaluating the average error of direction-of-arrival estimation with no interpolation (NONE), the FDLI method, the TDLI method, and the FTDLI method. In FIG. 13, the horizontal axis is the measurement interval (relative angle) of the transfer function, and the vertical axis is the average error of the direction-of-arrival estimation. In the evaluation, white noise was reproduced at every 1 degree (where the direction ψ ranges from −90 degrees to 90 degrees), and the error between the reproduction direction of the sound source and the estimated direction was calculated. In the evaluation, the measurement interval of the transfer function was varied over 1, 5, 10, 30, 60, 90, and 120 degrees, and transfer functions at 1-degree intervals were further generated by interpolation.
[0178]
As shown in FIG. 13, at every relative angle, performing any interpolation yields a smaller average error than performing no interpolation. Furthermore, the average error is smaller with the FTDLI method than with no interpolation or with interpolation by the FDLI method or the TDLI method. Thus, according to the present invention, by generating the transfer functions between measured directions through interpolation of transfer functions generated and stored in advance at 30-degree intervals, sound source direction estimation achieves the same accuracy as when transfer functions at 1-degree intervals are used.
[0179]
Next, an example of the results of evaluating the calculation cost of the hierarchical search will be described. In the evaluation, the number of sound sources was varied, and the average processing time for the sound source localization calculation (equations (3) to (7)) was computed over 1000 frames of processing. The sound source search used two layers: a 10-degree interval was searched in the first layer, and then a 1-degree interval search was performed in the second layer. The cases of one, two, three, and four sound sources were evaluated. FIG. 14 is a diagram showing an example of the evaluation results for the calculation cost. In FIG. 14, the second row shows the computation time when the search is performed without layering, and the third row shows the computation time when the search is layered according to the present invention.
[0180]
As shown in FIG. 14, when there is one sound source, the computation time is 24.1 [msec] without layering and 6.8 [msec] with layering. With two sound sources, the computation time is 22 [msec] without layering and 7.7 [msec] with layering. With three sound sources, it is 19.6 [msec] without layering and 8.3 [msec] with layering. With four sound sources, it is 17.5 [msec] without layering and 8.7 [msec] with layering. Thus, even when four sound sources are present and utter simultaneously, the processing finishes within a frame period of 10 [msec]. Further, as shown in FIG. 14, when the search is performed hierarchically as in the present embodiment, the average processing time is reduced to 50% or less of that without layering, regardless of the number of sound sources. As described above, according to the present invention, the processing efficiency of estimating the sound source direction can be improved.
[0181]
FIG. 15 is a diagram showing an example of the results of speech recognition of the acoustic signals separated for each sound source. In FIG. 15, the horizontal axis is the relative angle, and the vertical axis is the word correct rate (WCR). The evaluation used speech data of 10 people × 216 words from the ATR speech database to recognize speech uttered simultaneously by four speakers. The transfer functions used for the evaluation were measured at the same intervals as in FIG. 13, and interpolation was applied to them to estimate transfer functions at 1-degree intervals. The WCR was evaluated while the interval of the estimated transfer functions was changed. As in FIG. 13, no interpolation (NONE), the FDLI method, the TDLI method, and the FTDLI method were evaluated.
[0182]
As shown in FIG. 15, when interpolation is not used, the recognition performance degrades once the transfer function interval exceeds 30 degrees. When the FTDLI method is used as the interpolation method, the recognition performance is maintained better than with the other interpolation methods. For example, when the transfer functions are 90 degrees apart, the recognition rate improves by about 7% compared with the FDLI method.
[0183]
As described above, when the sound source direction estimation device 11 according to the first embodiment or the second embodiment is applied to the sound processing device 1, the amount of calculation involved in the search can be reduced while maintaining the same word correct rate as when transfer functions with a fine spatial resolution of 1 degree are used. As a result, the sound source direction can be estimated accurately in real time for each frame.
[0184]
In the first and second embodiments, an example has been described in which the matrix calculation unit 106 calculates a GSVD matrix and estimates the direction of the sound source based on the calculated GSVD matrix, but the present invention is not limited thereto. The matrix calculation unit 106 may calculate a GEVD (generalized eigenvalue decomposition) matrix and estimate the sound source direction based on the calculated GEVD matrix. Also in this case, as described in the first embodiment or the second embodiment, the sound source direction estimation device 11 calculates the number of layers and the search interval (granularity) that minimize the total number of searches, or the number of layers and the granularity that minimize the calculation cost. Then, based on the calculated number of layers and search interval (granularity), the sound source direction estimation device 11 first performs a search at a coarse search interval and selects the next search range from within that range. The sound source direction estimation device 11 then narrows the search interval within the selected search range. In this way, the sound source direction estimation device 11 can reduce the calculation cost while accurately estimating the sound source direction.
[0185]
In the first and second embodiments, an example has been described in which the number-of-layers calculation unit 107 calculates both the number of layers and the search interval used for the search, but the present invention is not limited thereto. The number of layers may be selected in advance by the user of the apparatus, and only the search interval may be calculated.
[0186]
In the first and second embodiments, the FDLI method, the TDLI method, and the FTDLI method
are used as techniques for interpolating the transfer function. However, the present invention is
not limited to this. Other techniques may be used to interpolate the transfer function.
[0187]
The estimation of the sound source direction may be performed by recording a program for realizing the functions of the sound source direction estimation device 11 according to the present invention on a computer-readable recording medium, and having a computer system read and execute the program recorded on the recording medium. Here, the "computer system" includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a homepage providing environment (or display environment). The term "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes media that hold the program for a certain period of time, such as a volatile memory (RAM) inside a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0188]
The program may be transmitted from a computer system storing the program in a storage device or the like to another computer system via a transmission medium or by transmission waves within the transmission medium. Here, the "transmission medium" for transmitting the program is a medium having a function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. Further, the program may realize only a part of the functions described above. Furthermore, it may be a so-called difference file (difference program) that realizes the above-described functions in combination with a program already recorded in the computer system.
[0189]
DESCRIPTION OF SYMBOLS: 1 ... sound processing device, 11 ... sound source direction estimation device, 12 ... sound source localization unit, 13 ... sound source separation unit, 14 ... sound feature quantity extraction unit, 15 ... speech recognition unit, 16 ... recognition result output unit, 101 ... speech input unit, 102 ... short-time Fourier transform unit, 103 ... first correlation matrix calculation unit, 104 ... noise database, 105 ... second correlation matrix calculation unit, 106 ... matrix calculation unit, 107 ... number-of-layers calculation unit, 108 ... first spatial spectrum calculation unit, 109 ... second spatial spectrum calculation unit, 110 ... peak search unit, 111 ... STF processing unit, 112 ... transfer function storage unit, 113 ... output unit, S ... number of layers, dS ... search interval (granularity), A ... transfer function