Patent Translate Powered by EPO and Google

Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

03-05-2019 1

DESCRIPTION JP2014059180

Abstract: To provide a sound source direction estimation device, a sound source direction estimation method, and a sound source direction estimation program capable of improving the processing efficiency of sound source direction estimation. A transfer function storage unit 112 stores a transfer function from the sound source for each direction of the sound source. A number-of-layers calculation unit 107 calculates the number of layers to be searched and the search interval for each layer, based on a desired search range and a desired spatial resolution for searching the direction of the sound source. A sound source localization unit (peak search unit 110, STF processing unit 111) searches the search range at each search interval using the transfer functions, estimates the direction of the sound source based on the search result, and updates the search range and the search interval, based on the number of layers calculated by the calculation unit, to estimate the direction of the sound source. [Selected figure] Figure 3

Sound source direction estimation device, sound source direction estimation method, and sound source direction estimation program

[0001] The present invention relates to a sound source direction estimation device, a sound source direction estimation method, and a sound source direction estimation program.

[0002] It has been proposed that speech recognition be performed on the acoustic signal emitted from a sound source.
In such speech recognition, a noise signal is separated from the acoustic signal, or the noise signal is suppressed, in order to extract the acoustic signal of the target to be recognized. Then, for example, speech recognition is performed on the extracted acoustic signal. In such a system, in order to extract the speech to be recognized, the direction in which the acoustic signal is emitted is either known in advance or estimated.

[0003] For example, in Patent Document 1, the type of the sound source of an acoustic signal is identified based on the acoustic feature amount of the input acoustic signal, and the sound source direction is estimated for the acoustic signal of the identified sound source type. Moreover, in Patent Document 1, the sound source direction is estimated (hereinafter referred to as sound source localization) using the GEVD (generalized eigenvalue decomposition)-MUSIC method or the GSVD (generalized singular value decomposition)-MUSIC method. When the GEVD-MUSIC method or the GSVD-MUSIC method is used in this way, the computational efficiency of sound source localization in the sound source estimation device is increased.

[0004] Patent Document 1: Japanese Unexamined Patent Application Publication No. 2012-42465

[0005] However, in such a sound source estimation device, transfer functions associated with each sound source direction within the range to be searched are obtained in advance by measurement or calculation and stored in the device. The device then calculates the spatial spectrum using the transfer functions stored in the storage unit, and determines the sound source direction based on the calculated spatial spectrum. Therefore, in order to increase the estimation accuracy of the sound source direction, transfer functions associated with a large number of sound source directions are required.
Consequently, the conventional sound source estimation apparatus has the problem that, in order to increase the estimation accuracy of the sound source direction, the amount of calculation becomes large and the calculation efficiency becomes poor.

[0006] The present invention has been made in view of the above points, and an object of the present invention is to provide a sound source direction estimation device, a sound source direction estimation method, and a sound source direction estimation program capable of improving the processing efficiency of sound source direction estimation.

[0007] (1) In order to achieve the above object, a sound source direction estimation device according to one aspect of the present invention includes: a transfer function storage unit that stores a transfer function from the sound source for each direction of the sound source; a calculation unit that calculates the number of layers to be searched and the search interval for each layer, based on a desired search range and a desired spatial resolution for searching the direction of the sound source; and a sound source localization unit that searches the search range at each search interval using the transfer functions, estimates the direction of the sound source based on the search result, and, based on the estimated direction, updates the search range and the search interval until the number of layers calculated by the calculation unit is reached, thereby estimating the direction of the sound source.

[0008] (2) Further, in the sound source direction estimation device according to the above aspect, the sound source localization unit may handle the nth (n is an integer of 1 or more) layer in the search range defined in advance as follows.
The search is performed at the specified search interval and, based on the search result, at least one of the search intervals in the search range is set as the search range of the (n+1)th layer. The search interval of the (n+1)th layer is then updated based on the updated search range of the (n+1)th layer and the desired spatial resolution. Using the updated search range and search interval of the (n+1)th layer and the transfer functions corresponding to the directions, the direction of the sound source may be updated and estimated until the layer count (n+1) reaches the number of layers calculated by the calculation unit.

[0009] (3) Further, in the sound source direction estimation device according to the above aspect, the calculation unit may calculate the number of layers and the search interval so that the number of searches in each layer is equal across all layers.

[0010] (4) Further, in the sound source direction estimation device according to the above aspect, the calculation unit may calculate the number of layers and the search interval so that the total number of searches over all layers is minimized.

[0011] (5) Further, in the sound source direction estimation device according to the above aspect, the sound source localization unit may determine whether the transfer function corresponding to an azimuth at the search interval is stored in the transfer function storage unit.
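Aspects (3) and (4) can be made concrete with a short sketch: given the desired search range and spatial resolution, choose a number of layers and an equal per-layer search count that minimize the total number of searches. This is only an illustration of the planning idea, not the patented algorithm; the function name and the equal-count assumption per layer are ours.

```python
import math

def plan_layers(search_range_deg, resolution_deg):
    """Choose the number of layers and the equal per-layer search count so
    that the total number of searches over all layers is minimized
    (aspects (3) and (4)); also return the search interval of each layer."""
    n_points = math.ceil(search_range_deg / resolution_deg)  # candidate directions
    best = None
    for n_layers in range(1, n_points.bit_length() + 1):
        per_layer = math.ceil(n_points ** (1.0 / n_layers))  # equal searches per layer
        total = n_layers * per_layer
        if best is None or total < best[2]:
            best = (n_layers, per_layer, total)
    n_layers, per_layer, _ = best
    # Layer l spacing: the range shrinks by a factor per_layer at every layer
    intervals = [search_range_deg / per_layer ** (l + 1) for l in range(n_layers)]
    return n_layers, per_layer, intervals

# Example: a 360-degree range at 1-degree resolution needs only a few dozen
# searches instead of 360 exhaustive ones.
print(plan_layers(360.0, 1.0))
```

Note how the final layer's interval is at or below the requested resolution, while the total search count grows roughly logarithmically rather than linearly with the number of candidate directions.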
When it is determined that the transfer function corresponding to the direction at the search interval is stored in the transfer function storage unit, the transfer function corresponding to that direction is read from the transfer function storage unit. When it is determined that the transfer function corresponding to the direction at the search interval is not stored in the transfer function storage unit, an interpolated transfer function is calculated by interpolation from the stored transfer functions. The direction of the sound source may then be estimated using the read transfer function or the calculated interpolated transfer function.

[0012] (6) Further, in the sound source direction estimation device according to one aspect of the present invention, the calculation unit may calculate the number of layers and the search interval so as to minimize the calculation cost, which is the total of the search cost for the number of searches in the search range and the interpolation cost for the interpolation.

[0013] (7) In order to achieve the above object, a sound source direction estimation method according to an aspect of the present invention is a sound source direction estimation method in a sound source direction estimation device, comprising the following procedures for searching the direction of a sound source.
It includes: a calculation procedure in which a calculation unit calculates the number of layers to be searched and the search interval for each layer based on a desired search range and a desired spatial resolution; a procedure in which a sound source localization unit searches the search range at each search interval using the transfer functions from the sound source stored in the transfer function storage unit for each direction of the sound source, and estimates the direction of the sound source based on the search result; and a procedure in which the sound source localization unit updates the search range and the search interval, based on the estimated direction of the sound source, until the number of layers calculated in the calculation procedure is reached, thereby estimating the direction of the sound source.

[0014] (8) In order to achieve the above object, a sound source direction estimation program according to an aspect of the present invention causes a computer of a sound source direction estimation device to execute: a calculation procedure for calculating the number of layers to be searched and the search interval for each layer based on a desired search range and a desired spatial resolution for searching the direction of a sound source; a procedure for searching the search range at each search interval using the transfer function from the sound source stored in the transfer function storage unit for each direction of the sound source, and estimating the direction of the sound source based on the search result; and a procedure for updating the search range and the search interval, based on the estimated direction of the sound source, until the number of layers calculated in the calculation procedure is reached, thereby estimating the direction of the sound source.

[0015] According to aspects (1), (2), (7), and (8) of the present invention, the calculation unit operates on the desired search range and the desired spatial resolution for searching the direction of the sound source.
Since the number of layers to be searched and the search interval for each layer are calculated from these, the processing time required to estimate the direction of the sound source can be shortened. According to aspects (3) and (4) of the present invention, the calculation unit can calculate the number of layers to be searched and the search interval for each layer with a small amount of computation, so the processing time required to estimate the direction of the sound source can be shortened. According to aspect (5) of the present invention, when the transfer function corresponding to the direction to be searched is not stored in the transfer function storage unit, the sound source localization unit searches that direction using an interpolated transfer function obtained from the stored transfer functions, and when the transfer function corresponding to the direction to be searched is stored in the transfer function storage unit, it estimates the direction of the sound source using the read transfer function; the direction of the sound source can therefore be estimated accurately. According to aspect (6) of the present invention, the number of layers and the search interval are calculated so that the calculation cost, which is the total of the search cost for the number of searches and the interpolation cost for the interpolation, is minimized, so the processing time required to estimate the direction of the sound source can be shortened.

[0016] FIG. 1 is a diagram explaining the environment in the direction estimation of a sound source. FIG. 2 is a diagram showing the outline of the processing of the sound processing apparatus according to the first embodiment. FIG. 3 is a block diagram of a sound source localization unit according to the first embodiment. FIG. 4 is a block diagram of a sound source separation unit according to the first embodiment. FIG. 5 is a flowchart showing the procedure of the hierarchical search process according to the first embodiment.
FIG. 6 is a diagram explaining the procedure of the hierarchical search process according to the first embodiment. FIG. 7 is a diagram explaining the calculation procedure of the layer values according to the first embodiment. FIG. 8 is a diagram explaining the calculation procedure of the number of layers and the search intervals according to the second embodiment. FIG. 9 is a diagram explaining the search cost and the interpolation cost of the second layer according to the second embodiment. FIG. 10 is a diagram explaining the evaluation conditions. FIG. 11 is a diagram explaining the search points in the evaluation. FIG. 12 is a diagram showing an example of the results of evaluating the error of the transfer function using PEE, SD, and SDR when the azimuth direction is changed, and the linearity of the interpolation coefficient. FIG. 13 is a diagram showing an example of the average error of the direction-of-arrival estimation of the sound source with and without interpolation. FIG. 14 is a diagram showing an example of the results of evaluating the calculation cost. FIG. 15 is a diagram showing an example of the results of speech recognition on the acoustic signals separated for each sound source.

[0017] Hereinafter, embodiments of the present invention will be described in detail. The present invention is not limited to these embodiments, and various modifications are possible within the scope of its technical idea.

[0018] First Embodiment. First, an outline of the present embodiment will be described. FIG. 1 is a diagram explaining the environment in the direction estimation of a sound source (hereinafter referred to as sound source direction estimation). In FIG. 1, the left-right direction of the drawing is the Y direction, and the direction perpendicular to the Y direction is the X direction. Reference numerals 2a and 2b denote sound sources.
The sound source direction estimation device (sound source direction estimation unit) 11 of the present embodiment performs sound source localization and sound source separation in order to recognize such a plurality of sound sources. In the example shown in FIG. 1, the sound source 2a is in the direction of angle a1 counterclockwise with respect to the X axis, and the sound source 2b is in the direction of angle a2 clockwise with respect to the X axis.

[0019] FIG. 2 is a diagram showing an outline of the processing of the sound processing apparatus 1 according to the present embodiment. As shown in FIG. 2, the sound processing apparatus 1 comprises a sound source direction estimation unit 11, an acoustic feature extraction unit (Acoustic Feature Extraction) 14, a speech recognition unit (Automatic Speech Recognition) 15, and a recognition result output unit 16. The sound source direction estimation unit 11 includes a sound source localization unit (Sound Source Localization) 12 and a sound source separation unit (Sound Source Separation) 13.

[0020] The sound source localization unit 12 has an acoustic signal input unit and performs, for example, a Fourier transform on the acoustic signals collected by a plurality of microphones. The sound source localization unit 12 estimates the sound source directions with respect to the plurality of Fourier-transformed acoustic signals (hereinafter referred to as sound source localization). The sound source localization unit 12 outputs information indicating the result of the sound source localization to the sound source separation unit 13. The sound source separation unit 13 performs sound source separation of the target sound and noise based on the information indicating the result of the sound source localization input from the sound source localization unit 12.
The sound source separation unit 13 outputs a signal corresponding to each separated sound source to the acoustic feature quantity extraction unit 14. The target sound is, for example, speech emitted by a plurality of speakers. Noise is sound other than the target sound, for example, wind noise or sound emitted by other devices placed in the room where the sound was collected.

[0021] The acoustic feature quantity extraction unit 14 extracts the acoustic feature quantity of the signal corresponding to each sound source input from the sound source separation unit 13, and outputs information indicating the extracted acoustic feature quantity to the speech recognition unit 15. When the sound sources include speech uttered by a human being, the speech recognition unit 15 performs speech recognition based on the acoustic feature quantity input from the acoustic feature quantity extraction unit 14, and outputs the recognition result to the recognition result output unit 16. The recognition result output unit 16 is, for example, a display device or an acoustic signal output device. The recognition result output unit 16 displays information based on the recognition result input from the speech recognition unit 15 on, for example, a display unit. The sound processing apparatus 1 or the sound source direction estimation unit 11 may be incorporated in, for example, a robot, a car, an aircraft (including a helicopter), or a portable terminal. The portable terminal is, for example, a portable telephone terminal, a portable information terminal, or a portable game terminal.

[0022] In the present embodiment, the sound source direction is estimated hierarchically in order to reduce the calculation cost while improving the spatial resolution, and thereby the estimation accuracy, of the sound source direction.
To estimate the sound source direction hierarchically, the sound source direction estimation unit 11 (FIG. 2) first divides the entire predetermined search range into coarse search intervals, searches at those coarse intervals, and estimates the sound source direction. Next, the sound source direction estimation unit 11 selects the one search interval corresponding to the estimated direction, and sets the selected interval as a new search range. Then, the sound source direction estimation unit 11 divides the new search range into finer search intervals and estimates the sound source direction by searching at those finer intervals. Thus, in the present embodiment, each search is performed with a search interval narrower than the previous one. As a result, the processing time required to estimate the sound source direction can be shortened.

[0023] FIG. 3 is a block diagram of the sound source localization unit 12 according to the present embodiment. As shown in FIG. 3, the sound source localization unit 12 comprises a voice input unit 101, a short-time Fourier transform unit 102, a first correlation matrix calculation unit 103, a noise database 104, a second correlation matrix calculation unit 105, a matrix calculation unit 106, a number-of-layers calculation unit 107, a first spatial spectrum calculation unit 108, a second spatial spectrum calculation unit 109, a peak search unit 110, a spatial transfer function (STF) processing unit 111, a transfer function storage unit 112, and an output unit 113.

[0024] The voice input unit 101 includes M sound collecting means (for example, microphones) (M is an integer of 2 or more), and the respective sound collecting means are disposed at different positions. The voice input unit 101 is, for example, a microphone array provided with M microphones.
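The coarse-to-fine procedure of paragraph [0022] can be sketched as follows. This is a minimal illustration, not the patented implementation: `spatial_power` is a hypothetical placeholder standing in for the averaged spatial spectrum evaluated at one candidate azimuth.

```python
def hierarchical_search(spatial_power, lo, hi, n_layers, per_layer):
    """Coarse-to-fine direction search: at each layer, evaluate `per_layer`
    candidate directions inside [lo, hi] degrees, then shrink the range to
    the interval around the best candidate and repeat with finer spacing."""
    for _ in range(n_layers):
        step = (hi - lo) / per_layer
        candidates = [lo + step * (i + 0.5) for i in range(per_layer)]
        best = max(candidates, key=spatial_power)   # peak of the spatial spectrum
        lo, hi = best - step / 2, best + step / 2   # selected interval -> new range
    return best

# Toy spectrum with a single peak at 73.4 degrees; 6 layers of 3 searches
# localize it to sub-degree accuracy with only 18 spectrum evaluations.
estimate = hierarchical_search(lambda a: -(a - 73.4) ** 2, 0.0, 360.0,
                               n_layers=6, per_layer=3)
```

An exhaustive 1-degree scan of the same range would need 360 evaluations, which is the saving the hierarchy provides.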
The voice input unit 101 outputs the sound wave received by each sound collecting means to the short-time Fourier transform unit 102 as a one-channel acoustic signal. The voice input unit 101 may convert the acoustic signal from an analog signal to a digital signal, and output the digital signal to the short-time Fourier transform unit 102.

[0025] The short-time Fourier transform unit 102 performs a short-time Fourier transform (STFT) for each frame in the time domain on the acoustic signal of each channel input from the voice input unit 101, to generate an input signal in the frequency domain. The short-time Fourier transform is a transform that performs a Fourier transform while multiplying the signal by a shifting window function. A frame is a time interval of a predetermined length (frame length), or the signal included in that interval. The frame length is, for example, 10 [msec]. The short-time Fourier transform unit 102 generates, for each frequency ω and frame time f, an M-column input vector X(ω, f) from the input signals obtained by the short-time Fourier transform of each frame, and outputs the generated input vector X(ω, f) to the first correlation matrix calculation unit 103.

[0026] The first correlation matrix calculation unit 103 uses the input vector X(ω, f) input from the short-time Fourier transform unit 102 to calculate the spatial correlation matrix R(ω, f) for each frequency ω and frame time f by the following equation (1). The first correlation matrix calculation unit 103 outputs the calculated spatial correlation matrix R(ω, f) to the matrix calculation unit 106. The spatial correlation matrix R(ω, f) is a square matrix of M rows and M columns.

[0027] R(ω, f) = (1/TR) Σ_{τ=0}^{TR−1} X(ω, f−τ) X*(ω, f−τ)   (1)

[0028] In equation (1), f represents the current frame time, and TR is the length (number of frames) of the section used when calculating the spatial correlation matrix R(ω, f). The length of this section is called the window length.
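The framing and transform of paragraph [0025] can be sketched in numpy; this is a generic STFT illustration (Hann window, 50% hop, and the array shapes are our choices, not taken from the patent).

```python
import numpy as np

def stft_multichannel(x, frame_len=160, hop=80):
    """Short-time Fourier transform of an (M, T) multichannel signal.
    Returns an array of shape (n_freq, n_frames, M): for each frequency
    bin omega and frame time f, the M-column input vector X(omega, f)."""
    window = np.hanning(frame_len)
    m, t = x.shape
    n_frames = 1 + (t - frame_len) // hop
    frames = np.stack([x[:, i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)  # (M, n_frames, frame_len)
    spec = np.fft.rfft(frames, axis=-1)                    # (M, n_frames, n_freq)
    return np.transpose(spec, (2, 1, 0))                   # (n_freq, n_frames, M)

# 8 microphones, 1 second at 16 kHz; a 10 ms frame is 160 samples
x = np.random.randn(8, 16000)
X = stft_multichannel(x)
```

Each slice `X[k, f, :]` is one input vector X(ω[k], f) as fed to the first correlation matrix calculation unit.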
τ is a variable indicating a frame time (not limited to the current frame time), and takes values in the range 0 to TR−1. * represents the complex conjugate transpose operator of a vector or matrix. In equation (1), the spatial correlation matrix R(ω, f) is smoothed over TR frames to improve the robustness against noise. That is, for each pair of channels n (n is an integer of 1 or more) and m (m is an integer of 1 or more different from n), equation (1) averages the product of the input signal vector and its complex conjugate over the window of TR frames ending at the current frame time.

[0029] The noise database 104 stores in advance a noise source matrix N(ω, f) (hereinafter referred to as a noise matrix) for each frequency ω and frame time f. The noise is sound other than the target sound, for example, wind noise or sound emitted by other devices placed in the room where the sound is collected.

[0030] The second correlation matrix calculation unit 105 reads the noise matrix N(ω, f) stored in the noise database 104, and uses the read N(ω, f) to calculate a noise correlation matrix K(ω, f) for each frequency ω and frame time f by the following equation (2). The noise correlation matrix K(ω, f) is calculated from the signal to be suppressed (whitened) at the time of localization, but in this embodiment it is assumed to be a unit matrix for simplicity. The second correlation matrix calculation unit 105 outputs the calculated noise correlation matrix K(ω, f) to the matrix calculation unit 106.

[0031] K(ω, f) = (1/Tk) Σ_{τk=0}^{Tk−1} N(ω, f−τk) N*(ω, f−τk)   (2)

[0032] In equation (2), Tk is the length (number of frames) of the section used when calculating the noise correlation matrix K(ω, f). τk is a variable indicating a frame time (not limited to the current frame time), and takes values in the range 0 to Tk−1.
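A smoothed spatial correlation matrix of the kind described for equation (1) can be computed as below; this is a generic numpy illustration with assumed array shapes, not the patented code.

```python
import numpy as np

def spatial_correlation(X, f, t_r):
    """Smoothed spatial correlation matrix R(omega, f): for each frequency
    bin, average X X^* over the last t_r frames ending at frame f, as in
    equation (1). `X` has shape (n_freq, n_frames, M)."""
    n_freq, _, m = X.shape
    r = np.zeros((n_freq, m, m), dtype=complex)
    for tau in range(t_r):
        v = X[:, f - tau, :]                       # (n_freq, M) input vectors
        r += v[:, :, None] * v[:, None, :].conj()  # outer product X X^* per bin
    return r / t_r

# Example with random data: 8 mics, 50 frames, 65 frequency bins,
# window length TR = 10 frames
X = np.random.randn(65, 50, 8) + 1j * np.random.randn(65, 50, 8)
R = spatial_correlation(X, f=49, t_r=10)
```

The same routine applied to a stored noise matrix N(ω, f) yields the noise correlation matrix K(ω, f) of equation (2).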
N* represents the complex conjugate transpose of the matrix N. As in equation (2), the noise correlation matrix K(ω, f) expresses the correlation between the noise signal of channel n and the noise signal of channel m: the product of the noise matrix N and its complex conjugate transpose is averaged over the window of Tk frames ending at the current frame time.

[0033] The matrix calculation unit 106 uses the noise correlation matrix K(ω, f) input from the second correlation matrix calculation unit 105 and the spatial correlation matrix R(ω, f) input from the first correlation matrix calculation unit 103 to calculate eigenvectors for each frequency ω and frame time f. The matrix calculation unit 106 outputs the calculated eigenvectors to the number-of-layers calculation unit 107. The matrix calculation unit 106 multiplies the spatial correlation matrix R(ω, f) from the left by the inverse matrix K^{−1}(ω, f) of the noise correlation matrix K(ω, f), and computes for K^{−1}(ω, f)R(ω, f) the GSVD (generalized singular value decomposition) represented by the following equation (3), as used in the MUSIC method. The matrix calculation unit 106 calculates the vector El(ω, f) and the eigenvalue matrix Λ(ω, f) satisfying the relationship of equation (3) by the GSVD-MUSIC method. By this processing, the matrix calculation unit 106 decomposes the observed space into a subspace of the target sound and a noise subspace. In the present embodiment, the noise can be whitened by equation (3).

[0034] K^{−1}(ω, f) R(ω, f) = El(ω, f) Λ(ω, f) Er*(ω, f)   (3)

[0035] In equation (3), El(ω, f) contains the left singular vectors and Er*(ω, f) is the complex conjugate transpose of the right singular vectors. El(ω, f) has the eigenvectors e1(ω, f), ..., eM(ω, f) as its elements, corresponding respectively to the eigenvalues λ1(ω, f), ..., λM(ω, f).
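The whitening and decomposition of equation (3) can be sketched with numpy's SVD; the toy steering vector and the function name are illustrative assumptions, not the patented code.

```python
import numpy as np

def noise_subspace(r, k, n_sources):
    """Whiten R by K^{-1} and take the singular value decomposition of
    K^{-1} R as in equation (3); the last M - n_sources left singular
    vectors span the noise subspace used by the MUSIC spectrum."""
    el, lam, erh = np.linalg.svd(np.linalg.inv(k) @ r)  # El, Lambda, Er*
    return el[:, n_sources:]                            # noise-subspace vectors

# With K = I (as assumed in this embodiment) and a single source, the
# noise subspace comes out orthogonal to the source's steering vector.
m = 8
a = np.exp(1j * np.linspace(0, np.pi, m))[:, None]  # toy steering vector
r = a @ a.conj().T                                   # rank-1 target correlation
en = noise_subspace(r, np.eye(m), n_sources=1)
```

This orthogonality is what makes the denominator of the MUSIC spatial spectrum small, and hence the spectrum large, in the true source direction.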
Here, M is the number of microphones. The eigenvalue matrix Λ(ω, f) is given by the following equation (4), in which diag represents a diagonal matrix.

[0036] Λ(ω, f) = diag(λ1(ω, f), ..., λM(ω, f))   (4)

[0037] The number-of-layers calculation unit 107 calculates the number of layers to be searched by the first spatial spectrum calculation unit 108 through the STF processing unit 111 and the search interval for the search, and outputs the calculated number of layers and search interval, together with the eigenvectors input from the matrix calculation unit 106, to the first spatial spectrum calculation unit 108. This search interval corresponds to the spatial resolution.

[0038] The first spatial spectrum calculation unit 108 uses the number of layers and the search interval input from the number-of-layers calculation unit 107, the eigenvectors, and the transfer function A or the interpolated transfer function A^ input from the STF processing unit 111 to calculate, for each frequency ω and sound source direction ψ, the spatial spectrum P(ω, ψ, f) before integration over the frequency ω, by the following equation (5). When the interpolated transfer function A^ is input from the STF processing unit 111, the first spatial spectrum calculation unit 108 performs the calculation using the interpolated transfer function A^ instead of the transfer function A in equation (5). The first spatial spectrum calculation unit 108 outputs the calculated spatial spectrum P(ω, ψ, f) and the number of layers input from the number-of-layers calculation unit 107 to the second spatial spectrum calculation unit 109.

[0039] P(ω, ψ, f) = |A*(ω, ψ) A(ω, ψ)| / Σ_{m=Ls+1}^{M} |A*(ω, ψ) em(ω, f)|   (5)

[0040] In equation (5), |...| denotes an absolute value. Equation (5) represents the ratio of the overall transfer function component to its component lying in the noise subspace. In equation (5), em(ω, f) are the left singular vectors El(ω, f) (= e1(ω, f), ..., eM(ω, f)) of equation (3). A* is the complex conjugate transpose of the transfer function A. Ls is the number of sound sources, and is an integer of 0 or more.
Further, A(ω, ψ) is a measured, known transfer function stored in advance in the transfer function storage unit 112, and is given by the following equation (6).

[0041] A(ω, ψi) = [A1(ω, ψi), ..., AM(ω, ψi)]^T   (6)

[0042] Here, ψi is a sound source direction measured in advance, that is, a direction for which the transfer function A is stored. i is an integer of 1 or more. M is the number of microphones. T indicates transposition.

[0043] The second spatial spectrum calculation unit 109 averages the spatial spectrum P(ω, ψ, f) input from the first spatial spectrum calculation unit 108 in the ω direction using the following equation (7), to calculate the averaged spatial spectrum P(ψ, f). The second spatial spectrum calculation unit 109 outputs the calculated averaged spatial spectrum P(ψ, f), together with the number of layers and the search interval input from the first spatial spectrum calculation unit 108, to the peak search unit 110.

[0044] P(ψ, f) = (1/(kh − kl + 1)) Σ_{k=kl}^{kh} P(ω[k], ψ, f)   (7)

[0045] In equation (7), ω[k] represents the frequency corresponding to the kth frequency bin. A frequency bin is a discretized frequency. kh and kl are the indices of the frequency bins corresponding to the maximum frequency (upper limit) and the minimum frequency (lower limit) of the frequency domain. In equation (7), kh − kl + 1 is the number of spatial spectra P(ω, ψ, f) that appear as terms of the summation (Σ). The reason 1 is added is that, since the frequencies ω are discretized, the spatial spectra P(ω[k], ψ, f) at both ends of the frequency band, the upper limit frequency and the lower limit frequency, are both included in the sum.

[0046] The peak search unit 110 receives the averaged spatial spectrum P(ψ, f), the number of layers, and the search interval from the second spatial spectrum calculation unit 109. The peak search unit 110 detects the azimuths ψ^[l] (l is a value in the range of 1 to Ls) that are peak values of the input averaged spatial spectrum P(ψ, f).
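The computations around equations (5) and (7) can be combined into one sketch: per frequency bin, take the ratio of the steering-vector power to its projection onto the noise subspace, then average over bins. The function name, array layout, and the small guard against division by zero are our assumptions.

```python
import numpy as np

def music_spectrum(steering, noise_sub):
    """Spatial spectrum in the spirit of equations (5) and (7).
    `steering`:  (n_freq, M) transfer function A(omega, psi) for one direction.
    `noise_sub`: (n_freq, M, M - Ls) noise-subspace vectors e_m(omega, f).
    Returns the frequency-averaged spectrum P(psi, f) for that direction."""
    num = np.abs(np.einsum('fm,fm->f', steering.conj(), steering))       # |A* A|
    den = np.abs(np.einsum('fm,fmk->fk', steering.conj(), noise_sub)).sum(axis=1)
    p = num / np.maximum(den, 1e-12)   # P(omega, psi, f), guarded against /0
    return p.mean()                    # average over frequency bins (equation (7))

# Toy check: build a noise subspace orthogonal to one steering vector per bin;
# the spectrum is then much larger for that direction than for any other.
rng = np.random.default_rng(0)
a_true = rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))
noise_sub = np.stack([np.linalg.qr(a_true[f][:, None], mode='complete')[0][:, 1:]
                      for f in range(16)])
a_other = rng.standard_normal((16, 4)) + 1j * rng.standard_normal((16, 4))
```

The peak search unit then simply takes the candidate directions at which this averaged spectrum attains its largest values.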
The peak search unit 110 determines, according to the number of layers input from the second spatial spectrum calculation unit 109, whether the estimation process of the first spatial spectrum calculation unit 108 through the STF processing unit 111 has been completed. When the peak search unit 110 determines that the estimation process has been completed for the number of layers, it outputs the detected peak value ψ^[l] to the output unit 113 as the estimated azimuth ψ. When the peak search unit 110 determines that the estimation process has not been completed for the number of layers, it outputs the detected peak value ψ^[l], the number of layers, and the search interval to the STF processing unit 111.

[0047] The STF processing unit 111 uses the peak value ψ^[l], the number of layers, and the search interval input from the peak search unit 110 to read the transfer function A from the transfer function storage unit 112, or to calculate the interpolated transfer function A^. Specifically, it determines whether the transfer function A corresponding to the direction to be searched is stored in the transfer function storage unit 112. When it is determined that the transfer function A corresponding to the direction to be searched is stored in the transfer function storage unit 112, the STF processing unit 111 reads the corresponding transfer function A from the transfer function storage unit 112. When it is determined that the transfer function A corresponding to the direction to be searched is not stored in the transfer function storage unit 112, the STF processing unit 111 calculates the corresponding interpolated transfer function A^. The STF processing unit 111 outputs the read transfer function A or the calculated interpolated transfer function A^ to the first spatial spectrum calculation unit 108.
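The document does not specify the interpolation scheme at this point, so the sketch below uses simple linear interpolation between the two nearest measured directions as an illustrative stand-in; the database layout and function name are likewise hypothetical.

```python
import numpy as np

def get_transfer_function(stf_db, azimuth):
    """Return the stored transfer function A for `azimuth` if it was measured,
    otherwise an interpolated transfer function A^ obtained by linear
    interpolation between the two nearest measured directions.
    `stf_db`: dict mapping measured azimuth (degrees) -> (n_freq, M) array.
    Assumes `azimuth` lies between two measured directions."""
    if azimuth in stf_db:
        return stf_db[azimuth]                       # read A from storage
    angles = sorted(stf_db)
    lo = max(a for a in angles if a < azimuth)
    hi = min(a for a in angles if a > azimuth)
    w = (azimuth - lo) / (hi - lo)                   # interpolation coefficient
    return (1 - w) * stf_db[lo] + w * stf_db[hi]     # interpolated A^

# Hypothetical database measured every 30 degrees
db = {a: np.full((4, 8), a, dtype=complex) for a in range(0, 361, 30)}
a_hat = get_transfer_function(db, 45.0)
```

In the deeper layers of the hierarchical search, the candidate directions fall between the measured grid points more and more often, which is why the interpolation cost appears in the calculation cost of aspect (6).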
When the search for a given search range (also referred to as one layer) is completed, the STF processing unit 111 calculates a new search range and a new search interval based on the direction in which the peak value was detected. The STF processing unit 111 outputs the calculated new search range and search interval to the first spatial spectrum calculation unit 108.

[0048] The output unit 113 outputs, for example, the estimated azimuth ψ input from the peak search unit 110 to the sound source separation unit 13 (see FIG. 2). Also, for example, when only the sound source direction estimation unit 11 is mounted on a robot, the output unit 113 may be a display device (not shown). In this case, the output unit 113 may display the estimated azimuth ψ on the display unit as character information or graphically.

[0049] As described above, the sound source direction estimation device of the present embodiment includes: a transfer function storage unit (112) that stores the transfer function from the sound source for each direction of the sound source; a calculation unit (number-of-layers calculation unit 107) that calculates the number of layers to be searched and the search interval for each layer based on the desired search range and the desired spatial resolution for searching the direction; and a sound source localization unit (peak search unit 110 and STF processing unit 111) that searches the search range at each search interval using the transfer functions, estimates the direction of the sound source based on the search result, and updates the search range and the search interval, based on the estimated direction, until the number of layers calculated by the calculation unit is reached, thereby estimating the direction of the sound source. With such a configuration, in the present embodiment, the number-of-layers calculation unit 107 calculates the number of layers to be searched and the search interval.
Next, the sound source localization unit 12 (FIG. 3) divides the entire predetermined search range into coarse search intervals and estimates the sound source direction by searching at those coarse intervals. The sound source localization unit 12 then selects the search interval corresponding to the estimated direction and updates the search range, taking the selected interval as the new search range. The sound source localization unit 12 then divides the new search range into smaller search intervals and estimates the sound source direction by searching at these finer intervals. As a result, in the present embodiment, the processing time required to estimate the sound source direction can be shortened. [0050] FIG. 4 is a block diagram of the sound source separation unit 13 according to the present embodiment. As shown in FIG. 4, the sound source separation unit 13 includes a first cost calculation unit 121, a second cost calculation unit 122, and a sound source separation processing unit 123. The first cost calculation unit 121 calculates a cost function JGC(W) using GSS (geometrically constrained sound source separation), a sound source separation method based on geometric constraints, and outputs the cost function JGC(W) to the sound source separation processing unit 123. The first cost calculation unit 121 realizes the geometric constraint by designating a transfer function as D in the cost function JGC(W) expressed by the following equation (8). The cost function JGC measures the degree of the geometric constraint and is used to calculate the separation matrix W. [0051] [0052] In equation (8), EGC is given by the following equation (9). [0053] [0054] In equation (9), W is the separation matrix and D is the transfer function matrix.
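The coarse-to-fine search described above can be sketched as follows. This is an illustrative toy, not the device's implementation: `score` stands in for the averaged spatial spectrum P(ψ, f), and the refinement ratio and function names are assumptions made for the example.

```python
# Hypothetical sketch of the coarse-to-fine ("hierarchical") direction search:
# search the whole range at a coarse interval, then narrow the range to the
# two intervals adjacent to the peak and refine the interval, repeating until
# the desired resolution is reached.

def hierarchical_search(score, lo=0.0, hi=180.0, coarse=45.0, fine=1.0):
    """Search [lo, hi] at interval `coarse`, narrowing around the peak
    until the interval reaches `fine`. Returns the peak direction."""
    d = coarse
    while True:
        # evaluate the spectrum stand-in at every search point of this layer
        points = []
        p = lo
        while p <= hi + 1e-9:
            points.append(p)
            p += d
        best = max(points, key=score)
        if d <= fine:
            return best
        # new search range: the two intervals adjacent to the detected peak
        lo, hi = max(lo, best - d), min(hi, best + d)
        d = d / 4  # refine the interval (the ratio here is a free choice)

# toy spectrum peaking at 73 degrees
peak = hierarchical_search(lambda psi: -abs(psi - 73.0))
```

Only a handful of points are evaluated per layer, instead of the 181 points a flat 1-degree sweep of 0 to 180 degrees would require, which is the processing-time saving the text describes.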
In the present embodiment, by using the interpolation transfer function A^ interpolated by the STF processing unit 111, or the read transfer function A, as the transfer function matrix D, the geometric constraint can be applied in the correct sound source direction. [0055] The second cost calculation unit 122 calculates a cost function JHDSS(W) using HDSS, a sound source separation method based on high-order decorrelation that extends independent component analysis, and outputs the cost function JHDSS(W) to the sound source separation processing unit 123. [0056] The sound source separation processing unit 123 uses the cost function JGC(W) input from the first cost calculation unit 121 and the cost function JHDSS(W) input from the second cost calculation unit 122 to calculate the cost function JGHDSS(W) according to the following equation (10). That is, the present embodiment uses a method that integrates the GC method and the HDSS method. In the present invention, the integrated method is referred to as the GHDSS (geometric high-order decorrelation-based source separation) method. [0057] [0058] In equation (10), α is a scalar with a value of 0 or more and 1 or less. Further, in equation (10), the cost function JHDSS(W) is expressed as the following equation (11). [0059] [0060] In equation (11), E[·] indicates the expected value, and the bold letter E indicates the matrix E. Eφ is a cost function used as a substitute for the correlation matrix E in DSS (decorrelation-based source separation). In equation (11), EHDSS is given by equation (12). [0061] [0062] In equation (12), bold y represents a vector, and the matrix E is defined as E = yy<H> − I, where I is the identity matrix and the superscript H denotes the conjugate transpose. Further, φ(y) is a non-linear function given by the following equation (13).
[0063] [0064] In equation (13), p(yi) is the joint probability density function (pdf) of y. Although φ(yi) can be defined in various ways, a hyperbolic-tangent-based function such as the following equation (14) may be used as one example of φ(yi). [0065] [0066] In equation (14), η is a scaling parameter. [0067] The sound source separation processing unit 123 adaptively calculates the separation matrix W so as to minimize the cost functions JGC and JHDSS according to the following equation (15). The sound source separation processing unit 123 separates the multi-channel acoustic signal input to the sound input unit 101 (see FIG. 3) into components for each sound source based on the separation matrix W estimated in this way. The sound source separation processing unit 123 outputs the separated components of each sound source to, for example, the acoustic feature quantity extraction unit 14. [0068] [0069] In equation (15), t is the update step, μHDSS is the step size used when updating the error matrix, and μGC is the step size used when updating the geometric error matrix. Further, J'HDSS(Wt) is the HDSS error matrix, obtained by differentiating the cost function JHDSS with respect to each element of the input. J'GC(Wt) is the geometric error matrix, obtained by differentiating the cost function JGC with respect to each element of the input. The step size μGC is given by the following equation (16) using EGC and the geometric error matrix J'GC(Wt). The step size μHDSS is given by the following equation (17) using EHDSS and the error matrix J'HDSS(Wt). [0070] [0071] [0072] As described above, the sound source separation unit 13 of the present embodiment performs sound source separation by sequentially calculating the separation matrix W using the GHDSS method.
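The update of equation (15) is a gradient step on the combined costs. The following is a minimal numerical sketch, not the patented method: since the text does not reproduce equations (8)-(9), it assumes the common form EGC = WD − I (W should undo the mixing described by the transfer function matrix D), omits the HDSS term, and uses a fixed step size where the text derives adaptive ones.

```python
import numpy as np

# Illustrative gradient-descent update of a separation matrix W against a
# geometric-constraint cost, as an assumed stand-in for equation (15).

rng = np.random.default_rng(0)
D = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))  # toy transfer functions
W = np.eye(3, dtype=complex)                                        # initial separation matrix

def j_gc(W):
    """Assumed geometric cost: squared Frobenius norm of E_GC = W D - I."""
    E = W @ D - np.eye(3)
    return float(np.real(np.trace(E @ E.conj().T)))

mu_gc = 0.01                     # fixed step size (illustrative assumption)
costs = [j_gc(W)]
for _ in range(200):
    E = W @ D - np.eye(3)
    W = W - mu_gc * (E @ D.conj().T)   # gradient of ||E||^2 w.r.t. W (up to a factor of 2)
    costs.append(j_gc(W))
```

The cost decreases across iterations, which is the behavior the adaptive update in equation (15) is designed to produce for the full GHDSS cost.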
In the present embodiment, an example in which the sound source separation unit 13 performs sound source separation using the GHDSS method has been described, but sound source separation may also be performed using a known BSS (blind source separation) method or the like. [0073] Next, the hierarchical search process performed by the number-of-layers calculation unit 107 through the STF processing unit 111 will be described. FIG. 5 is a flowchart showing the procedure of the hierarchical search process according to the present embodiment. FIG. 6 is a diagram for explaining the procedure of the hierarchical search process according to the present embodiment. As shown in FIG. 6, the search range is from 0 to d0. A specific search range is, for example, 0 degrees to 180 degrees. The search interval of the first layer is d1, and the search interval of the second layer is d2. The spatial resolution desired by the user of the sound source direction estimation device 11 is the search interval dS of the S-th layer. Although FIG. 6 shows the case where each layer has four search intervals in order to simplify the description, the number of search intervals is not limited to this. The symbols p1 to p5 indicate measurement points at which the transfer function was measured in advance. As a specific example, the measurement point p1 is 0 degrees, the measurement point p2 is 30 degrees, the measurement point p3 is 90 degrees, the measurement point p4 is 130 degrees, and the measurement point p5 is 180 degrees. Reference numerals q11 to q15 denote the search points in the first layer. As a specific example, the search point q11 is 0 degrees, the search point q12 is 45 degrees, the search point q13 is 90 degrees, the search point q14 is 135 degrees, and the search point q15 is 180 degrees. Note that the transfer functions A(ω, ψp1) to A(ω, ψp4) corresponding to the azimuths p1, p2, p3, and p4 are stored in the transfer function storage unit 112 in advance.
[0074] (Step S1) The user of the sound source direction estimation device 11 selects a desired spatial resolution. The number-of-layers calculation unit 107 calculates the number of layers S and the search interval δi using the following equations (18) and (19), based on the desired spatial resolution and search range selected by the user of the apparatus, and outputs the calculated number of layers S, the search interval δi, and the eigenvectors input from the matrix calculation unit 106 to the first spatial spectrum calculation unit 108. The calculation of the number of layers S and the search interval δi will be described later. After step S1 ends, the process proceeds to step S2. [0075] [0076] [0077] In equation (19), ds is the search interval of the s-th layer, where s is an integer of 1 or more and S or less. [0078] (Step S2) The STF processing unit 111 determines whether the transfer function A(ω, ψ) corresponding to the first search point (q11 in FIG. 6) is stored in the transfer function storage unit 112. If the STF processing unit 111 determines that the transfer function A(ω, ψ) corresponding to the first search point is stored in the transfer function storage unit 112 (step S2; Yes), the process proceeds to step S3. If it determines that the transfer function A(ω, ψ) corresponding to the search point is not stored in the transfer function storage unit 112 (step S2; No), the process proceeds to step S4. [0079] (Step S3) When it is determined that the transfer function A(ω, ψ) corresponding to the first search point is stored in the transfer function storage unit 112, the STF processing unit 111 reads from the transfer function storage unit 112 the transfer function A(ω, ψ) corresponding to the first search point (q11 in FIG. 6), and outputs the read transfer function A(ω, ψ) to the first spatial spectrum calculation unit 108.
After step S3 ends, the process proceeds to step S5. [0080] (Step S4) When it is determined that the transfer function A(ω, ψ) corresponding to the first search point is not stored in the transfer function storage unit 112, the STF processing unit 111 determines the interpolation transfer function A^ of that search point by interpolation, using the transfer functions corresponding to the two measurement points adjacent to it. As an example, in the case of the search point q12, the interpolation transfer function A^ of the search point q12 is calculated using the transfer functions A of the two adjacent measurement points p2 and p3. The STF processing unit 111 may use, for example, the FDLI method, the TDLI method, or the FTDLI method for the interpolation. Each interpolation method will be described later. [0081] (Step S5) The first spatial spectrum calculation unit 108 calculates the spatial spectrum P(ω, ψ, f) using the transfer function A or the interpolation transfer function A^ input from the STF processing unit 111. Next, the second spatial spectrum calculation unit 109 calculates an averaged spatial spectrum P(ψ, f) using the spatial spectrum P(ω, ψ, f) calculated by the first spatial spectrum calculation unit 108, and outputs information indicating the calculated averaged spatial spectrum P(ψ, f) and the number of the search point to the peak search unit 110. The search point number is a number assigned to each of the search points q11 to q15. After step S5 ends, the process proceeds to step S6. [0082] (Step S6) The peak search unit 110 determines whether the search of all the search points in the search range has been completed.
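The document does not reproduce the spatial spectrum formula of step S5, but since the background cites the MUSIC family of methods, the textbook MUSIC spectrum can serve as a hedged illustration: the spectrum is large when the steering vector (transfer function) a(ψ) is orthogonal to the noise-subspace eigenvectors En. All names and the toy setup below are assumptions for the example.

```python
import numpy as np

# Illustrative MUSIC-style spatial spectrum: P = (a^H a) / (a^H En En^H a).

def music_spectrum(a, En):
    """Spectrum value for steering vector a given noise subspace En."""
    a = a.reshape(-1, 1)
    num = float(np.real(a.conj().T @ a))
    den = float(np.real(a.conj().T @ En @ En.conj().T @ a)) + 1e-12
    return num / den

# Toy check: 4-mic array, one source. The spectrum should peak at the true
# steering vector, not at an unrelated direction.
rng = np.random.default_rng(1)
a_true = np.exp(1j * rng.uniform(0, 2 * np.pi, 4))
R = np.outer(a_true, a_true.conj()) + 0.01 * np.eye(4)  # covariance, slight noise
w, V = np.linalg.eigh(R)          # eigenvalues in ascending order
En = V[:, :-1]                    # the 3 smallest eigenvectors = noise subspace
a_other = np.exp(1j * rng.uniform(0, 2 * np.pi, 4))

P_true = music_spectrum(a_true, En)
P_other = music_spectrum(a_other, En)
```

The peak search of step S6 onward then simply compares such spectrum values across the search points of the current layer.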
If the peak search unit 110 determines that the search of all the search points in the search range has been completed (step S6; Yes), the process proceeds to step S8; if it determines that the search of all the search points in the search range has not been completed (step S6; No), the process proceeds to step S7. [0083] (Step S7) If it is determined that the search of all the search points in the search range has not been completed, the peak search unit 110 outputs information indicating an instruction to search the next search point to the STF processing unit 111. For example, when the processing of the search point q11 is completed in step S5, the next search point is the search point q12. The search may be performed from 0 to d0 or from d0 to 0. [0084] (Step S8) If it is determined that the search of all the search points in the search range has been completed, the peak search unit 110 extracts, from among all the averaged spatial spectra P(ψ, f) calculated over the search range, the azimuth ψ<[i]> at which the maximum value occurs. After step S8 ends, the process proceeds to step S9. [0085] (Step S9) The peak search unit 110 determines whether the search has ended in all the layers calculated in step S1. If the peak search unit 110 determines that the search has ended in all the layers (step S9; Yes), the process proceeds to step S10; if it determines that the search has not ended in all the layers (step S9; No), the process proceeds to step S11. [0086] (Step S10) The peak search unit 110 outputs the extracted azimuth ψ<[i]> to the output unit 113 as the estimated azimuth ψ^. Next, the output unit 113 outputs the estimated azimuth ψ^ input from the peak search unit 110 to, for example, the acoustic feature quantity extraction unit 14 (see FIG. 2), and the sound source direction estimation process ends.
[0087] (Step S11) When it is determined that the search has not been completed in all the layers calculated in step S1, the STF processing unit 111 selects the two search points adjacent to the azimuth ψ<[i]> at which the peak value was detected, namely ψ<[i]> − δi and ψ<[i]> + δi, as the section to be searched next. Hereinafter, ψ<[i]> − δi is written as ψ<[i−]>, and ψ<[i]> + δi as ψ<[i+]>. As an example, in FIG. 6, when the search point at which the peak value is detected is q13, the STF processing unit 111 selects the search points q12 and q14 as the two adjacent search points. In this case, the search range of the second layer is from the search point q12 to the search point q14, and the width of the search range is 2d1. After step S11 ends, the process proceeds to step S12. [0088] (Step S12) The STF processing unit 111 calculates the search interval d used in the search of the second layer. The calculation of the search interval d in each layer will be described later. After step S12 ends, the process returns to step S2. The first spatial spectrum calculation unit 108 through the STF processing unit 111 repeat steps S2 to S9 in the second layer, calculating the averaged spatial spectrum P(ψ, f) using the transfer function (or interpolation transfer function) of each interval and estimating the azimuth ψ having the peak value. After the search of the second layer is completed, the STF processing unit 111 selects the two search points adjacent to the search point having the peak value and calculates the search interval d to be used in the next layer. Thereafter, the first spatial spectrum calculation unit 108 through the STF processing unit 111 perform sound source direction estimation by repeating steps S2 to S12 for all the layers calculated in step S1. [0089] Next, the interpolation performed by the STF processing unit 111 will be described.
The STF processing unit 111 interpolates the transfer function A using, for example, any one of the (1) FDLI method, (2) TDLI method, and (3) FTDLI method described below to generate the interpolation transfer function A^. The transfer functions A at the two measured points are expressed as the following equations (20) and (21). These transfer functions A are stored in advance in the transfer function storage unit 112. The transfer function A(ω, ψ1) is the transfer function in the direction ψ1, and the transfer function A(ω, ψ2) is the transfer function in the direction ψ2. [0090] [0091] [0092] (1) FDLI (Frequency Domain Linear or Bi-Linear Interpolation) Method In the FDLI (linear interpolation in the frequency domain) method, the STF processing unit 111 performs linear interpolation between the two measurement points using the following equation (22) to calculate the interpolation transfer function A^. [0093] [0094] In equation (22), D<A> is an interpolation coefficient with a value of 0 or more and 1 or less. The FDLI method is characterized in that the phase can be linearly interpolated. [0095] (2) TDLI (Time Domain Linear Interpolation) Method The TDLI (linear interpolation in the time domain) method is expressed as the following equation (23). [0096] [0097] In equation (23), dψ^ is given by the following equation (24). [0098] In equations (23) and (24), kψ1 and kψ2 are geometrically determined coefficients, dψ1 and dψ2 are geometrically determined time delays, and am(t, ψ) is the time-domain representation of A(ω, ψ). If equation (23) is regarded as amplitude interpolation and equation (24) as phase interpolation, the TDLI method in the frequency domain is expressed as the following equation (25).
[0099] [0100] As described above, in the TDLI method the STF processing unit 111 performs linear interpolation between the two measurement points using equation (25) to calculate the interpolation transfer function A^. The TDLI method is characterized in that the amplitude can be linearly interpolated. [0101] (3) FTDLI (Frequency Time Domain Linear or Bi-Linear Interpolation) Method The FTDLI method is a linear interpolation method combining the FDLI method and the TDLI method described above. The FTDLI method obtains the phase from linear interpolation in the frequency domain and the amplitude from linear interpolation in the time domain. In the FTDLI method, the STF processing unit 111 performs linear interpolation between the two measurement points using the following equation (26) to calculate the interpolation transfer function A^. [0102] [0103] Next, the calculation procedure of the interpolation transfer function A^ in the FTDLI method will be described. (I) First, the interpolation transfer functions are calculated using equations (22) and (25). In the following description, the obtained interpolation transfer functions are expressed as the following equations (27) and (28), respectively. [0104] [0105] [0106] (II) Next, equations (27) and (28) are decomposed into phase and amplitude as equations (29) and (30), respectively. [0107] [0108] [0109] The interpolation transfer function A^(ω, ψ^) is then expressed by the above equation (26) using equations (29) and (30). The FTDLI method is characterized in that both phase and amplitude can be linearly interpolated. [0110] Next, the procedure by which the number-of-layers calculation unit 107 calculates the optimal number of layers and the search intervals in step S1 described above will be described. FIG. 7 is a diagram for explaining the calculation procedure of the number of layers according to the present embodiment.
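The interpolation ideas above can be sketched as follows. This is a hedged illustration, not the patented formulas: since equations (22) and (26) are not reproduced in this text, the FDLI blend is assumed to be the standard linear form DA·A1 + (1 − DA)·A2, and the FTDLI amplitude step approximates the time-domain (TDLI-style) interpolation by blending magnitudes directly.

```python
import numpy as np

# Assumed sketch of FDLI and FTDLI interpolation between transfer functions
# measured at two directions psi1 and psi2, with interpolation coefficient DA.

def fdli(A1, A2, DA):
    """Frequency-domain linear blend (assumed form of equation (22))."""
    return DA * A1 + (1.0 - DA) * A2

def ftdli(A1, A2, DA):
    """FTDLI sketch: phase from the FDLI blend, linearly blended amplitude."""
    phase = np.angle(fdli(A1, A2, DA))
    amp = DA * np.abs(A1) + (1.0 - DA) * np.abs(A2)
    return amp * np.exp(1j * phase)

# Midpoint (DA = 0.5) between two single-bin transfer functions
A1 = np.array([1.0 * np.exp(1j * 0.2)])
A2 = np.array([3.0 * np.exp(1j * 0.6)])
A_mid = ftdli(A1, A2, 0.5)
```

Note how the FTDLI amplitude is exactly the linear blend of the two magnitudes (here 2.0), which is the "amplitude can be linearly interpolated" property, while the phase is taken from the complex (FDLI) blend.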
In the following description, the left end of the search range is set to 0 for convenience in each layer. Also, for convenience, it is assumed that in each layer the peak value lies within some search interval between two search points, for example within an interval d1 in the first layer. [0111] In the following description, the known values are the search range d0 of the first layer and the search interval dS of the S-th layer. The values to be obtained are the number of layers S that minimizes the number of searches and the search interval ds of each layer. First, the interval between the directions ψ1 and ψ2 of the two transfer functions is set to d0. That is, when the range from 0 to d0 between the directions ψ1 and ψ2 of the two transfer functions is searched at the interval dS, the number of searches is d0/dS. In order to minimize the number of searches and reduce the calculation cost, the present embodiment performs the search hierarchically as follows. As shown in FIG. 7, the number of layers is S, and the interval of the s-th layer is ds, where s is 1 or more and S or less. That is, the interval of the first layer is d1, and the interval of the second layer is d2 (where d2 is smaller than d1). In the layering, the search of the s-th layer is performed first, and the one interval ds containing the peak value is selected by the search. Next, the selected interval ds is set as the (s+1)-th layer, which is searched at the interval ds+1. In the search of the (s+1)-th layer, the one interval ds+1 containing the peak value is selected, and the selected interval ds+1 is treated as the (s+2)-th layer. This process is called layering. In the layering, the intervals of the upper layers are coarse, and the intervals of the lower layers become progressively finer.
Then, in the S-th layer, since the interval is dS and the range is dS−1, the number of searches is dS−1/dS. The interval ds is hereinafter also referred to as the granularity. [0112] (When the Number of Layers is Two) As shown in FIG. 7, the total number of searches F(d1) when searching down to the second layer (S = 2) is expressed by the following equation (31). [0113] [0114] In equation (31), the only variable is d1. Therefore, partially differentiating equation (31) with respect to d1 to obtain the minimum value of equation (31) yields equation (32). [0115] [0116] Therefore, the interval d1 at which equation (32) becomes 0 is √(d0dS). For this value of d1, the number of searches in the first layer is the following equation (33), and the number of searches in the second layer is the following equation (34). [0117] [0118] [0119] By equations (33) and (34), the total number of searches can be minimized by equalizing the number of searches in each layer. [0120] (When the Number of Layers is s) Next, the condition for minimizing the total number of searches when the number of layers is s will be described. Even when the number of layers is s, the total number of searches can be minimized by equalizing the number of searches in each layer. In the following, this is shown by proof by contradiction. When the number of layers is s, the total number of searches is d0/d1 + d1/d2 + … + d(s−1)/ds. Assume that there is a layering that minimizes the total number of searches while d(i−1)/di ≠ di/d(i+1) for some i, where i is an integer of 1 or more and S or less. For the two layers from i−1 to i+1, as shown in the two-layer example, (d(i−1)/di) + (di/d(i+1)) is minimized when the per-layer search counts satisfy d(i−1)/di = di/d(i+1), so this assumption leads to a contradiction.
That is, no layer with d(i−1)/di ≠ di/d(i+1) can exist. Consequently, regardless of the number of layers, the total number of searches is minimized when the number of searches in each layer is equal. Therefore, when the number of layers is s, the condition for minimizing the total number of searches is d0/d1 = d1/d2 = … = d(s−1)/ds. Rearranging this conditional expression gives the following expression (35). [0121] [0122] From equation (35), the granularity dS−1 that minimizes the total number of searches is the following equation (36). [0123] [0124] (When the Number of Layers is S) Next, when the number of layers is S, the total number of searches G(S) calculated under the condition d0/d1 = d1/d2 = … = d(S−1)/dS becomes the following equation (37). [0125] [0126] Next, to obtain the S that minimizes equation (37), partial differentiation of equation (37) with respect to S results in the following equation (38). [0127] [0128] The total number of searches is minimized when equation (38) is zero. Therefore, the total number of searches is minimized at S = log(d0/dS). The granularity ds of each layer can be calculated by the following equation (39), obtained by rearranging the conditional expression in the same manner as equations (35) and (36) and substituting S = log(d0/dS) into the rearranged form. [0129] [0130] As described above, the number-of-layers calculation unit 107 (see FIG. 3) calculates the number of layers S that minimizes the total number of searches using S = log(d0/dS), and calculates the interval (granularity) ds that minimizes the total number of searches using equation (39). [0131] As described above, in the present embodiment, a wide range is first searched for the sound source at coarse resolution, and a narrower range is then searched at finer intervals based on the search result.
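The derivation above can be checked numerically. With equal per-layer search counts, the total is G(S) = S·(d0/dS)^(1/S), and the equal-ratio condition gives the layer granularities; the values d0 = 180 degrees and dS = 1 degree below are example choices, not the document's.

```python
import math

# Numerical check of the layer-count optimum: G(S) = S * (d0/dS)**(1/S)
# is smallest near S = ln(d0/dS), and is far below the flat search d0/dS.

d0, dS = 180.0, 1.0            # search range and desired resolution (example)

def total_searches(S):
    """Total searches over S layers with equal per-layer search counts."""
    return S * (d0 / dS) ** (1.0 / S)

def granularity(s, S):
    """Interval of layer s under the equal-ratio condition d(s-1)/ds = const."""
    return d0 * (dS / d0) ** (s / S)

S_opt = math.log(d0 / dS)               # continuous optimum, about 5.19
best_int = min(range(1, 21), key=total_searches)  # best integer layer count
```

Here `best_int` is 5, `total_searches(5)` is about 14, versus 180 searches for the flat sweep; and `granularity(0, 5)` and `granularity(5, 5)` recover d0 and dS as required.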
According to the present embodiment, the number of searches can be reduced by performing the search hierarchically in this way, so that the calculation cost can be reduced while maintaining the estimation performance of the sound source localization. [0132] Second Embodiment In the first embodiment, an example was described in which the number-of-layers calculation unit 107 calculates the number of layers S that minimizes the total number of searches and the granularity ds. In this embodiment, an example is described in which interpolation is performed between the directions ψ1 and ψ2 of the transfer functions, and the search also takes the interpolated directions into account. [0133] FIG. 8 is a diagram for explaining the calculation procedure of the number of layers and the intervals according to the present embodiment. The difference from FIG. 7 is that the search cost is used. The search cost is the cost of the search. Let ct be the calculation cost of searching one search point. Further, some of the search points require interpolation, and the calculation cost of interpolating one point is denoted cI. In FIG. 8, a search point is placed at every interval d1 in the first layer. The calculation cost due to layering is the sum of the search cost and the cost of interpolation (hereinafter, the interpolation cost). The search cost is the number of searches multiplied by the calculation cost ct required to search one search point. [0134] In the following description, the known values are the search range d0 of the first layer and the search interval dS of the S-th layer. The values to be obtained are the number of layers S that minimizes the calculation cost and the search interval ds of each layer. As shown in FIG.
8, the search cost of the first layer is (d0/d1)ct, the search cost of the second layer is (d1/d2)ct, and the search cost of the S-th layer is (dS−1/dS)ct. In the present embodiment, the number-of-layers calculation unit 107 calculates the number of layers S that minimizes the calculation cost due to layering, and the granularity ds. [0135] FIG. 9 is a diagram for explaining the search cost and the interpolation cost of the s-th layer according to the present embodiment. First, the number of searches in the s-th layer is ds−1/ds, and the search cost is (ds−1/ds)ct. The search cost (ds−1/ds)ct is a fixed value. [0136] Next, the interpolation cost of the s-th layer will be described. In the s-th layer, the interval (granularity) is ds and the range is ds−1. Interpolation is assumed to be required at Is of the search points. In FIG. 8, search points indicated by black circles are points that need no interpolation, and search points indicated by white circles are points that need interpolation. The interpolation cost in this case is IscI. If no search point needs interpolation, the interpolation cost is zero; if all the search points need interpolation, the interpolation cost is (ds−1/ds)cI. Therefore, the interpolation cost IscI of the s-th layer ranges from 0 to (ds−1/ds)cI. Writing it as (ds−1/ds)cs, with cs between 0 and cI, the interpolation cost is treated as a fixed value. [0137] Therefore, since the calculation cost of the s-th layer is the sum of the search cost and the interpolation cost, it is expressed by the following equation (40). [0138] [0139] Accordingly, the calculation cost G<˜>(S) over all the layers is expressed by the following equation (41). [0140] [0141] In equation (41), i is 1 or more and S or less. The case of search cost only corresponds to ci = 0 in equation (41).
Further, ct is normalized to 1 so that the total number of searches G(S) of equation (37) in the first embodiment and G<˜>(S) in the case of ci = 0 become equal. With ct normalized, ct + ci on the right side of equation (41) becomes 1 + ci. This (1 + ci) is newly defined as the variable Ci, where Ci is 1 or more. Replacing ct + ci with the variable Ci turns equation (41) into the following equation (42). [0142] [0143] (When the Number of Layers is Two) First, the case where the number of layers is two will be described. The total number of searches F(d1) in this case is (d0/d1)C1 + (d1/dS)CS. In the total number of searches F(d1), only d1 is a variable. To obtain the minimum of the total number of searches F(d1), partial differentiation of F(d1) with respect to d1 results in the following equation (43). [0144] [0145] The d1 at which equation (43) becomes 0 is √(C1d0dS/CS). The number of searches in the first layer is then the following equation (44), and the number of searches in the second layer is the following equation (45). [0146] [0147] [0148] From equations (44) and (45), the calculation cost from d0 to dS can be minimized by equalizing the per-layer search counts weighted by C1 and CS. Similarly, when the number of layers is s, the condition for minimizing the calculation cost is that the weighted per-layer calculation costs (di−1/di)Ci are equal, where i is 1 or more and s or less. [0149] (When the Number of Layers is S) Next, when the number of layers is S, the calculation cost G<˜>(S) of equation (42) is determined under the condition (d0/d1)C1 = (d1/d2)C2 = … = (d(S−1)/dS)CS. Rearranging this conditional expression gives the following expression (46).
[0150] [0151] From equation (46), the granularity dS−1 that minimizes the calculation cost is the following equation (47). [0152] [0153] From equation (47), the calculation cost G<˜>(S) becomes the following equation (48). [0154] [0155] Next, to obtain the S that minimizes equation (48), partial differentiation of equation (48) with respect to S yields the following equation (49). [0156] [0157] The S at which equation (49) becomes 0 is given by equation (50). [0158] [0159] Therefore, when the number of layers is S, the interval (granularity) ds that minimizes the calculation cost is the following equation (51). [0160] [0161] As described above, the number-of-layers calculation unit 107 (see FIG. 3) calculates the number of layers S that minimizes the calculation cost using equation (50), and calculates the interval (granularity) ds that minimizes the calculation cost using equation (51). [0162] [Experimental Results] Next, the results of implementing the sound source direction estimation device 11 according to the first or second embodiment on the open source robot audition software HARK (HRI-JP Audition for Robots with Kyoto University) and evaluating it are described. In the following evaluation, the sound source direction estimation device 11 of the first embodiment was attached to a humanoid robot and evaluated. FIG. 10 is a diagram for explaining the evaluation conditions. As shown in FIG. 10, the evaluation conditions were: a sampling frequency of 16 kHz, 512 FFT points, a shift length of 160, and eight microphones installed in a circular array. Further, the number of speakers was 4, the distance to each speaker was 1.5 m, and the size of the evaluation room was 7 × 4 m. The reverberation time of the room was 0.2 sec. [0163] FIG. 11 is a diagram for explaining the search points in the evaluation. In FIG.
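The weighted-cost condition above can also be checked numerically without reproducing equations (47)-(51): if every layer's weighted search count (d(i−1)/di)·Ci equals some constant k, then the product of the ratios must equal d0/dS, which fixes k. The weights Ci = 1 + ci below are illustrative assumptions, not the document's values.

```python
import math

# Sketch of the cost-weighted layering: total cost S*k under the condition
# that each layer's weighted search count (d_{i-1}/d_i) * C_i equals k.

def layered_cost(d0, dS, C):
    """Total weighted cost for len(C) layers with weights C_i >= 1."""
    S = len(C)
    # ratio_i = k / C_i and prod(ratio_i) = d0/dS  =>  k^S = (d0/dS) * prod(C)
    k = ((d0 / dS) * math.prod(C)) ** (1.0 / S)
    return S * k

cost3 = layered_cost(180.0, 1.0, [1.2] * 3)   # 3 layers, all weights 1.2
cost5 = layered_cost(180.0, 1.0, [1.2] * 5)   # 5 layers, all weights 1.2
flat = (180.0 / 1.0) * 1.2                     # no layering: one fine sweep
```

With these example numbers, five layers cost less than three, and both are far cheaper than the flat sweep, matching the conclusion that layering minimizes the weighted calculation cost.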
11, the left-right direction of the drawing is the Y direction, and the direction perpendicular to the Y direction is the X direction. Reference numeral 301 denotes the robot to which the sound source direction estimation device 11 is attached, and reference numerals 311 to 314 denote speakers, i.e., sound sources. The sound source 311 is at −60 degrees clockwise from the X axis, the sound source 312 is at −20 degrees clockwise from the X axis, the sound source 313 is on the X axis, and the sound source 314 is at 60 [deg] counterclockwise from the X axis.

[0164]

As for the search points, the azimuth ψ1 was fixed at 0 degrees, and the azimuth ψ2 was evaluated at 30, 60, 90, and 120 degrees. The interpolation points between the azimuths were spaced at 1 degree, and the error e<->[ψ1, ψ2] with respect to the 1-degree transfer functions obtained by measurement was calculated using the following equation (52).

[0165]

[0166]

In equation (52), f(ω[k], ψ^[i]) is the interpolation error of the frequency bin corresponding to frequency ω[k] at interpolation point ψ^[i]. In the evaluation, kl and kh of equation (52) were selected so that the frequency ω, the low-frequency region used for sound source localization, fell in the range of 500 [Hz] to 2800 [Hz]. In equation (52), iψ is the number of interpolation points per degree in the range larger than ψ1 and smaller than ψ2. In the following evaluation, the phase estimation error (PEE), the spectral distortion (SD), and the signal-to-distortion ratio (SDR) were used as the error indices f(ω[k], ψ^[i]). PEE is expressed by the following equation (53) and is an index of the phase error.

[0167]

[0168]

SD is expressed by the following equation (54) and represents the error amplitude.

[0169]

[0170]

SDR is expressed by the following equation (55) and indicates the error of the transfer function itself.
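Equations (53) to (55) appear only as images in the original publication and are not reproduced in this text, so the sketch below assumes conventional definitions of the three indices: the mean absolute phase difference for PEE, the RMS log-magnitude difference for SD, and the ratio of reference energy to error energy for SDR. The patent's exact formulations may differ.

```python
import numpy as np

def pee(h_ref, h_est):
    # Phase Estimation Error: mean absolute phase difference [rad]
    # between reference and estimated transfer functions (assumed
    # form of equation (53)).
    dphi = np.angle(h_est * np.conj(h_ref))
    return float(np.mean(np.abs(dphi)))

def sd(h_ref, h_est):
    # Spectral Distortion: RMS log-magnitude difference [dB]
    # (assumed form of equation (54)).
    eps = 1e-12
    diff_db = 20.0 * np.log10((np.abs(h_est) + eps) / (np.abs(h_ref) + eps))
    return float(np.sqrt(np.mean(diff_db ** 2)))

def sdr(h_ref, h_est):
    # Signal-to-Distortion Ratio [dB]: error of the transfer function
    # itself (assumed form of equation (55)); higher means less error.
    err = np.sum(np.abs(h_ref - h_est) ** 2)
    ref = np.sum(np.abs(h_ref) ** 2)
    return float(10.0 * np.log10(ref / (err + 1e-12)))
```

Averaging any of these per-bin errors over the bins kl to kh and over the interpolation points, as equation (52) describes, yields e<->[ψ1, ψ2].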
[0171]

[0172]

Hereinafter, e1<->, e2<->, and e3<-> denote e<->[ψ1, ψ2] when using PEE, SD, and SDR, respectively. Further, the linearity of the interpolation coefficient DA was evaluated using the following equation (56).

[0173]

[0174]

In equation (56), the interpolation coefficient DA[i] is the value of DA at which the interpolation error is minimized. In equation (56), the interpolation coefficient DA being close to linear with respect to the interpolation point means that DA can be determined by (ψ^ − ψ2)/(ψ1 − ψ2), which makes the method practical. Hereinafter, d1<->, d2<->, and d3<-> denote d<->[ψ1, ψ2] when using PEE, SD, and SDR, respectively.

[0175]

FIG. 12 shows an example of the evaluation results for the error of the transfer function and the linearity of the interpolation coefficient, using PEE, SD, and SDR, as the azimuth ψ2 is changed. FIG. 12 shows the results of evaluating the FDLI method, the TDLI method, and the FTDLI method with each index. In FIGS. 12(a) to 12(f), the horizontal axis is the relative angle and the vertical axis is the average error. FIG. 12(a) shows the evaluation values of e1<->, FIG. 12(b) those of e2<->, and FIG. 12(c) those of e3<->. FIG. 12(d) shows the evaluation values of d1<->, FIG. 12(e) those of d2<->, and FIG. 12(f) those of d3<->.

[0176]

As shown in FIGS. 12(a) to 12(c), the FTDLI method of the present invention yields a smaller error in all of e1<->, e2<->, and e3<-> than the FDLI method and the TDLI method. Further, as shown in FIGS.
12(d) to 12(f), the FTDLI method also yields a smaller error in the linearity of DA, for all of d1<->, d2<->, and d3<->, than the FDLI method and the TDLI method. This means that, in terms of the linearity of DA as well, the FTDLI method is closest to linear.

[0177]

FIG. 13 shows an example of the average error of the direction-of-arrival estimation of a sound source with and without interpolation. FIG. 13 shows the results of evaluating the average error of the direction-of-arrival estimation with no interpolation (NONE), the FDLI method, the TDLI method, and the FTDLI method. In FIG. 13, the horizontal axis is the measurement interval (relative angle) of the transfer function, and the vertical axis is the average error of the direction-of-arrival estimation. In the evaluation, white noise was reproduced at every 1 degree (with the direction ψ in the range of −90 degrees to 90 degrees), and the error between the reproduction direction of the sound source and the estimated direction was calculated. The measurement interval of the transfer function was varied so that the azimuth interval was 1, 5, 10, 30, 60, 90, and 120 degrees, and transfer functions at 1-degree intervals were then generated by interpolation.

[0178]

As shown in FIG. 13, at every relative angle, any of the interpolation methods gives a smaller average error than performing no interpolation. Furthermore, the average error is smaller with the FTDLI method than with no interpolation or with interpolation by the FDLI method or the TDLI method.
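As a minimal illustration of the kind of interpolation compared here, a frequency-domain linear interpolation between two measured transfer functions can be sketched as below. The coefficient DA = (ψ^ − ψ2)/(ψ1 − ψ2) follows the linearity discussed for equation (56); the actual FDLI, TDLI, and FTDLI formulations of the patent are not reproduced in this text, so this is an assumed simplification.

```python
import numpy as np

def interpolate_tf_fdli(h1, h2, psi1, psi2, psi):
    # Frequency-domain linear interpolation (FDLI-style sketch).
    # h1, h2: complex transfer functions measured at azimuths psi1, psi2.
    # psi: target azimuth between psi1 and psi2.
    # DA = (psi - psi2) / (psi1 - psi2) is 1 at psi1 and 0 at psi2.
    da = (psi - psi2) / (psi1 - psi2)
    return da * h1 + (1.0 - da) * h2

# Example: estimate a transfer function at 45 degrees from placeholder
# measurements taken 30 degrees apart (257 frequency bins assumed).
h30 = np.ones(257, dtype=complex)
h60 = np.full(257, 1j, dtype=complex)
h45 = interpolate_tf_fdli(h30, h60, 30.0, 60.0, 45.0)
```

Applying the same coefficient to time-domain impulse responses would correspond to a TDLI-style variant; combining both domains is the idea behind FTDLI.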
Thus, according to the present invention, generating transfer functions between intervals by interpolation from transfer functions generated and stored in advance at 30-degree intervals yields the same sound source direction estimation accuracy as using transfer functions measured at 1-degree intervals.

[0179]

Next, an example of the results of evaluating the calculation cost of the hierarchical search will be described. In the evaluation, the number of sound sources was varied, and the average processing time for the sound source localization calculation (equations (3) to (7)) over 1000 frames was computed. The sound source search used two layers: after searching at a 10-degree interval in the first layer, a 1-degree interval search was performed in the second layer. The cases of one, two, three, and four sound sources were evaluated. FIG. 14 shows an example of the evaluation results for the calculation cost. In FIG. 14, the second row shows the computation time when the search is performed without layering, and the third row shows the computation time when the search is layered according to the present invention.

[0180]
When there are four sound sources, the computation time without layering is 17.5 [msec], and with layering it is 8.7 [msec]. Thus, even when four sound sources are present and uttering simultaneously, the processing finishes within a frame period of 10 [msec] or less. Further, as shown in FIG. 14, the hierarchical search of the present embodiment reduces the average processing time to 50% or less of that of the non-hierarchical search, regardless of the number of sound sources. As described above, according to the present invention, the processing efficiency of estimating the sound source direction can be improved.

[0181]

FIG. 15 shows an example of the speech recognition results for acoustic signals separated for each sound source. In FIG. 15, the horizontal axis is the relative angle and the vertical axis is the word correct rate (WCR). The evaluation used speech data of 10 people x 216 words from the ATR speech database to recognize speech uttered simultaneously by four speakers. The transfer functions used for the evaluation were measured at the same intervals as in FIG. Interpolation was applied to these transfer functions to estimate transfer functions at 1-degree intervals, and the WCR was evaluated as the interval of the measured transfer functions was changed. No interpolation (NONE), the FDLI method, the TDLI method, and the FTDLI method were evaluated, similarly to FIG.

[0182]

As shown in FIG. 15, when interpolation is not used, the recognition performance degrades once the transfer function interval exceeds 30 degrees. When the FTDLI method is used as the interpolation method, the recognition performance is maintained better than with the other interpolation methods.
For example, when the transfer functions are 90 degrees apart, the recognition rate improves by about 7% compared to the FDLI method.

[0183]

As described above, when the sound source direction estimation device 11 according to the first or second embodiment is applied to the sound processing device 1, the amount of calculation involved in the search can be reduced while maintaining the same word correct rate as when transfer functions with a fine, per-degree spatial resolution are used. As a result, the sound source direction can be estimated accurately in real time for each frame.

[0184]

In the first and second embodiments, an example has been described in which the matrix calculation unit 106 calculates a GSVD matrix and estimates the direction of the sound source based on the calculated GSVD matrix, but the present invention is not limited thereto. The matrix calculation unit 106 may calculate a GEVD (generalized eigenvalue decomposition) matrix and estimate the sound source direction based on the calculated GEVD matrix. Also in this case, as described in the first or second embodiment, the sound source direction estimation device 11 calculates the number of layers and the search interval (granularity) that minimize the total number of searches, or the number of layers and the granularity that minimize the calculation cost. Then, based on the calculated number of layers and search interval (granularity), the sound source direction estimation device 11 first searches at a coarse interval and selects the next search range from the results, then narrows the search interval within the selected range. The sound source direction estimation device 11 can thereby reduce the calculation cost while estimating the sound source direction accurately.
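The coarse-to-fine procedure described above, together with the two-layer optimum d1 = √(C1·d0·dS/CS) derived earlier, can be sketched in Python as follows. The spectrum function here is a synthetic placeholder, not the MUSIC spatial spectrum of equations (3) to (7), and the cost weights C1, CS are illustrative.

```python
import math

def optimal_mid_granularity(d0, dS, C1, CS):
    # Two-layer case: the d1 minimizing the total search count
    # F(d1) = (d0/d1)*C1 + (d1/dS)*CS is d1 = sqrt(C1*d0*dS/CS),
    # the zero of the derivative in equation (43).
    return math.sqrt(C1 * d0 * dS / CS)

def hierarchical_peak_search(spectrum, lo, hi, d_coarse, d_fine):
    # Coarse-to-fine search: scan [lo, hi] at interval d_coarse, pick the
    # direction with the largest spatial-spectrum value, then rescan the
    # surrounding +/- d_coarse neighborhood at interval d_fine.
    def argmax_on_grid(a, b, step):
        best, best_val = a, spectrum(a)
        x = a
        while x <= b:
            v = spectrum(x)
            if v > best_val:
                best, best_val = x, v
            x += step
        return best

    coarse = argmax_on_grid(lo, hi, d_coarse)
    return argmax_on_grid(max(lo, coarse - d_coarse),
                          min(hi, coarse + d_coarse), d_fine)

# Example with a synthetic single-peak spectrum at 42 degrees:
# a 10-degree first layer followed by a 1-degree second layer.
peak = 42.0
est = hierarchical_peak_search(lambda p: -abs(p - peak), -90, 90, 10, 1)
```

With a 180-degree range, a 10-degree first layer costs 19 evaluations and the 1-degree refinement about 21 more, versus 181 for a flat 1-degree scan, which is the source of the roughly halved processing times reported in FIG. 14.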
[0185]

In the first and second embodiments, an example has been described in which the layer number calculation unit 107 calculates both the number of layers and the search interval used for the search, but the present invention is not limited thereto. The number of layers may be selected in advance by the user of the apparatus, and only the search interval may be calculated.

[0186]

In the first and second embodiments, the FDLI method, the TDLI method, and the FTDLI method are used as techniques for interpolating the transfer function. However, the present invention is not limited to this, and other techniques may be used to interpolate the transfer function.

[0187]

A program for realizing the functions of the sound source direction estimation device 11 according to the present invention may be recorded on a computer-readable recording medium, and the sound source direction may be estimated by having a computer system read and execute the program recorded on the recording medium. Here, the "computer system" includes an OS and hardware such as peripheral devices, and also includes a WWW system provided with a homepage providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes a medium that holds the program for a certain period of time, such as the volatile memory (RAM) inside a computer system serving as a server or client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
[0188]

The program may be transmitted from a computer system in which it is stored in a storage device or the like to another computer system via a transmission medium, or by transmission waves in the transmission medium. Here, the transmission medium for transmitting the program is a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line like a telephone line. Further, the program may realize only a part of the functions described above, or may be a so-called difference file (difference program) that realizes the above-described functions in combination with a program already recorded in the computer system.

[0189]

DESCRIPTION OF SYMBOLS: 1 ... sound processing apparatus, 11 ... sound source direction estimation device, 12 ... sound source localization unit, 13 ... sound source separation unit, 14 ... sound feature quantity extraction unit, 15 ... speech recognition unit, 16 ... recognition result output unit, 101 ... speech input unit, 102 ... short-time Fourier transform unit, 103 ... first correlation matrix calculation unit, 104 ... noise database, 105 ... second correlation matrix calculation unit, 106 ... matrix calculation unit, 107 ... layer number calculation unit, 108 ... first spatial spectrum calculation unit, 109 ... second spatial spectrum calculation unit, 110 ... peak search unit, 111 ... STF processing unit, 112 ... transfer function storage unit, 113 ... output unit, S ... number of layers, dS ... search interval (granularity), A ... transfer function
