Patent Translate — Powered by EPO and Google. Notice: This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate, complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or financial decisions, should not be based on machine-translation output.

DESCRIPTION JP2018191255

Abstract: [PROBLEM] To provide a sound collection device and related method capable of extracting a target sound more accurately than the prior art using the MVDR method. [SOLUTION] The sound collection device computes a spatial correlation matrix for each frequency from N channels of frequency-domain microphone signals; from this matrix it estimates the intensities of the incoming waves from K directions and the noise power contained in each microphone signal; it determines an estimate kt of the arrival direction of the target sound; using the matrix A(f)^H formed from K vectors ak(f) and the N×N identity matrix, together with the matrix Vs(f,l) obtained from the diagonal matrix of intensity and noise-power estimates by setting all elements other than the (kt,kt) element to 0, it obtains estimates of the correlation matrices of the target sound and of the non-target sound; it then derives a filter coefficient vector from these correlation-matrix estimates, applies the filter coefficient vector to the microphone signals, and obtains the output signal. [Selected figure] Figure 2

Sound collecting device, method thereof and program

[0001] The present invention relates to a sound collection device that uses a beamforming technique to form a beam with a plurality of microphones, and to a corresponding method and program.
[0002] There is a growing need for technology that installs a plurality of microphones in a sound field, acquires multi-channel microphone signals, and from them extracts the desired voice or sound (hereinafter, the target sound) as clearly as possible while removing noise and other voices (hereinafter, non-target sound). To this end, beamforming techniques that form beams using a plurality of microphones have been extensively researched and developed in recent years.
[0003] In the beamforming technique, as shown in FIG. 1, a filtering unit 92-n applies a filter to each microphone signal yn(t) collected by the N microphones 91-n (n = 1, 2, ..., N), where t is a time index. An adder 93 then sums the output values of the filtering units 92-n, and the resulting sum is output as the output signal z(t) of the sound collection device. With this configuration, noise can be significantly reduced and the target sound extracted more clearly. The minimum variance distortionless response (MVDR) method is often used to obtain such a beamforming filter (see Non-Patent Document 1).
[0004] Habets, E., Benesty, J., Cohen, I., Gannot, S., Dmochowski, J., "New Insights Into the MVDR Beamformer in Room Acoustics", IEEE Transactions on Audio, Speech, and Language Processing, 18, 1, pp. 158-170, 2010.
[0005] To use the MVDR method, it is necessary to appropriately estimate both the correlation matrix of the sounds other than the target sound (the non-target sound) and the transfer characteristics from the target sound's source position to each microphone. However, components derived from the target sound and components derived from the non-target sound are mixed in the microphone signals, so the desired correlation matrix and transfer characteristics cannot be extracted directly.
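The filter-and-sum structure of [0003] can be sketched as follows. This is a minimal time-domain illustration; the function name and the toy pass-through filters are ours, not from the document — designing the actual filter coefficients is the subject of the invention.

```python
import numpy as np

def filter_and_sum(y, h):
    """Filter-and-sum beamformer as in FIG. 1 ([0003]).
    y: (N, T) array of N time-domain microphone signals yn(t).
    h: (N, L) array of per-channel FIR filter taps.
    Returns the summed output z(t) of length T."""
    N, T = y.shape
    z = np.zeros(T)
    for n in range(N):                       # filtering units 92-n
        z += np.convolve(y[n], h[n], mode="full")[:T]
    return z                                 # output of adder 93

# Toy usage: 3 microphones with pass-through (identity) filters,
# so z(t) reduces to the plain sum of the channels.
rng = np.random.default_rng(0)
y = rng.standard_normal((3, 16))
h = np.zeros((3, 4)); h[:, 0] = 1.0
z = filter_and_sum(y, h)
```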
Consequently, the MVDR method alone cannot clearly extract the target voice from the microphone signals.
[0006] It is therefore an object of the present invention to provide a sound collection device, and a corresponding method and program, that estimate each correlation matrix from microphone signals in which target sound and non-target sound are mixed, and extract the target sound using the MVDR method.
[0007] To solve the above problems, according to one aspect of the present invention, a sound collection device comprises: a noise/arrival-wave decomposition unit that, with N and K each an integer of 2 or more, n = 1, 2, ..., N and k = 1, 2, ..., K, computes a spatial correlation matrix R(f,l) for each frequency from the N channels of frequency-domain microphone signals Yn(f,l), and obtains from R(f,l) estimates pk(f,l) of the intensities of the incoming waves from the K directions and estimates qn(f,l) of the noise power contained in each microphone signal Yn(f,l); a target sound determination unit that obtains an estimate kt of the arrival direction of the target sound; a correlation matrix synthesis unit that, letting ak(f) be the vector of output signals of a microphone array of N microphones when a plane wave of amplitude 1 arrives from the k-th direction, uses the matrix A(f)^H = [a1(f) a2(f) ... aK(f) IN] formed from the K vectors ak(f) and the N×N identity matrix IN, together with the matrix Vs(f,l) obtained from the diagonal matrix V(f,l) — whose diagonal components are the intensity estimates pk(f,l) and the noise-power estimates qn(f,l) — by setting all elements other than the (kt,kt) element to 0, to obtain an estimate R^T(f,l) of the correlation matrix of the target sound and an estimate R^NT(f,l) of the correlation matrix of the non-target sound; and an array filtering unit that obtains a filter coefficient vector h(f,l) using the correlation-matrix estimates R^T(f,l) and R^NT(f,l), applies h(f,l) to the microphone signals Yn(f,l), and obtains an output signal z(f,l).
[0008] To solve the above problems, according to another aspect of the present invention, a sound collection method comprises: a noise/arrival-wave decomposition step of computing the spatial correlation matrix R(f,l) for each frequency from the N channels of frequency-domain microphone signals Yn(f,l) and obtaining the intensity estimates pk(f,l) and noise-power estimates qn(f,l); a target sound determination step of obtaining the arrival-direction estimate kt of the target sound; a correlation matrix synthesis step of obtaining the correlation-matrix estimates R^T(f,l) and R^NT(f,l) using the matrix A(f)^H = [a1(f) a2(f) ... aK(f) IN] and the matrix Vs(f,l) defined as above; and an array filtering step of obtaining the filter coefficient vector h(f,l) using R^T(f,l) and R^NT(f,l), applying h(f,l) to the microphone signals Yn(f,l), and obtaining the output signal z(f,l).
[0009] According to the present invention, the target sound can be extracted with higher accuracy than in the prior art using the MVDR method.
[0010] FIG. 1 is a functional block diagram of a sound collection apparatus according to the prior art. FIG. 2 is a functional block diagram of the sound collection device according to the first and second embodiments. FIG. 3 shows an example of the processing flow of the sound collection device according to the first and second embodiments.
[0011] Embodiments of the present invention are described below. In the drawings used in the following description, constituent parts having the same functions and steps performing the same processing are given the same reference numerals, and redundant description is omitted. Symbols such as ^ that should be written directly above the preceding character are, owing to the limitations of the text notation, written immediately after that character; in the formulas they appear in their original positions.
Unless otherwise noted, processing defined per element of a vector or matrix is applied to all elements of that vector or matrix.
[0012] First Embodiment. FIG. 2 shows a functional block diagram of the sound collection device 100 according to the first embodiment, and FIG. 3 shows its processing flow.
[0013] The sound collection device 100 of the present embodiment receives the output signals (microphone signals) yn(t) of a microphone array of N microphones 91-n, for example omnidirectional microphone elements. N is an integer of 2 or more, and n = 1, 2, ..., N. The device extracts an estimate R^NT(f,l) of the correlation matrix of the non-target sound from the N channels of microphone signals yn(t), extracts the target sound component by the MVDR method, and outputs the extracted signal as the output signal z(t).
[0014] The sound collection device 100 is implemented as a computer comprising a CPU, a RAM, and a ROM storing a program that executes the processing described below, and is functionally configured as follows.
[0015] The sound collection device 100 comprises a noise/arrival-wave decomposition unit 101, a target sound determination unit 103, a correlation matrix synthesis unit 104, an array filtering unit 105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.
[0016] <Fourier Transform Unit 107> The Fourier transform unit 107 receives the N channels of time-domain microphone signals yn(t), applies a short-time Fourier transform to obtain the frequency-domain microphone signals Yn(f,l) for each frame l (S107), and outputs them.
[0017] The transform results at frequency f and frame l are collected into the vector y(f,l) = [Y1(f,l) Y2(f,l) ... YN(f,l)]^T. (1)
[0018] This is handled as a vector.
The frequency-domain microphone signal vector y(f,l), consisting of the N channels Yn(f,l), is decomposed as y(f,l) = x(f,l) + v(f,l), where x(f,l) is the multi-channel signal consisting of the direct wave of the target sound and v(f,l) is the multi-channel signal consisting of its reflection and reverberation components together with noise.
[0019] <Noise/Arrival-Wave Decomposition Unit 101> The noise/arrival-wave decomposition unit 101 receives the frequency-domain microphone signal y(f,l) and computes its spatial correlation matrix R(f,l) at frequency f and frame l, for example by
R(f,l) = E[y(f,l) y(f,l)^H]. (2)
[0020] Here E[] denotes the expected value, and y(f,l)^H is the transpose and complex conjugate of y(f,l). In actual processing, a short-time average is usually used in place of E[]. The unit then obtains from the spatial correlation matrix R(f,l) the estimates pk(f,l) of the intensities of the incoming waves from the K directions and the estimates qn(f,l) of the noise power contained in each microphone signal Yn(f,l) (S101), and outputs the diagonal matrix V(f,l) having pk(f,l) and qn(f,l) as diagonal components. Here k is the index of the arrival direction: K directions are assumed as possible arrival directions of plane waves, and k = 1, 2, ..., K.
[0021] The diagonal matrix V(f,l) is therefore V(f,l) = diag(p1(f,l), ..., pK(f,l), q1(f,l), ..., qN(f,l)).
[0022] Note that K > N. The intensity estimates pk(f,l) and noise-power estimates qn(f,l) can be obtained, for example, by the method of Reference 1.
(Reference 1) P. Stoica, P. Babu, and J. Li, "SPICE: A sparse covariance-based estimation method for array processing", IEEE Transactions on Signal Processing, vol. 59, no. 2, 2011, pp. 629-638.
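As a rough illustration of [0019]–[0022], the sketch below estimates the spatial correlation matrix with a recursive short-time average (standing in for E[·] in eq. (2)) and decomposes it as R ≈ A^H V A with a plain nonnegative least-squares fit over the diagonal of V. Note this NNLS fit is a simplified stand-in for the weighted SPICE criterion of eq. (5), not Reference 1's actual algorithm, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import nnls

def spatial_correlation(Y, alpha=0.9):
    """Y: (L, N) STFT samples of one frequency bin over L frames.
    Recursive short-time average replacing E[.] in R = E[y y^H]."""
    L, N = Y.shape
    R = np.zeros((N, N), dtype=complex)
    for l in range(L):
        y = Y[l][:, None]
        R = alpha * R + (1 - alpha) * (y @ y.conj().T)
    return R

def decompose(R, A_H):
    """Fit R ~= A^H V A with V = diag(p1..pK, q1..qN) >= 0 (eq. (4)).
    A_H: (N, K+N) matrix [a1 ... aK  I_N].  Since
    A^H V A = sum_j v_j c_j c_j^H with c_j the j-th column of A^H,
    the fit is linear in the diagonal v; solve it by NNLS after
    stacking real and imaginary parts."""
    N, M = A_H.shape
    C = np.stack([np.outer(A_H[:, j], A_H[:, j].conj()).ravel()
                  for j in range(M)], axis=1)
    Cr = np.vstack([C.real, C.imag])
    br = np.concatenate([R.ravel().real, R.ravel().imag])
    v, _ = nnls(Cr, br)
    return v   # first K entries: pk(f,l); last N entries: qn(f,l)
```

The nonnegativity constraint matches the physical meaning of the diagonal entries (wave intensities and noise powers).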
[0023] This method assumes in advance K (> N) directions from which plane waves can arrive. When a plane wave of amplitude 1 arrives at the microphone array from the k-th direction at frequency f, let the response (output signal) of the microphones be ak(f) = [ak,1(f) ak,2(f) ... ak,N(f)]^T, where ak,n(f) is the response of the n-th microphone to the incoming plane wave of amplitude 1 from the k-th direction at frequency f. The vectors ak(f) are obtained in advance, prior to sound collection, either by experiment (actual measurement), by simulation, or as theoretical values by calculation. Using the matrix
A(f)^H = [a1(f) a2(f) ... aK(f) IN] (3)
formed from the K response vectors ak(f) and the N×N identity matrix IN, Reference 1 decomposes the matrix R(f,l) into the product
R(f,l) = A(f)^H V(f,l) A(f). (4)
This decomposition yields the estimates pk(f,l) of the intensity of the plane wave from the k-th direction and the estimates qn(f,l) of the noise power of the n-th microphone 91-n contained in the diagonal matrix V(f,l). In practice, the decomposition corresponds to finding the diagonal matrix V(f,l) that minimizes
|| (A(f)^H V(f,l) A(f))^(-1/2) (R(f,l) - A(f)^H V(f,l) A(f)) R(f,l)^(-1/2) ||^2, (5)
where ||x|| denotes the Frobenius norm of the matrix x.
[0024] <Target Sound Determination Unit 103> The target sound determination unit 103 obtains an estimate kt of the arrival direction of the target sound (S103) and outputs it. For example, it receives the diagonal matrix V(f,l) as input and examines the intensity estimates pk(f,l) for each arrival direction k contained in V(f,l).
The direction with the largest intensity is determined to be the arrival direction of the target sound (S103), and the determination result (the arrival-direction estimate) kt is output. In this example, the target sound determination unit 103 obtains kt using the intensity estimates pk(f,l) in the 100–500 Hz band, in which the voice power is concentrated.
[0025] In this band, the strength of each arrival direction k is given by
[0026] b(k,l) = Σ_{f=f0}^{f1} pk(f,l),
where f0 corresponds to 100 Hz and f1 to 500 Hz. The k that maximizes b(k,l) is determined to be the arrival direction kt of the target sound at frame l. The incoming wave from direction kt is regarded as the target sound, and the incoming waves from all other directions are regarded as non-target sound.
[0027] <Correlation Matrix Synthesis Unit 104> The correlation matrix synthesis unit 104 receives the diagonal matrix V(f,l) and the arrival-direction estimate kt as input. Using the matrix A(f)^H = [a1(f) a2(f) ... aK(f) IN] and the matrix Vs(f,l) — obtained from V(f,l) by setting all elements other than the (kt,kt) element to 0, i.e. keeping only the intensity estimate of the sound arriving from the estimated target-sound direction — it obtains an estimate R^T(f,l) of the correlation matrix of the target sound and an estimate R^NT(f,l) of the correlation matrix of the non-target sound (S104), and outputs them. As described above, ak(f) is obtained in advance, prior to sound collection.
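The band-limited direction selection of [0024]–[0026] can be sketched as follows, assuming the intensity estimates for one frame are available as an array indexed by frequency bin and direction (the function and argument names are ours):

```python
import numpy as np

def estimate_target_direction(P, freqs, f0=100.0, f1=500.0):
    """Pick the arrival direction kt of the target sound (S103).
    P: (F, K) intensity estimates pk(f, l) for one frame l.
    freqs: (F,) frequencies in Hz of the STFT bins.
    Sums the intensities over the f0..f1 Hz band to form b(k, l)
    and returns the direction index kt that maximizes it."""
    band = (freqs >= f0) & (freqs <= f1)
    b = P[band].sum(axis=0)      # b(k, l) for each direction k
    return int(np.argmax(b))
```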
[0028] For example, the correlation matrix synthesis unit 104 obtains the estimates by
R^T(f,l) = A(f)^H Vs(f,l) A(f),
R^NT(f,l) = A(f)^H (V(f,l) - Vs(f,l)) A(f). (7)
[0029] <Array Filtering Unit 105> The array filtering unit 105 receives the frequency-domain microphone signal y(f,l), the target-sound correlation-matrix estimate R^T(f,l), and the non-target-sound correlation-matrix estimate R^NT(f,l) as input.
[0030] First, a filter coefficient vector (an N-dimensional complex vector) h(f,l) is obtained using R^T(f,l) and R^NT(f,l). For example, the filter coefficient vector is determined by the MVDR method based on the estimate R^NT(f,l). The MVDR method obtains h(f,l) by solving the following constrained optimization problem:
[0031] minimize h^H(f,l) R^NT(f,l) h(f,l) subject to h^H(f,l) g^(d)(f) = g1^(d)(f). (8)
[0032] Here g^(d)(f) is the vector of frequency transfer characteristics of the direct path from the target sound's source position to each microphone, and g1^(d)(f) is the frequency transfer characteristic of the direct path from the target sound's source position to the reference microphone 91-1. In this example the reference is microphone 91-1, but any of the other microphones 91-2 to 91-N may serve as the reference.
[0033] This constraint condition can be rewritten using the source signal S(f,l) of the target sound and the signal component X1^(d)(f,l) that reaches microphone 91-1 directly from the target sound source.
Note that X1^(d)(f,l) = g1^(d)(f) S(f,l).
[0034] Multiplying the above constraint condition from the right by S(f,l) X1^(d)*(f,l), where the superscript * denotes the complex conjugate, and taking the expected value gives the rewritten constraint
h^H(f,l) E[x^(d)(f,l) X1^(d)*(f,l)] = E[X1^(d)(f,l) X1^(d)*(f,l)], (9)
where x^(d)(f,l) is the vector of signal components that reach each microphone directly from the target sound source.
[0035] The E[] on the left side of Expression (9) is the first column vector of the target-sound correlation-matrix estimate R^T(f,l), and the E[] on the right side is its (1,1) element. The source signal S(f,l) of the target sound and the frequency transfer characteristics g^(d)(f) from its source to each microphone are unknown, but the statistical procedure of taking the above expected values allows the coefficients of the new constraint condition to be obtained from the estimate R^T(f,l).
[0036] The array filtering unit 105 applies the obtained filter coefficient vector h(f,l) to the microphone signal y(f,l) to obtain and output the output signal z(f,l) (S105):
z(f,l) = h^H(f,l) y(f,l). (15)
With this configuration, the frequency-f component of the target sound can be extracted.
[0037] <Inverse Fourier Transform Unit 108> The inverse Fourier transform unit 108 receives the frequency-domain output signal z(f,l), applies a short-time inverse Fourier transform to the processing results at all frequencies (S108), and obtains and outputs the time-domain signal z(t).
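Putting [0027]–[0036] together, the sketch below synthesizes the two correlation-matrix estimates of eq. (7), solves the constrained minimization with the constraint coefficients taken from R^T as in eq. (9), and applies the filter as in eq. (15). The closed-form Lagrangian solution and the small diagonal loading are standard implementation choices assumed here, not prescribed by the document.

```python
import numpy as np

def synthesize_correlations(A_H, V, kt):
    """Eq. (7): R^T = A^H Vs A and R^NT = A^H (V - Vs) A, where Vs
    keeps only the (kt, kt) element of the diagonal matrix V."""
    Vs = np.zeros_like(V)
    Vs[kt, kt] = V[kt, kt]
    A = A_H.conj().T
    return A_H @ Vs @ A, A_H @ (V - Vs) @ A

def mvdr_filter(R_T, R_NT, ref=0, diag_load=1e-6):
    """Minimize h^H R^NT h subject to the constraint of eq. (9):
    r = first column of R^T (the left-hand E[.]),
    c = (ref, ref) element of R^T (the right-hand E[.]).
    diag_load adds slight regularization for numerical safety."""
    N = R_NT.shape[0]
    B = R_NT + diag_load * np.trace(R_NT).real / N * np.eye(N)
    r = R_T[:, ref]
    c = R_T[ref, ref].real
    Binv_r = np.linalg.solve(B, r)
    return c * Binv_r / (r.conj() @ Binv_r)   # h(f, l)

def apply_filter(h, y):
    return h.conj() @ y                       # z = h^H y, eq. (15)
```

Because the constraint is enforced exactly, a plane wave from the estimated target direction passes through the filter with the response of the reference microphone preserved.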
[0038] <Effects> With the above configuration, each correlation matrix can be estimated from microphone signals in which target sound and non-target sound are mixed, and the target sound can be extracted using the MVDR method.
[0039] <Modified Example> In the present embodiment, the noise/arrival-wave decomposition unit 101 outputs the diagonal matrix V(f,l) to the target sound determination unit 103, but it may instead output only the intensity estimates pk(f,l) contained in V(f,l); the target sound determination unit 103 only needs to be able to determine the arrival direction of the target sound.
[0040] In this embodiment, the sound collection device 100 includes the Fourier transform unit 107 and the inverse Fourier transform unit 108. These may, however, be separate devices, in which case the sound collection device 100 may receive the frequency-domain microphone signal y(f,l) as input, or output the frequency-domain output signal z(f,l).
[0041] Second Embodiment. The description focuses on the parts that differ from the first embodiment.
[0042] FIG. 2 shows a functional block diagram of the sound collection device 200 according to the second embodiment, and FIG. 3 shows its processing flow.
[0043] The sound collection device 200 comprises a noise/arrival-wave decomposition unit 101, an intensity correction unit 202 (indicated by a broken line in FIG. 2), a target sound determination unit 103, a correlation matrix synthesis unit 204, an array filtering unit 105, a Fourier transform unit 107, and an inverse Fourier transform unit 108.
[0044] <Intensity Correction Unit 202> The intensity correction unit 202 receives the diagonal matrix V(f,l) as input and obtains a correction coefficient β(f,l) (S202, shown by broken lines in FIG. 3) from the total signal power of the spatial correlation matrix R(f,l) and the total signal power obtained from the matrix A(f)^H = [a1(f) a2(f) ... aK(f) IN] and the diagonal matrix V(f,l), and outputs it.
[0045] For example, the intensity correction unit 202 obtains the correction coefficient β(f,l) by
[0046] β(f,l) = tr(R(f,l)) / tr(A(f)^H V(f,l) A(f)),
[0047] where tr() denotes the trace of a matrix. The spatial correlation matrix R(f,l) may be the one computed by the noise/arrival-wave decomposition unit 101, and the matrix A(f)^H may be the one obtained in advance, prior to sound collection, by the method described in the first embodiment.
[0048] <Correlation Matrix Synthesis Unit 204> The correlation matrix synthesis unit 204 receives the spatial correlation matrix R(f,l), the correction coefficient β(f,l), the diagonal matrix V(f,l), and the arrival-direction estimate kt as input. Using R(f,l), β(f,l), the matrix A(f)^H = [a1(f) a2(f) ... aK(f) IN] obtained in advance prior to sound collection, and the matrix Vs(f,l) in which all elements of the diagonal matrix V(f,l) other than the (kt,kt) element are set to 0, it obtains an estimate R^NT(f,l) of the correlation matrix (S204) and outputs it. For example, the estimate is obtained by
R^NT(f,l) = R(f,l) - β(f,l) A(f)^H Vs(f,l) A(f).
The target-sound correlation-matrix estimate R^T(f,l) can be obtained by the same method as in the first embodiment.
[0049] <Effects> This configuration provides the same effects as the first embodiment. Furthermore, by using the correction coefficient, the correlation matrix of the non-target sound can be determined more accurately.
[0050] <Other Modifications> The present invention is not limited to the above embodiments and modifications.
For example, the various processes described above may be executed not only in chronological order according to the description but also in parallel or individually, depending on the processing capability of the apparatus executing them or as needed. Other changes can be made as appropriate without departing from the spirit of the present invention.
[0051] <Program and Recording Medium> The various processing functions of each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program, and executing this program on a computer realizes those processing functions on the computer.
[0052] The program describing this processing content can be recorded on a computer-readable recording medium, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.
[0053] The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which it is recorded. The program may also be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers via a network.
[0054] A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage unit. When executing the processing, the computer reads the program from its storage unit and executes processing according to the read program.
Alternatively, the computer may read the program directly from the portable recording medium and execute processing according to it, or may sequentially execute processing according to the program each time it is transferred from the server computer. The above processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the processing functions are realized only through execution instructions and acquisition of results, without transferring the program from the server computer to the computer. The program here includes information provided for processing by a computer that conforms to a program (such as data that is not a direct command to the computer but has properties that define the computer's processing).
[0055] Although each device is configured by executing a predetermined program on a computer, at least a part of the processing content may be realized in hardware.
