Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2006254226
PROBLEM TO BE SOLVED: To provide an acoustic signal processing apparatus, an acoustic signal processing method, an acoustic signal processing program, and a recording medium on which the program is recorded, for sound source localization and sound source separation, capable of handling more sound sources than the number of microphones while relaxing restrictions on the sound sources.
SOLUTION: The two amplitude data of microphones 1a and 1b input to an acoustic signal input unit 2 are analyzed by a frequency decomposition unit 3, and a two-dimensional data conversion unit 4 obtains the phase difference between the two for each frequency and converts the phase differences for each frequency into two-dimensional data given two-dimensional coordinate values. A figure detection unit 5 analyzes the generated two-dimensional data on the XY plane to detect figures. A sound source information generation unit 6 processes the information of the detected figures and generates sound source information including the number of sound sources that are the generation sources of the acoustic signals, the spatial existence range of each sound source, the temporal existence period of the sound emitted by each sound source, the component configuration of each sound source sound, the separated voice of each sound source, and the symbolic content of each sound source sound. [Selected figure] Figure 1
Acoustic signal processing apparatus, acoustic signal processing method, acoustic signal
processing program, and computer readable recording medium recording acoustic signal
processing program
[0001]
The present invention relates to acoustic signal processing, and more particularly to estimation
of the number of sources of sound waves propagating in a medium, the direction of each source,
the frequency component of sound waves coming from each source, and the like.
[0002]
In recent years, in the field of auditory research for robots, methods have been proposed that estimate the number and directions of multiple target sound sources in a noisy environment (sound source localization) and separate and extract each sound source (sound source separation).
[0003]
For example, Non-Patent Document 1 below describes a method in which N source sounds are observed with M microphones in an environment with background noise, a spatial correlation matrix is generated from data obtained by short-time Fourier transform (FFT) processing of each microphone output, and the number N of sound sources is estimated as the number of principal eigenvalues obtained by eigenvalue decomposition of this matrix.
This works because directional signals such as the source sounds are mapped onto the principal eigenvalues, whereas non-directional background noise is mapped onto all eigenvalues.
The eigenvectors corresponding to the principal eigenvalues are the basis vectors of the signal subspace spanned by the signals from the sound sources, and the eigenvectors corresponding to the remaining eigenvalues are the basis vectors of the noise subspace spanned by the background noise signal. By applying the MUSIC method using the basis vectors of this noise subspace, the position vector of each sound source can be searched for, and the voice from each sound source can be extracted with a beamformer given directivity in the direction obtained as a result of the search. However, when the number N of sound sources equals the number M of microphones, the noise subspace cannot be defined, and when N exceeds M, some sound sources cannot be detected. Therefore, the number of sound sources that can be estimated is never more than the number M of microphones. Although this method places no particular restriction on the sound sources and is mathematically clean, it has the limitation that handling a large number of sound sources requires an even larger number of microphones.
[0004]
Further, Non-Patent Document 2 below describes a method of performing sound source localization and sound source separation using a pair of microphones. This method focuses on the harmonic structure (a frequency structure consisting of a fundamental frequency and its harmonics) specific to voices generated through a tube (articulatory organ), such as the human voice. By detecting harmonic structures with different fundamental frequencies from the Fourier-transformed data of the voice signals captured by the microphones, the number of detected harmonic structures is taken as the number of speakers; the interaural phase difference (IPD) and interaural intensity difference (IID) of each harmonic structure are used to estimate its direction, and the harmonic structure itself is used to estimate each source sound. Because multiple harmonic structures can be detected from the Fourier transform data, this method can handle more sound sources than the number of microphones. However, since the estimation of the number and directions of sound sources and the estimation of the source sounds are both based on the harmonic structure, the sound sources that can be handled are limited to those having a harmonic structure, such as the human voice, and a wider variety of sounds cannot be dealt with. Asano, "Separating Sound", Measurement and Control, Vol. 43, No. 4, pp. 325-330, April 2004. Nakajima Kazuhiro et al., "Real-time active person tracking by hierarchical integration of audiovisual information", AI Challenge Study Group of the Japanese Society for Artificial Intelligence, SIG-Challenge-0113-5, pp. 35-42, June 2001.
[0005]
As described above, there is a trade-off: (1) if no restriction is imposed on the sound sources, the number of sound sources cannot exceed the number of microphones, and (2) if more sound sources than microphones are to be handled, restrictions such as assuming a harmonic structure must be imposed on the sound sources. A method capable of handling more sound sources than microphones without constraining the sound sources has not been established.
[0006]
The present invention has been made in view of the above problems, and provides an acoustic signal processing apparatus, an acoustic signal processing method, an acoustic signal processing program, and a computer readable recording medium storing the acoustic signal processing program, for sound source localization and sound source separation, which can relax the restrictions on the sound sources and can handle more sound sources than the number of microphones.
[0007]
An acoustic signal processing apparatus according to an aspect of the present invention comprises: acoustic signal input means for inputting a plurality of acoustic signals captured at two or more spatially non-identical points; frequency decomposition means for decomposing each of the plurality of acoustic signals to obtain a plurality of frequency-decomposed data sets representing a phase value for each frequency; phase difference calculation means for calculating, between different sets of the plurality of frequency-decomposed data sets, a phase difference value for each frequency; two-dimensional data generation means for generating, for each frequency, two-dimensional point group data having coordinate values on a two-dimensional coordinate system whose first axis is a function of frequency and whose second axis is a function of the phase difference value calculated by the phase difference calculation means; figure detection means for detecting, from the two-dimensional data, a figure reflecting the proportional relationship between frequency and phase difference for components derived from the same sound source; sound source information generation means for generating, based on the figure, sound source information including at least one of the number of sound sources corresponding to the generation sources of the acoustic signals, the spatial existence range of each sound source, the temporal existence period of the sound emitted by each sound source, the component configuration of the sound emitted by each sound source, the separated sound separated for each sound source, and the symbolic content of the voice emitted by each sound source; and output means for outputting the sound source information.
[0008]
According to the present invention, there are provided an acoustic signal processing apparatus, an acoustic signal processing method, an acoustic signal processing program, and a computer readable recording medium recording the acoustic signal processing program, for sound source localization and sound source separation, capable of relaxing the restrictions on the sound sources and handling more sound sources than the number of microphones.
[0009]
Hereinafter, an embodiment of an acoustic signal processing apparatus according to the present
invention will be described with reference to the drawings.
[0010]
FIG. 1 is a functional block diagram of an acoustic signal processing apparatus according to an
embodiment of the present invention.
The acoustic signal processing apparatus includes a microphone 1a, a microphone 1b, an acoustic signal input unit 2, a frequency decomposition unit 3, a two-dimensional data conversion unit 4, a figure detection unit 5, a sound source information generation unit 6, an output unit 7, and a user interface unit 8.
[0011]
[Basic concept of sound source estimation based on the phase difference of each frequency component] The microphones 1a and 1b are two microphones disposed a predetermined distance apart in a medium such as air, and are means for converting the vibrations of the medium (sound waves) at two different points into electric signals (acoustic signals).
Hereinafter, when the microphone 1a and the microphone 1b are handled collectively, they are referred to as a microphone pair.
[0012]
The acoustic signal input unit 2 is a means for periodically A/D converting the two acoustic signals from the microphone 1a and the microphone 1b at a predetermined sampling frequency Fr, thereby digitizing them and generating amplitude data in time series.
[0013]
Assuming that the sound source is positioned sufficiently far away compared to the distance between the microphones, as shown in FIG. 2A, the wave front 101 of the sound wave originating from the sound source 100 and reaching the microphone pair is substantially planar.
When this plane wave is observed at two different points by the microphone 1a and the microphone 1b, a predetermined arrival time difference ΔT, determined by the direction R of the sound source 100 with respect to the line segment 102 connecting the microphone 1a and the microphone 1b (referred to as the baseline), should be observed between the acoustic signals converted by the microphone pair.
When the sound source is sufficiently far away, this arrival time difference ΔT is 0 when the sound source 100 lies on the plane perpendicular to the baseline 102, and this direction is defined as the front direction of the microphone pair.
[0014]
Reference 1 (Suzuki et al., "Realization of the 'call-behind' function of home robots through audio-visual cooperation", Proceedings of the 4th SICE System Integration Division Conference (SI 2003), 2F4-5, 2003) describes a method of deriving the arrival time difference ΔT between the two acoustic signals (103 and 104 in FIG. 2(b)) by pattern matching, that is, by searching for which part of one amplitude data is similar to which part of the other.
However, while this method is effective when there is only one strong sound source, when strong background noise or multiple sound sources are present, clearly similar parts may not appear in the mixed waveform of strong sounds from multiple directions, and the pattern matching may fail.
[0015]
Therefore, in the present invention, the input amplitude data is decomposed into frequency components and analyzed in terms of the phase difference of each component.
When a plurality of sound sources are present, a phase difference corresponding to each sound source direction is observed between the two data for the frequency components of the respective sound sources. Thus, if the phase differences of the individual frequency components can be divided into groups of the same direction without assuming strong constraints on the sound sources, it should be possible, for a wider variety of sources, to grasp how many sound sources there are, in which direction each lies, and what characteristic frequency components each emits. While this principle itself is quite straightforward, there are several challenges to overcome when analyzing actual data. The functional blocks that perform this grouping (the frequency decomposition unit 3, the two-dimensional data conversion unit 4, and the figure detection unit 5) will be described below together with these problems.
[0016]
(Frequency Decomposition Unit 3) The fast Fourier transform (FFT) is a general method of decomposing amplitude data into frequency components; the Cooley-Tukey FFT algorithm is known as a representative algorithm.
[0017]
As shown in FIG. 3, the frequency decomposition unit 3 extracts N consecutive samples of amplitude data as a frame (T-th frame 111) from the amplitude data 110 supplied by the acoustic signal input unit 2 and performs a fast Fourier transform on it, repeating the extraction while shifting the extraction position by the frame shift amount 113 ((T+1)-th frame 112).
[0018]
The amplitude data constituting the frame is subjected to windowing 120 as shown in FIG. 4A, and then to the fast Fourier transform 121.
As a result, short-time Fourier transform data of the input frame is generated in the real part buffer R[N] and the imaginary part buffer I[N] (122). An example of the windowing function (a Hamming or Hanning window) 124 is shown in FIG. 4(b).
[0019]
The short-time Fourier transform data generated here is data obtained by decomposing the amplitude data of the frame into N/2 frequency components. For the k-th frequency component fk, the values of the real part R[k] and the imaginary part I[k] in the buffer 122 represent a point Pk on the complex coordinate system 123, as shown in FIG. 4(c). The square of the distance of Pk from the origin O is the power Po(fk) of the frequency component, and the signed rotation angle θ {θ: −π < θ ≤ π [radian]} of Pk from the real axis is the phase Ph(fk) of the frequency component.
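To make the relationship between the FFT buffers and the power and phase values concrete, the following is a minimal sketch in Python with numpy; the function name stft_frame and its parameters are illustrative, not from the patent.

```python
import numpy as np

def stft_frame(frame, window="hanning"):
    """Power Po(fk) and phase Ph(fk) of one windowed frame (cf. FIG. 4)."""
    N = len(frame)
    win = np.hanning(N) if window == "hanning" else np.hamming(N)
    spec = np.fft.rfft(frame * win)[: N // 2]   # bins k = 0 .. N/2 - 1
    power = spec.real ** 2 + spec.imag ** 2     # Po(fk) = R[k]^2 + I[k]^2
    phase = np.arctan2(spec.imag, spec.real)    # Ph(fk), within (-pi, pi]
    return power, phase
```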
[0020]
When the sampling frequency is Fr [Hz] and the frame length is N [samples], k takes integer values from 0 to (N/2)−1; k = 0 corresponds to 0 [Hz] (the direct current component) and k = (N/2)−1 to Fr/2 [Hz] (the highest frequency component). This interval is divided equally with the frequency resolution Δf = (Fr/2)/((N/2)−1) [Hz], so that the frequency at each k is given by fk = k·Δf.
[0021]
Note that, as described above, the frequency decomposition unit 3 performs this processing continuously at a predetermined interval (frame shift amount Fs), thereby generating, in time series, frequency-decomposed data sets consisting of a power value and a phase value for each frequency of the input amplitude data.
[0022]
(Two-Dimensional Data Conversion Unit 4 and Figure Detection Unit 5) As shown in FIG. 5, the two-dimensional data conversion unit 4 includes a phase difference calculation unit 301 and a coordinate value determination unit 302.
The figure detection unit 5 includes a voting unit 303 and a straight line detection unit 304.
[0023]
[Phase difference calculation unit 301] The phase difference calculation unit 301 is a means for comparing the two frequency-decomposed data sets a and b obtained at the same time by the frequency decomposition unit 3, and generating a-b phase difference data by calculating, for each identical frequency component, the difference between the phase values of the two.
For example, as shown in FIG. 6, the phase difference ΔPh(fk) of a certain frequency component fk is calculated as the difference between the phase value Ph1(fk) of the microphone 1a and the phase value Ph2(fk) of the microphone 1b, reduced modulo 2π so as to fall within {ΔPh(fk): −π < ΔPh(fk) ≤ π}.
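The modulo-2π reduction described here can be sketched as follows; phase_difference is a hypothetical helper, and the wrapping formula simply maps any difference into (−π, π].

```python
import numpy as np

def phase_difference(ph1, ph2):
    """a-b phase difference per frequency bin, wrapped into (-pi, pi].

    ph1, ph2: phase arrays Ph1(fk), Ph2(fk) from the two microphones.
    """
    dph = ph1 - ph2
    # reduce modulo 2*pi into (-pi, pi]
    return np.pi - np.mod(np.pi - dph, 2.0 * np.pi)
```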
[0024]
[Coordinate Value Determination Unit 302] The coordinate value determination unit 302 is a means for determining, based on the phase difference data obtained by the phase difference calculation unit 301, the coordinate values with which each phase difference is treated as a point on a predetermined two-dimensional XY coordinate system. The X coordinate value x(fk) and the Y coordinate value y(fk) corresponding to the phase difference ΔPh(fk) of a certain frequency component fk are determined by the equations shown in FIG. 7: the X coordinate value is the phase difference ΔPh(fk), and the Y coordinate value is the frequency component number k.
[0025]
[Proportionality of the phase difference to frequency for the same time difference] The phase difference of each frequency component calculated by the phase difference calculation unit 301 as shown in FIG. 6 should, for components derived from the same sound source (the same direction), represent the same arrival time difference. However, the phase value of a frequency obtained by FFT, and hence the phase difference between the microphones, is a value calculated with the period of that frequency taken as 2π. Attention is therefore focused on the proportional relationship that, for the same time difference, doubling the frequency also doubles the phase difference. This is shown in FIG. 8. As shown in FIG. 8A, the same time T contains half a period of the wave 130 of frequency fk [Hz], that is, a phase interval of only π, whereas it contains one full period of the wave 131 of the double frequency 2fk [Hz], that is, a phase interval of 2π. The same applies to the phase difference: for the same time difference ΔT, the phase difference increases in proportion to the frequency. This proportional relationship between phase difference and frequency is shown in FIG. 8(b). When the phase differences of the frequency components generated from the same sound source, sharing a common ΔT, are plotted on the two-dimensional coordinate system by the coordinate value calculation shown in FIG. 7, the coordinate points 132 representing the phase differences of the frequency components line up on a straight line 133. The larger ΔT is, that is, the more the distances from the sound source to the two microphones differ, the larger the slope of this straight line.
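As an illustration of this proportionality (and of the circulation discussed next), the following sketch generates the ideal (x, y) point cloud for a single far source with a common arrival time difference ΔT; all names and parameter values are assumptions for the example, not the patent's.

```python
import numpy as np

def ideal_point_cloud(delta_t, Fr=16000, N=512):
    """Ideal (x, y) points for one far source with arrival time
    difference delta_t [s]; Fr and N are illustrative parameters.
    """
    k = np.arange(1, N // 2)                  # frequency bin numbers
    df = (Fr / 2) / (N / 2 - 1)               # frequency resolution [Hz]
    fk = k * df
    true_dph = 2 * np.pi * fk * delta_t       # proportional to frequency
    x = np.pi - np.mod(np.pi - true_dph, 2 * np.pi)  # wrapped to (-pi, pi]
    y = k
    return x, y   # points lie on a line (with circulation at high fk)
```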
[0026]
[Cyclicity of the phase difference] However, the phase difference between the microphones is proportional to frequency over the entire range from the lowest analyzed frequency to the highest, with the true phase difference never deviating from ±π as shown in FIG. 8B, only under a certain condition. That condition is that ΔT does not become equal to or longer than half a period of the highest frequency (half the sampling frequency) Fr/2 [Hz], that is, 1/Fr [seconds]. If ΔT is 1/Fr or more, it must be taken into consideration that the phase difference can only be obtained as a cyclic value, as described below.
[0027]
The phase value of each frequency component can only be obtained as the rotation angle θ shown in FIG. 4, with a width of 2π (in the present embodiment, between −π and π). This means that even if the actual phase difference of a frequency component between the microphones is one cycle or more, this cannot be known from the phase values obtained by the frequency decomposition. Therefore, in the present embodiment, the phase difference is obtained between −π and π as shown in FIG. 6, but the true phase difference due to ΔT may be the value obtained here plus or minus 2π, or further plus or minus 4π, 6π, and so on. This is shown schematically in FIG. 9. In FIG. 9, when the phase difference ΔPh(fk) of the frequency fk is +π, as represented by the black circle 140, the phase difference of the next higher frequency fk+1 exceeds +π, as represented by the white circle 141. The calculated phase difference ΔPh(fk+1), however, is the original phase difference minus 2π, a value slightly larger than −π, as represented by the black circle 142. Although not shown in the figure, the same value appears at three times that frequency, but there it is the actual phase difference minus 4π. Thus, as the frequency increases, the calculated phase difference circulates between −π and π as a residue system of 2π. As this example shows, when ΔT is large, the true phase difference represented by the white circles circulates to the opposite side, as indicated by the black circles, from a certain frequency fk+1 onward.
[0028]
[Phase difference when plural sound sources exist] On the other hand, when sound waves are emitted from a plurality of sound sources, a plot of frequency against phase difference looks as schematically shown in FIG. 10. The figure shows the case where two sound sources are present in different directions with respect to the microphone pair: in FIG. 10(a) the two sound sources do not contain the same frequency components, while in FIG. 10(b) some frequency components are contained in both. In FIG. 10A, the phase difference of each frequency component lies on one of the straight lines, each sharing a common ΔT: five points are arranged on the straight line 150 with a small slope, and six points on the straight line 151 with a large slope (including its circulated straight line 152). In FIG. 10(b), the waves are mixed in the two frequency components 153 and 154 contained in both sources and the phase difference is not output correctly, so that only three points lie on the straight line 155 with the small slope.
[0029]
The problem of estimating the number and directions of sound sources can thus be reduced to finding straight lines on such a plot. Further, the problem of estimating the frequency components of each sound source can be reduced to selecting the frequency components located close to each detected straight line. In the present embodiment, the two-dimensional data output from the two-dimensional data conversion unit 4 is the point group determined as a function of frequency and phase difference from two of the frequency-decomposed data sets generated by the frequency decomposition unit 3, or an image in which this point group is arranged (plotted) on a two-dimensional coordinate system. Note that since this two-dimensional data is defined by two axes not including the time axis, three-dimensional data can also be defined as a time series of the two-dimensional data. The figure detection unit 5 detects a linear arrangement as a figure from the point group arrangement given as this two-dimensional data (or the three-dimensional time-series data).
[0030]
[Voting unit 303] The voting unit 303 is a means for applying the linear Hough transform, described later, to each frequency component given (x, y) coordinates by the coordinate value determination unit 302, and voting its locus into a Hough voting space in a predetermined manner. The Hough transform is described on pages 100 to 102 of Reference 2 (Akira Okazaki, "First Image Processing", Industrial Survey Committee, published October 20, 2000), but it is explained here again.
[0031]
[Linear Hough transform] As schematically shown in FIG. 11, innumerable straight lines can pass through a point p(x, y) in two-dimensional coordinates, as exemplified by 160, 161, and 162. However, letting θ be the inclination from the X axis of the perpendicular 163 drawn from the origin O to each straight line, and ρ be the length of the perpendicular 163, θ and ρ are uniquely determined for one straight line. It is known that the set of possible (θ, ρ) values of the straight lines passing through a certain point (x, y) draws a locus 164 (ρ = x cos θ + y sin θ) on the θρ coordinate system specific to the value of (x, y). Such conversion from an (x, y) coordinate value to the (θ, ρ) locus of the straight lines passing through (x, y) is called the linear Hough transform. Note that θ is positive when the line inclines to the left, 0 when it is vertical, and negative when it inclines to the right, and the domain of θ does not depart from {θ: −π < θ ≤ π}.
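In code, the locus of a single point is a one-line function; the sampled θ axis is an assumed input.

```python
import numpy as np

def hough_locus(x, y, thetas):
    """Locus rho(theta) = x*cos(theta) + y*sin(theta) of all straight
    lines through the point (x, y), evaluated on a sampled theta axis.
    """
    return x * np.cos(thetas) + y * np.sin(thetas)
```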
[0032]
A Hough curve can be obtained independently for each point on the XY coordinate system, but as shown in FIG. 12, a straight line 170 passing commonly through three points p1, p2, and p3, for example, can be determined as the straight line defined by the coordinates (θ0, ρ0) of the point 174 where the loci 171, 172, and 173 corresponding to p1, p2, and p3 intersect. The more points a straight line passes through, the more loci pass through the (θ, ρ) position representing that straight line. The Hough transform is thus suitable for detecting straight lines from point clouds.
[0033]
[Hough voting] An engineering method called Hough voting is used to detect straight lines from a point group. In this method, the (θ, ρ) pairs through which each locus passes are voted into a two-dimensional Hough voting space with θ and ρ as coordinate axes, so that positions in the voting space with a large number of votes, through which many loci pass, indicate the existence of straight lines. In general, a two-dimensional array (the Hough voting space) of a size corresponding to the required search range of θ and ρ is first prepared and initialized to zero. Next, the locus of each point is obtained by the Hough transform, and the values of the array cells through which the locus passes are each incremented by one. This is called a Hough vote. After voting the loci of all points, no straight line exists at positions with 0 votes (no locus passed), a straight line passing through one point exists at positions with 1 vote (only one locus passed), a straight line passing through two points exists at positions with 2 votes, and a straight line passing through n points exists at positions with n votes. If the resolution of the Hough voting space could be made infinite, only the crossing points of loci would collect as many votes as the number of loci passing there; but since the actual Hough voting space is quantized in θ and ρ with an appropriate resolution, a high vote distribution also arises around the positions where multiple loci intersect. It is therefore necessary to determine the crossing positions of the loci more accurately by searching the vote distribution of the Hough voting space for positions having local maximum values.
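The voting procedure might be sketched as follows, adding a fixed value of 1 per cell as in the general method just described; the array sizes and the ρ quantization are illustrative choices, not the patent's.

```python
import numpy as np

def hough_vote(points, n_theta=180, rho_max=100.0, n_rho=200):
    """Accumulate Hough votes for an iterable of (x, y) points."""
    thetas = np.linspace(-np.pi / 2, np.pi / 2, n_theta)
    space = np.zeros((n_theta, n_rho), dtype=float)
    for x, y in points:
        rho = x * np.cos(thetas) + y * np.sin(thetas)   # locus of (x, y)
        idx = np.round((rho + rho_max) / (2 * rho_max) * (n_rho - 1)).astype(int)
        ok = (idx >= 0) & (idx < n_rho)                 # keep in-range cells
        space[np.nonzero(ok)[0], idx[ok]] += 1          # one vote per cell
    return thetas, space
```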
[0034]
The voting unit 303 performs Hough voting only on frequency components that satisfy all of the following voting conditions. Under these conditions, only frequency components having a power equal to or higher than a predetermined threshold within a predetermined frequency band are voted.
[0035]
That is, voting condition 1 is that the frequency is within a predetermined range (low-band cut and high-band cut). Voting condition 2 is that the power P(fk) of the frequency component fk is equal to or higher than a predetermined threshold.
[0036]
Voting condition 1 is generally used to cut the low frequency range, where background noise rides on the signal, or the high frequency range, where FFT accuracy is poor. The ranges of the low-band and high-band cuts can be adjusted according to the operation. To use the widest frequency band, it is suitable to cut only the DC component at the low end and only the maximum frequency at the high end.
[0037]
The FFT result is considered unreliable for very weak frequency components whose power is as low as the background noise. Voting condition 2 is used to prevent such unreliable frequency components from participating in the voting by thresholding their power. Given the power value Po1(fk) of the microphone 1a and the power value Po2(fk) of the microphone 1b, the following three definitions of the evaluated power P(fk) can be considered; which one is used can be set according to the operation.
[0038]
(Average value): The average of Po1(fk) and Po2(fk). This condition requires that both powers together be reasonably strong.
[0039]
(Minimum value): The smaller of Po1(fk) and Po2(fk). This condition requires that both powers be at least the threshold.
[0040]
(Maximum value): The larger of Po1(fk) and Po2(fk). This condition lets the component vote if either power is strong enough, even when the other is below the threshold.
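The three definitions can be condensed into one helper; the function and mode names are illustrative.

```python
import numpy as np

def evaluated_power(po1, po2, mode="average"):
    """P(fk) used by voting condition 2 (paragraphs on the three definitions)."""
    if mode == "average":
        return (po1 + po2) / 2.0      # both should be reasonably strong
    if mode == "minimum":
        return np.minimum(po1, po2)   # both must reach the threshold
    return np.maximum(po1, po2)       # either one strong enough suffices
```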
[0041]
In addition, the voting unit 303 can use either of the following two addition methods when voting.
[0042]
That is, in addition method 1, a predetermined fixed value (for example, 1) is added at each passage position of the locus.
In addition method 2, a function value of the power P(fk) of the frequency component fk is added at each passage position of the locus.
[0043]
Addition method 1 is the method generally used in straight line detection by the Hough transform. Since it ranks votes in proportion to the number of points a line passes through, it is suitable for preferentially detecting straight lines (i.e., sound sources) containing many frequency components. Because no harmonic structure (equal spacing of the contained frequencies) is required of the frequency components on a straight line, a wider variety of sound sources can be detected, not limited to the human voice.
[0044]
Addition method 2 is a method that yields higher local maximum values for lines containing frequency components with large power, even if the number of points passed through is small; it is suitable for detecting straight lines (i.e., sound sources) having a few powerful components. The function value of the power P(fk) in addition method 2 is calculated as G(P(fk)). FIG. 13 shows the formula for G(P(fk)) when P(fk) is the average of Po1(fk) and Po2(fk); as with voting condition 2 described above, P(fk) can also be set to the minimum or the maximum of Po1(fk) and Po2(fk). The value of the intermediate parameter V is calculated by adding a predetermined offset α to the logarithm log10(P(fk)) of P(fk). When V is positive, the function G(P(fk)) takes the value V + 1; when V is zero or less, it takes the value 1. By always voting at least 1 in this way, not only do straight lines (sound sources) containing frequency components with large power rise to the top, but the majority property of addition method 1, by which straight lines (sound sources) containing many frequency components rise to the top, is also retained. The voting unit 303 can use either addition method 1 or addition method 2 depending on the setting, but by using the latter in particular it becomes possible to also detect sound sources with few frequency components, and hence a wider variety of sound sources.
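A sketch of the weight G(P(fk)) following the textual description of FIG. 13 (the figure's exact formula is not reproduced here): V = log10(P) + α, and G = V + 1 for positive V, else 1.

```python
import numpy as np

def vote_weight(p, alpha=0.0):
    """Addition method 2: weight G(P(fk)) added at each locus cell.

    Assumes p > 0 (components already passed voting condition 2);
    `alpha` is the predetermined offset.
    """
    v = np.log10(p) + alpha
    return np.where(v > 0, v + 1.0, 1.0)   # every component votes at least 1
```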
[0045]
Further, the voting unit 303 may vote for each FFT individually, but in general it votes m consecutive time-series FFT results (m ≥ 1) together. Although the frequency components of a sound source fluctuate over the long term, voting in this way uses more data, obtained from multiple short-term FFT results over which the frequency components are stable, and thus yields a more reliable Hough voting result. This m can be set as a parameter according to the operation.
[0046]
(Straight Line Detection Unit 304) The straight line detection unit 304 is a means for analyzing the vote distribution in the Hough voting space generated by the voting unit 303 to detect dominant straight lines. At this time, by taking into account circumstances peculiar to this problem, such as the cyclicity of the phase difference described with FIG. 9, more accurate straight line detection is realized.
[0047]
FIG. 14 shows the power spectra of the frequency components, the phase difference plot of each frequency component obtained from five consecutive FFT results (m = 5), and the Hough voting result (vote distribution) obtained from the same five FFT results, when processing an actual voice uttered from about 20 degrees to the left of the front of the microphone pair in an indoor noise environment. The processing up to this point is performed by the series of functional blocks from the acoustic signal input unit 2 to the voting unit 303.
[0048]
The amplitude data acquired by the microphone pair is converted by the frequency decomposition unit 3 into power and phase values for each frequency component. In FIG. 14, reference numerals 180 and 181 plot the logarithm of the power value of each frequency component as luminance (darker means larger), with time on the horizontal axis. One vertical line corresponds to one FFT result, graphed as time passes (rightward). The upper stage 180 is the result of processing the signal from the microphone 1a and the lower stage 181 that from the microphone 1b; a large number of frequency components are detected. From the frequency decomposition results, the phase difference calculation unit 301 obtains the phase difference of each frequency component, and the coordinate value determination unit 302 calculates its (x, y) coordinate value. In FIG. 14, reference numeral 182 denotes the plot of the phase differences obtained by five consecutive FFTs from a certain time 183. In this figure, a point cloud distribution along a straight line 184 inclined to the left from the origin is observed, but the distribution does not lie cleanly on the straight line 184, and many points away from it exist. The voting unit 303 votes each point of such a distribution into the Hough voting space to form the vote distribution 185. Reference numeral 185 in the figure denotes a vote distribution generated using addition method 2.
[0049]
[Constraint ρ = 0] When the signals of the microphone 1a and the microphone 1b are A/D converted in phase by the acoustic signal input unit 2, the straight line to be detected always satisfies ρ = 0, that is, it passes through the origin of the XY coordinate system. The sound source estimation problem therefore reduces to searching for maximum values in the vote distribution S(θ, 0) on the θ axis, where ρ = 0, of the Hough voting space. FIG. 15 shows the result of searching for maximum values on the θ axis for the data illustrated in FIG. 14.
[0050]
In FIG. 15, the vote distribution 190 is the same as the vote distribution 185 in FIG. 14. The bar graph 192 is obtained by extracting the vote distribution S(θ, 0) on the θ axis 191 as H(θ). There are several local maximum points (protrusions) in this vote distribution H(θ). The straight line detection unit 304 searches the vote distribution H(θ) and (1) keeps only those positions for which, skipping any run of equal votes adjoining to the left and right, the first differing vote on each side is lower than the position's own. This extracts the local maxima of the vote distribution H(θ), but since a local maximum may have a flat top, consecutive maximum positions can remain. Therefore, (2) the straight line detection unit 304 leaves only the central position of each maximal run as the maximum position 193 by a thinning process. Finally, (3) only maximum positions whose votes are equal to or greater than a predetermined threshold are detected as straight lines. In this way, the θ of each straight line that obtained sufficient votes can be determined accurately. In the example of the figure, the maximum positions 194, 195, and 196 are detected by (2), each being the central position left by the thinning process from its run of local maxima (the right position takes precedence when the run has even length), and only 196 obtains votes above the threshold and is detected as a straight line. The straight line (reference straight line) 197 is defined by the θ and ρ (= 0) given by the maximum position 196. As the thinning algorithm, Tamura's method, described on pages 89 to 92 of Reference 2 cited in the explanation of the Hough transform, can be used in one dimension. When the straight line detection unit 304 detects one or more local maximum positions (central positions obtaining votes above the predetermined threshold) in this manner, it ranks them in descending order of votes and outputs the θ and ρ values of each local maximum.
[0051]
[Definition of a straight line group considering phase difference circulation] The straight line 197 exemplified in FIG. 15 is the line passing through the XY coordinate origin defined by the maximum position 196 at (θ0, 0). In reality, however, because of the cyclicity of the phase difference, the straight line 198 in FIG. 15, obtained by translating 197 in parallel by Δρ 199 so that it circulates in from the opposite side of the X range, is also a straight line representing the same arrival time difference as 197. Regarding 198 as the extension of the straight line 197 whose part protruding beyond the X range reappears cyclically from the opposite side, we call 198 the "circulation extension line" and 197 the "reference straight line". If the reference straight line 197 is inclined further, the number of circulation extension lines increases further. Letting the coefficient a be an integer of 0 or more, all straight lines with the same arrival time difference are the straight lines (θ0, aΔρ) obtained by translating the reference straight line 197 defined by (θ0, 0) in parallel by aΔρ. Furthermore, if the constraint ρ = 0 is removed and the intercept is generalized as ρ = ρ0, the straight line group can be described as (θ0, aΔρ + ρ0). Here Δρ is a signed value defined as a function Δρ(θ) of the inclination θ of the straight line by the equation shown in FIG. 16.
[0052]
In FIG. 16, the reference straight line 200 is defined by (θ, 0). Since the reference straight line 200 is inclined to the right, θ is a negative value by definition, but it is treated by its absolute value in the figure. The straight line 201 in FIG. 16 is a circulation extension line of the reference straight line 200 and intersects the X axis at point R. The distance between the reference straight line 200 and the circulation extension line 201 is Δρ, as shown by the auxiliary line 202, which intersects the reference straight line 200 perpendicularly at point O and the circulation extension line 201 perpendicularly at point U. Since the reference straight line is inclined to the right, Δρ is also negative by definition, but it too is treated by its absolute value in the figure. ΔOQP in FIG. 16 is a right triangle whose side OQ has length π, and ΔRTS is congruent with it. Therefore the length of side RT is also π, and the length of the hypotenuse OR of ΔOUR is 2π. Since Δρ is the length of side OU, Δρ = 2π cos θ. Taking the signs of θ and Δρ into account, the formula in the figure is derived.
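A sketch of Δρ(θ): the magnitude 2π cos θ follows the derivation above, while the sign convention used here (following the sign of θ, since the figure's exact signed formula is not reproduced) is an assumption.

```python
import numpy as np

def delta_rho(theta):
    """Signed parallel shift between a reference line and its
    circulation extension lines (FIG. 16): |delta_rho| = 2*pi*cos(theta).

    Sign assumed to follow theta (negative for right-inclined lines);
    delta_rho(0) = 0 reflects that a vertical line never circulates.
    """
    return np.sign(theta) * 2.0 * np.pi * np.cos(theta)
```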
[0053]
[Detection of maximum positions considering phase difference circulation] As described, because of the cyclicity of the phase difference, the straight line representing a sound source must be treated not as a single line but as a straight line group consisting of a reference straight line and its circulation extension lines. This must also be taken into account when detecting local maxima from the vote distribution. If detection is limited to sound sources near the front of the microphone pair, where phase difference circulation does not occur or occurs only on a small scale, the above-described method of searching for maximum positions using only the votes on ρ = 0 (or ρ = ρ0), that is, only the votes of the reference straight line, is not only sufficient in performance but also effective in shortening the search time and improving accuracy. However, in order to detect sound sources existing over a wider range, it is necessary to search for maximum positions by summing, for each θ, the votes at the several positions separated by Δρ. This difference is explained below.
[0054]
FIG. 17 shows the power spectra of the frequency components, the phase difference plot of each frequency component obtained from five consecutive FFT results (m = 5), and the Hough voting result (vote distribution) obtained from the same five FFT results, when processing actual speech in which two persons uttered from about 20 degrees left and about 45 degrees right of the front of the microphone pair in an indoor noise environment.
[0055]
The amplitude data acquired by the microphone pair is converted by the frequency decomposition unit 3 into power and phase values for each frequency component. In FIG. 17, reference numerals 210 and 211 plot the logarithm of the power value of each frequency component as luminance (darker means larger), with frequency on the vertical axis and time on the horizontal axis. One vertical line corresponds to one FFT result, graphed as time passes (rightward). The upper stage 210 is the result of processing the signal from the microphone 1a and the lower stage 211 that from the microphone 1b; a large number of frequency components are detected. From the frequency decomposition results, the phase difference calculation unit 301 obtains the phase difference of each frequency component, and the coordinate value determination unit 302 calculates its (x, y) coordinate value. The plot 212 shows the phase differences obtained by five consecutive FFTs from a certain time 213. In this plot 212, a point group distribution along the reference straight line 214 inclined to the left from the origin and a point group distribution along the reference straight line 215 inclined to the right are recognized. The voting unit 303 votes each point of such a distribution into the Hough voting space to form the vote distribution 216, which is generated using addition method 2.
[0056]
FIG. 18 shows the result of searching for maximum positions using only the votes on the θ axis. The vote distribution 220 in FIG. 18 is the same as the vote distribution 216 in FIG. 17. The bar graph 222 is obtained by extracting the vote distribution S(θ, 0) on the θ axis 221 as H(θ). Although there are several local maximum points (protrusions) in this vote distribution H(θ), the votes decrease as the absolute value of θ increases. As shown in the maximum position graph 223, four maximum positions 224, 225, 226, and 227 are detected from the vote distribution H(θ), of which only the maximum position 227 obtains votes above the threshold. As a result, one straight line group (reference straight line 228 and circulation extension line 229) is detected. This straight line group detects the sound from about 20 degrees left of the front of the microphone pair, but the sound from about 45 degrees right of the front cannot be detected. The width of the frequency band through which a reference straight line passes differs depending on θ (unfairness), since a reference straight line passing through the origin spans only a small frequency band before leaving the value range of X as its angle increases. Because the constraint ρ = 0 makes the lines compete for votes using only the reference straight line under this unfair condition, straight lines with large angles are at a disadvantage in votes. This is why the voice from about 45 degrees to the right could not be detected.
[0057]
On the other hand, FIG. 19 shows the result of searching for maximum positions by summing the votes of the several points separated by Δρ. In the figure, 240 indicates the positions of ρ when a straight line passing through the origin is translated in parallel by multiples of Δρ, overlaid on the vote distribution 216 of FIG. 17. The θ axis 241 and the broken lines 242 to 245, and the θ axis 241 and the broken lines 246 to 249, are each separated by natural-number multiples of Δρ(θ). Note that there is no broken line at θ = 0, where the straight line is certain to reach the ceiling of the plot without leaving the value range of X.
[0058]
The vote H(θ0) for a certain θ0 is calculated as the sum of the votes on the θ axis 241 and on the broken lines 242 to 249, viewed vertically at the position θ = θ0, that is, H(θ0) = Σ{S(θ0, aΔρ(θ0))}. This operation corresponds to adding the votes of the reference straight line with θ = θ0 and those of its circulation extension lines. The bar graph of this vote distribution H(θ) is shown at 250 in the figure. Unlike 222 in FIG. 18, in this distribution the votes do not decrease as the absolute value of θ increases, because adding the circulation extension lines to the vote calculation makes the same frequency band available for every θ. From this vote distribution 250, the ten maximum positions shown at 251 in the figure are detected. Among them, the maximum positions 252 and 253 obtain votes above the threshold: the straight line group corresponding to the maximum position 253 (reference straight line 254 and circulation extension line 255) detects the voice from about 20 degrees left of the front of the microphone pair, and the straight line group corresponding to the maximum position 252 (reference straight line 256 and circulation extension lines 257 and 258) detects the voice from about 45 degrees right of the front of the microphone pair. As described above, by searching for maximum positions while summing the votes of the points separated by Δρ, straight lines can be detected stably from small angles to large angles.
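The summation H(θ0) = Σ{S(θ0, aΔρ(θ0))} might be sketched as follows; the nearest-cell lookup and the inline sign convention for Δρ are illustrative choices.

```python
import numpy as np

def circular_vote_sum(space, thetas, rhos):
    """H(theta) = sum over a of S(theta, a*delta_rho(theta)): the votes
    of a reference line (rho = 0) plus its circulation extension lines.

    `space` is the quantized Hough voting space indexed [theta, rho].
    """
    h = np.zeros(len(thetas))
    rho_lim = np.abs(rhos).max()
    for i, th in enumerate(thetas):
        drho = np.sign(th) * 2.0 * np.pi * np.cos(th)  # shift per circulation
        a, rho = 0, 0.0
        while abs(rho) <= rho_lim:
            j = int(np.argmin(np.abs(rhos - rho)))     # nearest rho cell
            h[i] += space[i, j]
            if drho == 0.0:          # theta = 0: the line never circulates
                break
            a += 1
            rho = a * drho
    return h
```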
[0059]
[Maximum position detection for the non-in-phase case: generalization] When the signals of the microphone 1a and the microphone 1b are not A/D converted in phase by the acoustic signal input unit 2, the straight line to be detected does not satisfy ρ = 0, that is, it does not pass through the XY coordinate origin. In this case it is necessary to search for maximum positions without the constraint ρ = 0.
[0060]
If a reference straight line freed from the constraint ρ = 0 is generalized and described as (θ0, ρ0), the straight line group (the reference straight line and its circulation extension lines) can be described as (θ0, aΔρ(θ0) + ρ0), where Δρ(θ0) is the parallel shift of the circulation extension lines determined by θ0. When a sound comes from a certain direction, the corresponding straight line group at θ0 is only the single most dominant one; it is given by (θ0, aΔρ(θ0) + ρ0max), using the value ρ0max of ρ0 that maximizes the vote sum Σ{S(θ0, aΔρ(θ0) + ρ0)} of the straight line group as ρ0 is varied. Therefore, straight line detection can be performed by the same maximum position detection algorithm as under the constraint ρ = 0, by setting the vote H(θ) at each θ to the maximum over ρ0 of the vote sum Σ{S(θ, aΔρ(θ) + ρ0)}.
[0061]
The number of straight line groups detected in this manner is the number of sound sources.
[0062]
[Sound Source Information Generation Unit 6] As shown in FIG. 20, the sound source information generation unit 6 includes a direction estimation unit 311, a sound source component estimation unit 312, a sound source sound resynthesis unit 313, a time series tracking unit 314, a duration evaluation unit 315, an in-phase unit 316, an adaptive array processing unit 317, and a voice recognition unit 318.
[0063]
[Direction Estimation Unit 311] The direction estimation unit 311 is a means for receiving the straight line detection results of the straight line detection unit 304 described above, that is, the θ value of each straight line group, and calculating the existence range of the sound source corresponding to each straight line group.
At this time, the number of detected straight line groups is the number of sound sources (as candidates).
If the distance to a sound source is sufficiently large compared to the baseline of the microphone pair, the existence range of the sound source is a conical surface forming a fixed angle with the baseline of the microphone pair. This will be described with reference to FIG. 21.
[0064]
The arrival time difference ΔT between the microphone 1a and the microphone 1b can vary in the range of ±ΔTmax. As shown in FIG. 21A, when the sound is incident from the front, ΔT is 0 and the azimuth angle φ of the sound source is 0° with respect to the front. As shown in FIG. 21(b), when the sound is incident directly from the right side, that is, from the direction of the microphone 1b, ΔT equals +ΔTmax and the azimuth angle φ of the sound source is +90°, taking clockwise as positive with respect to the front. Similarly, as shown in FIG. 21(c), when the sound is incident directly from the left side, that is, from the direction of the microphone 1a, ΔT equals −ΔTmax and the azimuth angle φ is −90°. Thus, ΔT is defined as positive when the sound is incident from the right and negative when it is incident from the left.
[0065]
Based on the above, the general condition shown in FIG. 21(d) is considered. Let A be the position of the microphone 1a, B the position of the microphone 1b, and let the sound be incident from the direction of the line segment PA; then ΔPAB is a right triangle with a right angle at the vertex P. With the inter-microphone midpoint O and the line segment OC taken as the front direction of the microphone pair, the azimuth angle φ is defined as the angle from the OC direction, which is azimuth 0°, taking clockwise as positive. Since ΔQOB is similar to ΔPAB, the absolute value of the azimuth angle φ equals ∠OBQ, that is, ∠ABP, and its sign corresponds to the sign of ΔT. Further, ∠ABP can be calculated as sin⁻¹ of the ratio of PA to AB. Here, if the length of the line segment PA is represented by the corresponding ΔT, the length of the line segment AB corresponds to ΔTmax. Therefore, including the sign, the azimuth can be calculated as φ = sin⁻¹(ΔT/ΔTmax). The existence range of the sound source is then estimated as the conical surface 260 opened at (90−φ)° with the point O as the vertex and the baseline AB as the axis. The sound source is somewhere on this conical surface 260.
[0066]
As shown in FIG. 22, ΔTmax is the value obtained by dividing the distance L [m] between the microphones by the sound velocity Vs [m/sec]. It is known that the sound velocity Vs can be approximated as a function of the air temperature t [°C]. Now, suppose the straight line 270 is detected by the straight line detection unit 304 with the Hough inclination θ. Since the straight line 270 is inclined to the right, θ is a negative value. At y = k (frequency fk), the phase difference ΔPh indicated by the straight line 270 can be obtained as k·tan(−θ), as a function of k and θ. Then ΔT [sec] is the time obtained by multiplying one period (1/fk) [sec] of the frequency fk by the ratio of the phase difference ΔPh(θ, k) to 2π. Since θ is a signed quantity, ΔT is also a signed quantity: when the sound is incident from the right in FIG. 21D (the phase difference ΔPh is positive), θ is negative, and when the sound is incident from the left in FIG. 21D (the phase difference ΔPh is negative), θ is positive, so the sign of θ is inverted. In the actual calculation, the computation may be performed at k = 1 (the frequency immediately above the DC component k = 0).
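Putting the two preceding paragraphs together, the following sketch converts a detected line inclination θ into an azimuth; Vs = 331.5 + 0.6t is a standard approximation of the sound velocity as a function of air temperature (the patent does not give its formula), and Fr, N, and k are assumed parameters.

```python
import numpy as np

def azimuth_from_theta(theta, L, t_air=20.0, k=1, Fr=16000, N=512):
    """Azimuth phi [deg] of the sound source for a line group with
    Hough inclination theta and microphone spacing L [m]."""
    vs = 331.5 + 0.6 * t_air            # sound velocity [m/s] vs. temperature
    dt_max = L / vs                     # maximum arrival time difference [s]
    df = (Fr / 2) / (N / 2 - 1)
    fk = k * df                         # frequency at bin k
    dph = k * np.tan(-theta)            # phase difference of the line at y = k
    dt = (dph / (2 * np.pi)) / fk       # arrival time difference [s]
    return np.degrees(np.arcsin(np.clip(dt / dt_max, -1.0, 1.0)))
```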
[0067]
[Sound source component estimation unit 312] The sound source component estimation unit 312 is a means for evaluating the distance between the (x, y) coordinate value of each frequency component given by the coordinate value determination unit 302 and the straight lines detected by the straight line detection unit 304, detecting the points (i.e., frequency components) located in the vicinity of a straight line as frequency components of that straight line (i.e., sound source), and estimating the frequency components of each sound source based on this detection result.
[0068]
[Detection by the distance threshold method] FIG. 23 schematically shows the principle of sound source component estimation when there are a plurality of sound sources.
FIG. 23(a) is a plot of frequency against phase difference like that of FIG. 9, showing the case where two sound sources are present in different directions with respect to the microphone pair. In FIG. 23(a), 280 represents one straight line group, while 281 and 282 constitute another straight line group. The black circles in FIG. 23A indicate the phase difference positions of the individual frequency components.
[0069]
The frequency components constituting the sound source sound corresponding to the straight line group 280 are detected as the frequency components (black circles in the figure) located within the region 286 sandwiched between the straight line 284 and the straight line 285, each separated from the straight line 280 by the horizontal distance 283, as shown in the figure. That a certain frequency component is detected as a component of a straight line means that the frequency component belongs to that straight line.
[0070]
Similarly, as shown in FIG. 23C, the frequency components constituting the sound source sound corresponding to the straight line group (281, 282) are detected as the frequency components (black circles in the figure) located within the regions 287 and 288, sandwiched by the straight lines separated by the horizontal distance 283 from the straight line 281 and the straight line 282, respectively.
[0071]
At this time, since the two points of the frequency component 289 and the origin (DC component) are included in both the region 286 and the region 288, they are detected doubly as components of both sound sources (multiple attribution).
The method of thresholding the horizontal distance between a frequency component and a straight line in this way, selecting for each straight line group (sound source) the frequency components within the threshold, and using their power and phase directly as the components of the sound source sound, will be referred to as the "distance threshold method".
[0072]
[Detection by the Nearest Neighbor Method] FIG. 24 shows the result of assigning the multiply attributed frequency component 289 of FIG. 23 to only the closest straight line group. Comparing the horizontal distances from the frequency component 289 to the straight line 280 and to the straight line 282 shows that it is closest to the straight line 282; the frequency component 289 lies in the region 288 near the straight line 282. The frequency component 289 is therefore detected as a component belonging to the straight line group (281, 282), as shown in FIG. 24 (b). This method, which selects for each frequency component the straight line (sound source) closest in horizontal distance and, when that distance is within the predetermined threshold, uses the power and phase of the component directly as components of the source sound, will be called the "nearest neighbor method". As a special case, the direct current component (origin) is attributed to both straight line groups (sound sources).
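The nearest neighbor variant, under the same hypothetical representation, differs only in that each point is given to the single closest line:

```python
import numpy as np

def nearest_neighbor_method(points, lines, d_thresh):
    """Assign each point to the closest line by horizontal distance,
    provided that distance is within d_thresh; returns an array of
    line indices, with -1 where no line is close enough.  (The DC
    component at the origin would be attributed to every line, as the
    special case described above.)"""
    dists = np.stack([
        np.abs(points[:, 0]
               - (rho - points[:, 1] * np.sin(theta)) / np.cos(theta))
        for theta, rho in lines])            # shape (n_lines, n_points)
    nearest = np.argmin(dists, axis=0)
    ok = dists[nearest, np.arange(points.shape[0])] <= d_thresh
    return np.where(ok, nearest, -1)
```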
[0073]
[Detection by Distance Coefficient Method] The above two methods select only the frequency components lying within a predetermined horizontal distance threshold of the straight lines constituting a straight line group, and adopt them, with power and phase unchanged, as the frequency components of the source sound corresponding to that group. In contrast, the distance coefficient method described next calculates a non-negative coefficient α that decreases monotonically as the horizontal distance d between a frequency component and the straight line increases, and multiplies the power of the frequency component by α. In this method, components farther away in horizontal distance contribute to the source sound with weaker power.
[0074]
In this case no threshold processing by horizontal distance is necessary: the horizontal distance d to a straight line group (the horizontal distance to the nearest straight line in the group) is determined, the coefficient α is calculated from d, and the power of the frequency component multiplied by α is taken as the power of that component in the straight line group. The formula for the non-negative, monotonically decreasing coefficient α is arbitrary; one example is the sigmoid (S-shaped curve) function α = exp (− (B · d) <C>) shown in FIG. 25. As illustrated, taking B to be a positive number (1.5 in the figure) and C a number greater than 1 (2.0 in the figure) gives α = 1 at d = 0 and α → 0 as d → ∞. The steeper the decrease of α, that is, the larger the value of B, the more easily components off the straight line group are excluded, and the sharper the directivity toward the sound source direction; conversely, the gentler the decrease of α, that is, the smaller the value of B, the duller the directivity.
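The coefficient itself is a one-liner; the numbers below only mirror the figure's example values (B = 1.5, C = 2.0):

```python
import numpy as np

def distance_coefficient(d, b=1.5, c=2.0):
    """Sigmoid-shaped weight alpha = exp(-(B*d)**C): alpha = 1 at d = 0
    and alpha -> 0 as d -> infinity; larger B gives sharper directivity."""
    return np.exp(-(b * np.asarray(d, dtype=float)) ** c)

# A component's power is scaled by its horizontal distance d to the
# nearest line of a straight line group:
for d in (0.0, 0.2, 0.5, 1.0):
    print(d, 0.8 * distance_coefficient(d))
```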
[0075]
[Treatment of Multiple FFT Results] As described above, the voting unit 303 may vote on each FFT result individually or may vote collectively on m consecutive (m ≥ 1) FFT results. The functional blocks downstream of the straight line detection unit 304, which process the Hough voting result, therefore operate once per Hough transform period. When Hough voting is performed with m ≥ 2, the FFT results at a plurality of times are classified as components constituting the respective source sounds, and the same frequency component at different times may belong to different source sounds. To handle this, regardless of the value of m, the coordinate value determination unit 302 attaches to each frequency component (that is, to each black circle illustrated in FIG. 24) the start time of the frame in which it was acquired as its acquisition time information, making it possible to determine which frequency component at which time belongs to which sound source. The source sound is thus separated and extracted as time-series data of its frequency components.
[0076]
[Power Conservation Option] In each of the methods described above, the frequency components attributed to a plurality (N) of straight line groups (sound sources) (the direct current component in the nearest neighbor method; all frequency components in the distance coefficient method) can also be normalized so that the powers allocated to the N sound sources sum to the power value Po (fk) of the component before allocation. In this way, the total power over all sound sources is kept equal to the input for each frequency component at each time. This will be called the "power conservation option". There are two ways of thinking about the allocation.
[0077]
That is: (1) division into N equal parts (applicable to the distance threshold method and the nearest neighbor method), or (2) allocation according to the distance from each straight line group (applicable to the distance threshold method and the distance coefficient method).
[0078]
(1) is an allocation method in which normalization is achieved automatically by dividing into N equal parts; it is applicable to the distance threshold method and the nearest neighbor method, which determine attribution independently of distance.
[0079]
(2) is an allocation method that conserves the total power by determining coefficients in the same manner as the distance coefficient method and then normalizing them so that their sum becomes 1; it is applicable to the distance threshold method and the distance coefficient method, in which multiple attribution can occur at points other than the origin.
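A sketch of allocation scheme (2) for one frequency component (the function name and the example coefficients are illustrative):

```python
import numpy as np

def conserve_power(alphas, po):
    """Scale the coefficients assigned to one frequency component by
    the N straight line groups so that the allocated powers sum to the
    component's pre-allocation power Po(fk).  Scheme (1) is simply
    po / N for each group."""
    alphas = np.asarray(alphas, dtype=float)
    s = alphas.sum()
    return np.zeros_like(alphas) if s == 0.0 else po * alphas / s

print(conserve_power([0.9, 0.3], po=1.0))   # -> [0.75 0.25]
```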
[0080]
Note that the sound source component estimation unit 312 can be set to use any of the distance threshold method, the nearest neighbor method, and the distance coefficient method. The power conservation option described above can also be selected in the distance threshold method and the nearest neighbor method.
[0081]
[Sound source sound re-synthesizer 313] The sound source sound re-synthesizer 313 applies inverse FFT processing to the frequency components with the same acquisition time that make up each source sound, thereby regenerating (re-synthesizing) the amplitude data of the source sound for the frame section whose start time is that acquisition time. As illustrated in FIG. 3, each frame overlaps the next with a time difference equal to the frame shift amount. In the periods where a plurality of frames overlap, the amplitude data of all overlapping frames can be averaged to form the final amplitude data. This processing makes it possible to separate and extract the source sound as amplitude data.
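A minimal sketch of this overlap-and-average re-synthesis (windowing and the exact frame layout of FIG. 3 are simplified away; the function name is hypothetical):

```python
import numpy as np

def resynthesize(spectra, frame_shift):
    """spectra: list of complex FFT spectra (length n_fft), one per frame,
    holding only the frequency components attributed to one source.
    Each frame is inverse-FFT'd and overlapping samples are averaged."""
    n_fft = len(spectra[0])
    total = frame_shift * (len(spectra) - 1) + n_fft
    acc = np.zeros(total)
    cnt = np.zeros(total)
    for i, spec in enumerate(spectra):
        frame = np.fft.ifft(spec).real        # back to amplitude data
        start = i * frame_shift
        acc[start:start + n_fft] += frame
        cnt[start:start + n_fft] += 1.0
    return acc / np.maximum(cnt, 1.0)         # average where frames overlap
```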
[0082]
[Time Series Tracking Unit 314] As described above, the straight line detection unit 304 obtains a straight line group at every Hough vote by the voting unit 303. Since Hough voting is performed collectively on m consecutive (m ≥ 1) FFT results, straight line groups are obtained as a time series with a period of m frames (this will be referred to as the "figure detection period"). Since the θ of a straight line group corresponds one-to-one to the sound source direction φ calculated by the direction estimation unit 311, the trajectory of θ (or φ) on the time axis should be continuous for a stable sound source, whether the source is stationary or moving. On the other hand, depending on how the threshold is set, the straight line groups detected by the straight line detection unit 304 may include groups corresponding to background noise (referred to as "noise straight line groups"). The trajectory of θ (or φ) of such a noise straight line group, however, can be expected to be discontinuous or short.
[0083]
The time-series tracking unit 314 is a means for obtaining the trajectory of φ on the time axis by dividing the values of φ obtained for each figure detection period into groups that are continuous on the time axis. The grouping method will be described with reference to FIG. 26.
[0084]
(1) Prepare a trajectory data buffer. The trajectory data buffer is an array of trajectory data. One trajectory datum Kd holds a start time Ts, an end time Te, an array (straight line group list) of the straight line group data Ld constituting the trajectory, and a label number Ln. One straight line group datum Ld comprises the θ value and ρ value of one straight line group constituting the trajectory (from the straight line detection unit 304), the φ value representing the corresponding sound source direction (from the direction estimation unit 311), the frequency components corresponding to the straight line group (from the sound source component estimation unit 312), and the time at which they were acquired. The trajectory data buffer is initially empty. In addition, a parameter "new label number" is prepared for issuing label numbers, with its initial value set to 0.
[0085]
(2) At a certain time T, for each newly obtained φ (hereinafter φn; in the figure it is assumed that two values, shown as black circles 303 and 304, are obtained), the trajectory data Kd held in the trajectory data buffer (rectangles 301 and 302 in the figure) are examined by referring to their straight line group data Ld (the black circles arranged inside the rectangles): trajectory data are detected that hold an Ld whose difference in φ value from φn (305 and 306 in the figure) is within the predetermined angle threshold Δφ and whose difference in acquisition time (307 and 308 in the figure) is within the predetermined time threshold Δt. In the example, trajectory data 301 is detected for black circle 303, whereas the closest trajectory data 302 for black circle 304 is assumed not to satisfy the above conditions.
[0086]
(3) If trajectory data satisfying the conditions of (2) is found, as for black circle 303, φn is assumed to form the same trajectory: φn together with the corresponding θ value, ρ value, frequency components, and current time T is added to the straight line group list as new straight line group data of that trajectory Kd, and the current time T is set as the new end time Te of the trajectory. If a plurality of such trajectories are found, they are assumed to all form the same trajectory and are merged into the trajectory data with the youngest label number; the rest are deleted from the trajectory data buffer. The start time Ts of the merged trajectory data is the earliest start time among the trajectory data before merging, the end time Te is the latest end time, and the straight line group list is the union of their straight line group lists. As a result, black circle 303 is added to trajectory data 301.
[0087]
(4) If, as for black circle 304, no trajectory data satisfying the conditions of (2) is found, φn is regarded as the start of a new trajectory: new trajectory data is created in an empty slot of the trajectory data buffer, the start time Ts and end time Te are both set to the current time T, φn with the corresponding θ value, ρ value, frequency components, and current time T becomes the first entry of the straight line group list, and the value of the new label number is assigned as the label number Ln of this trajectory, after which the new label number is incremented by 1 (returning to 0 when it reaches the predetermined maximum value). As a result, black circle 304 is registered in the trajectory data buffer as new trajectory data.
[0088]
(5) Any trajectory data held in the trajectory data buffer for which the predetermined time Δt has elapsed from its last update (that is, from its end time Te) to the current time T is output to the duration evaluation unit 315 of the next stage as a trajectory for which no new φn was found, i.e., whose tracking has expired, and is then deleted from the trajectory data buffer. In the example of the figure, trajectory data 302 corresponds to this case.
[0089]
[Duration evaluation unit 315] The duration evaluation unit 315 calculates the duration of each expired trajectory output by the time-series tracking unit 314 from its start time and end time; trajectories whose duration exceeds a predetermined threshold are identified as trajectory data based on source sound, and the others as trajectory data based on noise. Trajectory data based on source sound is called sound source stream information. The sound source stream information comprises the start time Ts and end time Te of the source sound and the time-series trajectory data of θ, ρ, and φ representing the sound source direction. Although the number of straight line groups found by the figure detection unit 5 gives a number of sound sources, noise sources are included in it; the number of sound source stream information items produced by the duration evaluation unit 315 gives the number of reliable sound sources, excluding those based on noise.
[0090]
[In-phase unit 316] The in-phase unit 316 refers to the sound source stream information from the time-series tracking unit 314 to obtain the time transition of the stream's sound source direction φ, finds the maximum value φmax and minimum value φmin of φ, calculates the intermediate value φmid = (φmax + φmin) / 2, and determines the width φw = φmax − φmid. It then extracts the time-series data of the two frequency-resolved data sets a and b underlying the sound source stream information, from a predetermined time before the stream's start time Ts to a predetermined time after its end time Te, and brings the two into phase by correcting them so as to cancel the arrival time difference calculated back from the intermediate value φmid.
[0091]
Alternatively, the time-series data of the two frequency-resolved data sets a and b may be kept in phase at all times, using as φmid the sound source direction φ at each time from the direction estimation unit 311. Whether to refer to the sound source stream information or to φ at each time is determined by the operation mode, which can be set and changed as a parameter.
[0092]
[Adaptive array processing unit 317] The adaptive array processing unit 317 applies adaptive array processing to the extracted, in-phase time-series data of the two frequency-resolved data sets a and b, directing the center of the directivity at the front (0°) and setting the tracking range to ±φw plus a predetermined margin, thereby separating and extracting the time-series frequency-component data of the stream's source sound with high accuracy. Although it differs in method, this functions in the same way as the sound source component estimation unit 312 in that time-series data of frequency components is separated and extracted. The sound source sound re-synthesizer 313 can therefore also re-synthesize the amplitude data of the source sound from the time-series frequency-component data produced by the adaptive array processing unit 317.
[0093]
As the adaptive array processing, a method that cleanly separates and extracts speech within a set directivity range can be applied, for example the Griffith-Jim type generalized sidelobe canceller, known as a construction method for beamformers with a main and a sub path, described in Reference 3 (Amada et al., "Microphone array technology for speech recognition", Toshiba Review, Vol. 59, No. 9, 2004).
[0094]
Usually, when adaptive array processing is used, the tracking range must be set in advance, and only speech from that direction can be waited for; to wait for speech from all directions, a large number of adaptive arrays with different tracking ranges had to be prepared. In this embodiment, by contrast, the adaptive arrays are run only after the number of sound sources and their directions have actually been obtained, so only as many adaptive arrays as there are sound sources need operate; moreover, since a narrow tracking range can be set in accordance with each source direction, speech can be separated and extracted efficiently and with good quality.
[0095]
In addition, by bringing the time-series data of the two frequency-resolved data sets a and b into phase in advance, sound from any direction can be processed merely by setting the tracking range of the adaptive array processing at the front.
[0096]
[Speech Recognition Unit 318] The speech recognition unit 318 analyzes and collates the time-series frequency-component data of the source sound extracted by the sound source component estimation unit 312 or the adaptive array processing unit 317 to obtain the symbolic content of the stream; that is, it extracts symbols (strings) representing the linguistic meaning, the type of sound source, and the speaker.
[0097]
In addition, the functional blocks from the direction estimation unit 311 to the speech recognition unit 318 can exchange information as necessary over connections not shown in FIG. 20.
[0098]
[Output Unit 7] The output unit 7 is a means for outputting, as the sound source information generated by the sound source information generation unit 6, information including at least one of the following: the number of sound sources, obtained as the number of straight line groups by the figure detection unit 5; the spatial existence range of each sound source (the angle φ determining the conical surface), estimated by the direction estimation unit 311; the component configuration of the sound emitted by each sound source (time-series data of power and phase for each frequency component), estimated by the sound source component estimation unit 312; the separated voice of each sound source (time-series data of amplitude values), re-synthesized by the sound source sound re-synthesizer 313; the number of sound sources excluding noise sources, determined by the time-series tracking unit 314 and the duration evaluation unit 315; the temporal existence period of the sound emitted by each sound source, likewise determined by the time-series tracking unit 314 and the duration evaluation unit 315; the separated voice of each sound source (time-series data of amplitude values) obtained via the in-phase unit 316 and the adaptive array processing unit 317; and the symbolic content of each source sound, determined by the speech recognition unit 318.
[0099]
[User Interface Unit 8] The user interface unit 8 is a means that presents to the user the various settings necessary for the acoustic signal processing described above, accepts setting input from the user, saves the settings to the external storage device, and reads them back from it. It also visualizes various processing results and intermediate results and presents them to the user, and lets the user select desired data for more detailed visualization, for example: (1) display of the frequency components for each microphone; (2) display of the phase difference (or time difference) plot (that is, of the two-dimensional data); (3) display of the various vote distributions; (4) display of maximum positions; (5) display of the straight line groups on the plot; (6) display of the frequency components belonging to each straight line group; and (7) display of the trajectory data shown in FIG. 26. In this way the user can confirm the functioning of the acoustic signal processing apparatus according to the present embodiment, adjust it so that it performs the desired operation, and thereafter use the apparatus in the adjusted state.
[0100]
[Flowchart of Processing] FIG. 27 is a flowchart showing the flow of processing performed by the acoustic signal processing apparatus according to the present embodiment. This processing comprises an initial setting processing step S1, an acoustic signal input processing step S2, a frequency resolution processing step S3, a two-dimensional data conversion processing step S4, a figure detection processing step S5, a sound source information generation processing step S6, an output processing step S7, an end determination processing step S8, a confirmation determination processing step S9, an information presentation/setting acceptance processing step S10, and an end processing step S11.
[0101]
The initial setting processing step S1 executes part of the processing in the user interface unit 8 described above: it reads the various settings necessary for acoustic signal processing from the external storage device and initializes the apparatus to a predetermined setting state.
[0102]
The acoustic signal input processing step S2 executes the processing in the acoustic signal input unit 2 described above: it inputs two acoustic signals captured at two spatially distinct positions.
[0103]
The frequency resolution processing step S3 executes the processing in the frequency resolution unit 3 described above: it performs frequency resolution on each of the acoustic signals input in the acoustic signal input processing step S2, calculating the phase value for each frequency and, if necessary, the power value.
[0104]
The two-dimensional data conversion processing step S4 executes the processing in the two-dimensional data conversion unit 4 described above: it compares the per-frequency phase values of the input acoustic signals calculated in the frequency resolution processing step S3 to compute the phase difference value for each frequency, and converts each per-frequency phase difference into a point on an XY coordinate system, with a function of frequency as the Y axis and a function of phase difference as the X axis, that is, into an (x, y) coordinate value uniquely determined by the frequency and its phase difference.
[0105]
The figure detection processing step S5 executes the processing in the figure detection unit 5 described above: it detects a predetermined figure from the two-dimensional data generated in the two-dimensional data conversion processing step S4.
[0106]
The sound source information generation processing step S6 executes the processing in the sound source information generation unit 6 described above: based on the information of the figure detected in the figure detection processing step S5, it generates sound source information including at least one of the number of sound sources that are the generation sources of the acoustic signals, the spatial existence range of each sound source, the component configuration of the sound emitted by each sound source, the separated voice of each sound source, the temporal existence period of the sound emitted by each sound source, and the symbolic content of each source sound.
[0107]
The output processing step S7 executes the processing in the output unit 7 described above: it outputs the sound source information generated in the sound source information generation processing step S6.
[0108]
The end determination processing step S8 executes part of the processing in the user interface unit 8 described above: it checks whether the user has issued an end instruction; if so, control passes to the end processing step S11 (left branch), and if not, to the confirmation determination processing step S9 (upper branch).
[0109]
The confirmation determination processing step S9 executes part of the processing in the user interface unit 8 described above: it checks whether the user has issued a confirmation instruction; if so, control passes to the information presentation/setting acceptance processing step S10 (left branch), and if not, to the acoustic signal input processing step S2 (upper branch).
[0110]
The information presentation/setting acceptance processing step S10 executes the part of the processing in the user interface unit 8 that runs in response to a confirmation instruction from the user: it presents the various settings necessary for acoustic signal processing to the user, accepts setting input from the user, saves the settings to the external storage device on a save instruction, reads them from the external storage device on a read instruction, and visualizes various processing results and intermediate results for the user, or lets the user select desired data for more detailed visualization. The user can thereby confirm the operation of the acoustic signal processing, adjust it so that it performs the desired operation, and let the processing continue in the adjusted state.
[0111]
The end processing step S11 executes the part of the processing in the user interface unit 8 that runs in response to an end instruction from the user: it automatically saves the various settings necessary for acoustic signal processing to the external storage device.
[0112]
[Modified Examples] Here, modified examples of the embodiment described above will be described.
[0113]
[Detection of Vertical Lines] The two-dimensional data conversion unit 4 has the coordinate value determination unit 302 generate two-dimensional data with the X coordinate value set to the phase difference ΔPh (fk) and the Y coordinate value set to the frequency component number k. Alternatively, the X coordinate value can be set to the per-frequency estimate of the arrival time difference calculated from the phase difference, ΔT (fk) = (ΔPh (fk) / 2π) × (1 / fk). If arrival time differences are used instead of phase differences, points with the same arrival time difference, that is, points originating from the same sound source, line up on a vertical straight line.
[0114]
Here, the higher the frequency, the smaller the time difference ΔT (fk) that can be expressed by ΔPh (fk). As shown schematically in FIG. 28A, if the time represented by one cycle of the wave 290 of frequency fk is T, then the time representable by one cycle of the wave 291 of the double frequency 2fk is T / 2, that is, half. Taking the X axis as the time difference as in FIG. 28A, the observable range is ±Tmax, and no time difference beyond this range is observed. At low frequencies, below the limit frequency 292 at which Tmax corresponds to half a cycle (that is, π) or less, the arrival time difference ΔT (fk) can be determined uniquely from the phase difference ΔPh (fk); above the limit frequency, however, the ΔT (fk) that can be calculated is smaller than the theoretically possible Tmax, and only the range between the straight lines 293 and 294 can be represented, as shown in the figure. This is the same problem as the phase difference circulation described above.
[0115]
Therefore, to solve the phase difference circulation problem, the coordinate value determination unit 302, as shown schematically in FIG. 29, generates for each single ΔPh (fk) in the frequency range above the limit frequency 292 redundant points within the range ±Tmax, at the ΔT positions corresponding to the phase differences obtained by adding or subtracting 2π, 4π, 6π, and so on, and turns them into two-dimensional data. The generated points are the black circles in the figure; in the frequency range above the limit frequency 292, a plurality of black circles are plotted for a single frequency.
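A sketch of this redundant point generation for a single bin (the names and the enumeration bound are ours):

```python
import numpy as np

def redundant_points(delta_ph, fk, t_max):
    """For one phase difference (radians) at frequency fk (Hz), return all
    candidate arrival time differences within +/- t_max obtained by adding
    integer multiples of 2*pi to the phase (phase difference circulation)."""
    n_max = int(np.ceil(fk * t_max)) + 1      # beyond this, |dT| > t_max
    pts = []
    for n in range(-n_max, n_max + 1):
        dt = (delta_ph / (2.0 * np.pi) + n) / fk
        if abs(dt) <= t_max:
            pts.append(dt)
    return pts
```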
[0116]
In this way, from the two-dimensional data generated as one or more points per phase difference value, the voting unit 303 and the straight line detection unit 304 can detect strong vertical lines (295 in the figure) by voting. Since a vertical line is a straight line with θ = 0 in the Hough voting space, the vertical line detection problem reduces to detecting, in the vote distribution after Hough voting, maximum positions on the ρ axis at θ = 0 that obtain votes at or above a predetermined threshold. The ρ value of a detected maximum position gives an estimate of the intersection of the vertical line with the X axis, that is, the arrival time difference ΔT. For the voting, the voting conditions and addition methods described in the explanation of the voting unit 303 can be used as they are. Note also that the straight line corresponding to a sound source is in this case not a straight line group but a single vertical line.
[0117]
The maximum-position problem can also be solved by voting the X coordinate values of the redundant point group described above into a one-dimensional vote distribution (a marginal distribution, projected and voted in the Y-axis direction) and detecting the maximum positions that obtain votes at or above a predetermined threshold. Thus, by using the arrival time difference instead of the phase difference for the X axis, all the evidence representing sound sources present in different directions is mapped onto straight lines with the same slope (namely, vertical lines), which can easily be detected either by Hough voting or from the marginal distribution.
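The marginal-distribution variant is a one-dimensional histogram vote; a sketch follows (the bin count and threshold are illustrative, and votes could equally be weighted by power as in the voting unit 303):

```python
import numpy as np

def vote_time_differences(dts, t_max, n_bins=200, threshold=10):
    """Project all candidate arrival time differences onto the X axis and
    return the bin centers whose vote counts reach the threshold: these
    are the vertical lines, i.e. the per-source arrival time differences."""
    hist, edges = np.histogram(dts, bins=n_bins, range=(-t_max, t_max))
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[hist >= threshold]
```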
[0118]
The sound source direction information obtained by finding a vertical line is the arrival time difference ΔT, obtained as ρ rather than θ. The direction estimation unit 311 can therefore calculate the sound source direction φ directly from ΔT, without going through θ.
[0119]
As described above, the two-dimensional data produced by the two-dimensional data conversion unit 4 is not limited to one type, nor is the figure detection method of the figure detection unit 5 limited to one. Note that the plot of the point group using arrival time differences illustrated in FIG. 29 and the detected vertical lines are also objects of the information presented to the user by the user interface unit 8.
[0120]
[Parallel Implementation of Multiple Systems] Although the above example has been described with the simplest configuration of two microphones, it is also possible, as shown in FIG. 30, to provide N (N ≥ 3) microphones and configure up to M (1 ≤ M ≤ NC2) microphone pairs.
[0121]
Reference numerals 11 to 13 in the figure denote the N microphones. Numeral 20 is a means for inputting the N acoustic signals from the N microphones; 21 is a means for frequency-resolving the N input acoustic signals; 22 is a means for generating two-dimensional data for each of M (1 ≤ M ≤ NC2) pairs formed from two of the N acoustic signals; 23 is a means for detecting a predetermined figure from each of the M generated sets of two-dimensional data; 24 is a means for generating sound source information from each of the M detected sets of figure information; and 25 is a means for outputting the generated sound source information. Numeral 26 is a means for presenting to the user the various settings, including the information of the microphones constituting each pair, accepting setting input from the user, saving settings to the external storage device, reading settings from the external storage device, and presenting various processing results to the user. The processing for each microphone pair is the same as in the embodiment described above, and is performed in parallel for the plurality of microphone pairs.
[0122]
In this way, even if one microphone pair has a direction in which it performs poorly, covering that direction with a plurality of pairs reduces the risk of losing correct sound source information.
[0123]
[Implementation Using a General-Purpose Computer: Program] As shown in FIG. 31, the present invention can also be implemented on a general-purpose computer capable of executing a program that realizes the acoustic signal processing functions according to the present invention. Reference numerals 31 to 33 in the figure denote N microphones; 40 is an A/D conversion means for inputting the N acoustic signals from the N microphones; and 41 is a CPU that executes the program instructions for processing the N input acoustic signals. Reference numerals 42 to 47 denote the standard devices constituting a computer: the RAM 42, the ROM 43, the HDD 44, the mouse/keyboard 45, the display 46, and the LAN 47. Reference numerals 50 to 52 denote drives for supplying programs and data to the computer from outside via storage media: a CD-ROM drive 50, an FDD 51, and a CF/SD card slot 52. Reference numeral 48 is a D/A conversion means for outputting an acoustic signal, with a speaker 49 connected to its output. This computer stores in the HDD 44 an acoustic signal processing program for executing the processing steps shown in FIG. 27, reads it into the RAM 42, and executes it on the CPU 41, thereby functioning as an acoustic signal processing apparatus. The functions of the user interface unit 8 are realized using the HDD 44 as the external storage device, the mouse/keyboard 45 for receiving operation input, and the display 46 and the speaker 49 as information presentation means. The sound source information obtained by the acoustic signal processing is stored in and output from the RAM 42, the ROM 43, or the HDD 44, or is output by communication over the LAN 47.
[0124]
[Recording Medium] The present invention can also be implemented as a computer-readable recording medium, as shown in FIG. 32. Reference numeral 61 in the figure denotes a recording medium, realized as a CD-ROM, CF or SD card, floppy (registered trademark) disk, or the like, on which the acoustic signal processing program according to the present invention is recorded. The program can be executed by inserting the recording medium 61 into an electronic device 62 such as a television or computer, the electronic device 63, or the robot 64; alternatively, the program can be supplied by communication from the electronic device 63 to another electronic device 65 or to the robot 64 and executed on the electronic device 65 or the robot 64.
[0125]
[Correction of Sound Velocity by a Temperature Sensor] The present invention can also be implemented with a temperature sensor provided in the apparatus for measuring the outside air temperature, correcting the sound velocity Vs in FIG. 22 on the basis of the temperature data measured by the sensor so as to obtain a correct Tmax.
[0126]
Alternatively, the present invention can be implemented with sound wave transmitting means and receiving means arranged in the apparatus at a predetermined spacing, measuring the time until a sound wave emitted by the transmitting means reaches the receiving means, and thereby calculating and correcting the sound velocity Vs directly to obtain the correct Tmax.
[0127]
[Unequal Spacing of θ for Equal Spacing of φ] In addition, when the Hough transform is performed to obtain the inclinations of straight line groups, θ is quantized, for example in steps of 1°. When θ is quantized at equal intervals, the estimable values of the sound source direction φ are quantized at unequal intervals. The present invention can therefore also be implemented by quantizing θ so that φ is equally spaced, making coarseness and density less likely to arise in the estimation accuracy of the sound source direction.
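The idea can be sketched generically: given the (monotone) mapping from φ to θ implied by the geometry of FIG. 22, sample φ uniformly and take the image grid in θ. The mapping passed in below is a made-up stand-in, not the patent's formula:

```python
import numpy as np

def theta_grid_for_equal_phi(phi_to_theta, phi_step_deg=1.0):
    """Quantize theta at unequal intervals so that the corresponding
    sound source directions phi are equally spaced."""
    phis = np.deg2rad(np.arange(-90.0, 90.0 + phi_step_deg, phi_step_deg))
    return np.array([phi_to_theta(p) for p in phis])

# Illustration only: a hypothetical monotone phi -> theta mapping.
grid = theta_grid_for_equal_phi(lambda p: np.arctan(0.5 * np.sin(p)))
```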
[0128]
The method described in Non-Patent Document 2 above estimates the number, directions, and components of sound sources by detecting, from the frequency-resolved data, the fundamental frequency components constituting harmonic structures together with their harmonic components. Since it presupposes a harmonic structure, this method can be said to be specialized for the human voice. In a real environment, however, there are many sound sources without harmonic structure, such as the sound of a door opening and closing, and such sound sources cannot be handled by this method.
[0129]
In addition, although the method described in Non-Patent Document 1 above is not limited to a specific model, as long as only two microphones are used, the number of sound sources it can handle is limited to one.
[0130]
In contrast, according to the embodiment of the present invention, by dividing the per-frequency phase differences into groups for each sound source using the Hough transform, localization and separation of two or more sound sources can be realized with two microphones. Moreover, since no restrictive model such as a harmonic structure is used, the method can be applied to sound sources with a far wider range of properties.
[0131]
The other functions and effects of the embodiment of the present invention are summarized as follows.
[0132]
- A wide variety of sound sources can be detected stably by using, in the Hough voting, a voting method suited to detecting sound sources with many frequency components and sound sources with high power.
[0133]
- Sound sources can be detected efficiently and accurately by taking into account the constraint ρ = 0 and the phase difference circulation at the time of straight line detection.
[0134]
- Using the straight line detection results, useful sound source information can be determined, including the spatial existence range of the sound sources that generated the acoustic signals, the temporal existence period of each source sound, the component configuration of each source sound, the separated voice of each source sound, and the symbolic content of each source sound.
[0135]
- When estimating the frequency components of each source sound, the source sounds can be separated individually by simple methods: selecting the components in the vicinity of a straight line, determining which straight line a given component belongs to, or multiplying components by coefficients according to their distance from each straight line.
[0136]
- By knowing the direction of each sound source in advance, the directivity range of the adaptive array processing can be set adaptively, and the source sound can be separated with higher accuracy.
[0137]
- The symbolic content of each source sound can be determined by separating and recognizing each source sound with high accuracy.
[0138]
- It becomes possible for the user to confirm the operation of the apparatus, adjust it so that it performs the desired operation, and use the apparatus in the adjusted state.
[0139]
The present invention is not limited to the above embodiment as it is; at the implementation stage, the constituent elements can be modified and embodied without departing from the scope of the invention. In addition, various inventions can be formed by appropriate combination of the plurality of constituent elements disclosed in the above embodiment. For example, some constituent elements may be deleted from the full set shown in the embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.
[0140]
[Brief Description of the Drawings]
- Functional block diagram of an acoustic signal processing apparatus according to an embodiment of the present invention
- Diagram showing the sound source direction and the arrival time difference observed in the acoustic signals
- Diagram showing the frame shift amount and the procedure of the short-time Fourier transform (FFT) processing
- Functional block diagram showing the internal configuration of the two-dimensional data conversion unit and of the figure detection unit
- Diagram showing the procedure of the phase difference calculation
- Diagram showing the procedure of the coordinate value calculation
- Diagram showing the proportional relationship between phase difference and frequency for the same time difference
- Diagram for explaining the cyclicity of the phase difference
- Plot of frequency and phase difference in the presence of multiple sound sources
- Diagram for explaining the linear Hough transform
- Diagram for explaining the detection of a straight line from a point group by the Hough transform
- Diagram showing the function (calculation formula) of the average power to be voted
- Diagram showing the frequency components, the phase difference plot, and the Hough voting result generated from a voice, and the relationship between the maximum positions and the straight lines obtained
- Diagram showing the relationship of Δρ
- Diagram showing the frequency components, phase difference plot, and Hough voting result at the time of simultaneous speech
- Diagrams showing the results of searching for maximum positions using only the vote values on the θ axis
- Functional block diagram showing the internal configuration of the sound source information generation unit
- Diagram for explaining direction estimation
- Diagram showing the relationship between θ and ΔT
- Diagram for explaining sound source component estimation (distance threshold method) in the presence of multiple sound sources
- Diagram for explaining the nearest neighbor method
- Diagram showing an example of the calculation formula of the coefficient α
- Diagram showing the tracking of φ on the time axis
- Flowchart showing the flow of processing performed by the acoustic signal processing apparatus
- Diagram showing the relationship between frequency and representable time differences
- Plot of time differences when redundant points are generated
- Functional block diagram of an acoustic signal processing apparatus according to a modified embodiment comprising N microphones
- Functional block diagram of an embodiment in which the acoustic signal processing functions according to the present invention are realized using a general-purpose computer
- Diagram showing an embodiment of a recording medium on which a program realizing the acoustic signal processing functions according to the present invention is recorded
Explanation of Reference Numerals
[0141]
DESCRIPTION OF SYMBOLS 1a, 1b ... Microphone; 2 ... Acoustic signal input unit; 3 ... Frequency resolution unit; 4 ... Two-dimensional data conversion unit; 5 ... Figure detection unit; 6 ... Sound source information generation unit; 7 ... Output unit