JP2004109712

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2004109712
An object of the present invention is to eliminate the influence of the vocal tract and to improve
the accuracy of speech / noise determination even for speech with a weak signal. Linear
prediction circuits perform linear prediction on the input signals from the microphones and
generate linear prediction signals. Subtractors A and B subtract the linear prediction signals from
the corresponding microphone input signals to generate residual signals. The evaluation function
calculation circuit 5 applies to the residual signals an evaluation function defined by a relational
expression of an autocorrelation function and a cross-correlation function, and detects the
maximum of the evaluation function value. The direction detection circuit 6 detects the direction
of the speaker based on the phase difference obtained by the evaluation function calculation
circuit 5. Further, the zero-crossing count detection circuit 7 detects the number of zero crossings
of the input signals from the microphones. The speech / noise determination circuit 8 performs
speech / noise determination based on the number of zero crossings, and when noise is
determined, the direction detection output from the direction detection circuit 6 is stopped to
prevent malfunction. [Selected figure] Figure 1
Speaker direction detector
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
speaker direction detecting device, and more particularly to an apparatus that, like a television
conference device, has a video camera for video input and a microphone for voice input and is
installed at a fixed location, and that uses the audio signal to detect the direction of the speaker
as viewed from the apparatus in order to control the imaging angle of the video camera. [0002]
When trying to control the imaging angle of a video camera using the output signal of a speaker
direction detection device, if a detection error occurs in the speaker direction, the video camera
04-05-2019
1
moves in a direction other than the speaker, which is a problem for users of a television
conference apparatus or the like. A conventional speaker direction detection apparatus of this
type therefore estimates the arrival time difference, caused by the difference in the distances
over which a voice signal reaches two microphones, using the sum of the cross-correlation
function for each time difference, and includes means for detecting the direction of the speaker
by detecting the maximum of that summed cross-correlation function (for example, see Patent
Document 1). Since the arrival time difference at which the cross-correlation function value is
largest is the one that maximizes the autocorrelation function value, the direction of the sound
wave is calculated and estimated from that arrival time difference, and the estimate is converted
into the speaker direction. Because the cross-correlation function values are added over a
certain time (statistical processing) before the maximum value search is performed, search
errors are kept to a minimum. With this configuration, the detection error in the speaker
direction can be reduced even if a signal coming from a direction other than the speaker is
superimposed on the voice signal of the speaker. In the above-described prior art, as a specific
example of the evaluation
function, a relational expression of the autocorrelation function and the cross-correlation
function is used; specifically, the square of the cross-correlation function divided by the
autocorrelation function is presented as that relational expression. When the speaker direction
detecting device is miniaturized, the distance between the microphones is not particularly wide,
so the voice samples input to the microphones originate from the same sound source and, apart
from the delay, can be regarded as substantially the same waveform. The cross-correlation of the
inputs to the two microphones can therefore be approximated by an autocorrelation, including
the delay, of the input to a single microphone. It is well known in the art that autocorrelation
can be used for pitch detection. Since sound source information and vocal tract information are
mixed in the speech waveform, it is also known that the extraction error is reduced if pitch
detection is performed after removing the influence of the vocal tract (see Non-Patent
Document 1, for example).
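The prior-art idea described above (sum the cross-correlation over time for each candidate time difference, then search for the maximum) can be illustrated with a minimal Python sketch. The function names, the frame length and the lag range are assumptions made for illustration and do not come from the cited publication.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Return c[k] = sum_n x[n] * y[n + k] for k = -max_lag .. max_lag."""
    n = len(x)
    values = np.empty(2 * max_lag + 1)
    for i, k in enumerate(range(-max_lag, max_lag + 1)):
        if k >= 0:
            values[i] = np.dot(x[:n - k], y[k:])
        else:
            values[i] = np.dot(x[-k:], y[:n + k])
    return values

def estimate_delay(x, y, frame_len, max_lag):
    """Add the per-frame cross-correlations, then return the lag (in samples)
    with the largest sum. A positive result means y lags behind x."""
    total = np.zeros(2 * max_lag + 1)
    for start in range(0, len(x) - frame_len + 1, frame_len):
        total += cross_correlation(x[start:start + frame_len],
                                   y[start:start + frame_len], max_lag)
    return int(np.argmax(total)) - max_lag
```

Adding the per-frame values before taking the maximum is the "statistical processing" step: a single noisy frame then has little effect on the chosen lag.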
Also, it is well known in the field that the speech spectrum can be separated into a relatively
slowly changing spectral envelope and a spectral fine structure that changes over a short
interval, the former corresponding to the resonance characteristic of the vocal tract and the
latter to the driving sound source characteristic. Furthermore, it is known that the residual
signal after linear prediction analysis (see, for example, Non-Patent Document 2) has a flat
spectral envelope and carries only the fine structure information of the spectrum, which derives
from the driving sound source. Further, in the technique described in the above-mentioned
publication, the speech / noise determination is performed by comparing the input signal from
the microphone, or the autocorrelation value calculated from the input signal, with a threshold
value. In this connection, a technique is known that uses the number of zero crossings per unit
time of a signal waveform to distinguish a voiced section from a silent section (see,
for example, Non-Patent Document 3).
[Patent Document 1] Japanese Patent Application Laid-Open No. 2001-236092 (pages 1-6, FIG. 1 to FIG. 7)
[Non-Patent Document 1] Noritaka Kitawaki et al., "Sound Communication Engineering", Corona Publishing, March 30, 1996, pp. 22-23
[Non-Patent Document 2] Takei Yasui et al., "Computer speech processing", Akiba Publishing, June 20, 1988, pp. 43-46
[Non-Patent Document 3] Suzuki Hisaki, "Digital Signal Processing of Speech", Corona Publishing, April 15, 1983, pp. 136-141
SUMMARY OF THE INVENTION A speech waveform mixes driving sound source information,
whose content is determined by the movement of the mouth, with vocal tract information of
uniform content produced through the throat, and the accuracy of the autocorrelation increases
when the vocal tract information is removed. However, the technique described in the
above-mentioned publication says nothing about removing the vocal tract information, so it has a
first problem: the accuracy of the autocorrelation is lower than when the vocal tract information
is removed. In addition, that technique performs the speech / noise determination by comparing
the input signal from the microphone, or the autocorrelation value calculated from the input
signal, with a threshold value. Voice from a speaker relatively far from the microphone has low
power (sound pressure) and is therefore easily judged to be noise, which gives a second
problem: the sensitivity of direction detection tends to be reduced. Therefore, a first object of
the present invention is to provide a speaker direction detecting device with fewer malfunctions
and improved stability by removing the influence of vocal tract information from the input
signals from the microphones.
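The vocal tract removal named in the first object amounts to predicting each sample from the preceding ones and keeping only the prediction error. The following Python sketch shows one common way to do this, the autocorrelation method of linear prediction. The predictor order, the function names and the choice of this particular method are assumptions for illustration; the text does not say how the prediction coefficients are obtained.

```python
import numpy as np

def lpc_coefficients(x, order):
    """Autocorrelation-method linear prediction: solve R a = r for the
    predictor taps a[0..order-1], where x[n] is predicted as
    sum_k a[k-1] * x[n-k]. Assumes x is not all zeros."""
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lpc_residual(x, order=8):
    """Return the residual: the input minus its linear prediction."""
    a = lpc_coefficients(x, order)
    predicted = np.zeros(len(x))
    for k in range(1, order + 1):
        predicted[k:] += a[k - 1] * x[:-k]
    return x - predicted
```

For a signal shaped by a resonant filter, the residual has a much flatter spectrum and lower variance than the input, which is the property the correlation stages rely on.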
The second object of the present invention is to provide a speaker direction detecting apparatus
that performs speech / noise determination on the input signals from the microphones and
eliminates the influence of ambient noise, thereby reducing erroneous operation and enhancing
the stability of speaker direction detection. The speaker direction detecting apparatus according
to the present invention estimates the arrival time difference, caused by the difference in the
distances over which a speech signal reaches two microphones, with an evaluation function that
uses a relational expression of an autocorrelation function and a cross-correlation function. It is
characterized in that, as the input signals to this evaluation function, it uses signals from which
the influence of the vocal tract has been removed: linear prediction is performed on each of the
input signals from the microphones, and the linear prediction signals are subtracted from those
input signals. More specifically, the speaker direction detection device of the present invention
comprises: linear prediction circuits (3, 4 in FIG. 1), one per microphone, that perform linear
prediction on the input signal from the microphone and generate a linear prediction signal;
subtractors (A, B in FIG. 1), one per microphone, that subtract the linear prediction signal from
the input signal to generate a residual signal from which the influence of the vocal tract has been
removed; an evaluation function calculation circuit (5 in FIG. 1) that detects the maximum of the
evaluation function value for the residual signals using the evaluation function; and a direction
detection circuit (6 in FIG. 1) that, based on the phase difference of the input signals obtained
from that maximum, detects the direction of the
speaker. It is known that the accuracy of speaker direction detection is improved by removing
the influence of the vocal tract. The present invention therefore performs linear prediction on
the input signals from the microphones and applies the autocorrelation function and the
cross-correlation function to the residual signals after linear prediction analysis. Removing the
influence of the vocal tract in this way improves the evaluation result of the evaluation function
defined by the autocorrelation function and the cross-correlation function. Further, the speaker
direction detecting device according to the present invention may perform speech / noise
determination by detecting the short-time average number of zero crossings of the input signals
from the microphones, performing the above speaker direction detection only when speech is
determined and stopping direction detection when ambient noise is determined. This makes it
possible to accurately detect even speech from a speaker relatively far from the microphone.
The speech / noise determination is based on the finding from speech theory that the number of
zero crossings of a signal is relatively small for speech (strictly speaking, only for the voiced part
of speech) and relatively large for noise (and, strictly speaking, also for the unvoiced part of
speech). DESCRIPTION OF THE PREFERRED EMBODIMENTS Next, embodiments of the present
invention will be described with reference to the drawings. FIG. 1 is a block diagram
showing an embodiment of a speaker direction detecting apparatus according to the present
invention. This speaker direction detecting device consists of two linear prediction circuits 3 and
4, two subtractors A and B, an evaluation function calculation circuit 5, a direction detection
circuit 6, a zero-crossing count detection circuit 7 and a speech / noise determination circuit 8.
When an audio signal is input to the microphone 1 and the input signal 1S from the microphone 1
is input to the linear prediction circuit 3, the linear prediction circuit 3 performs linear
prediction on the input signal 1S to generate a linear prediction signal 3S. The subtractor A
subtracts the linear prediction signal 3S from the input signal 1S to generate a residual signal AS
from which the influence of the vocal tract has been removed. Similarly, when the input signal 2S
from the microphone 2 is input to the linear prediction circuit 4, the linear prediction circuit 4
performs linear prediction on the input signal 2S to generate a linear prediction signal 4S. The
subtractor B subtracts the linear prediction signal 4S from the input signal 2S to generate a
residual signal BS from which the influence of the vocal tract has been removed. The residual
signals AS and BS are input to the evaluation function calculation circuit 5. The evaluation
function calculation circuit 5 detects the maximum of the evaluation function for the residual
signals AS and BS using, for example, an evaluation function based on a relational expression of
an autocorrelation function and a cross-correlation function. The direction detection circuit 6
detects the direction of the speaker based on the phase difference between the input signal 1S
and the input signal 2S obtained from that maximum. Further, the zero-crossing count detection
circuit 7 detects the number of zero crossings, that is, the number of times the values of the
input signals 1S and 2S cross zero within a predetermined time. Based on the zero crossing
count information, the speech / noise determination circuit 8 determines whether the audio
signal input to the microphones 1 and 2 from a given sound source is due to noise or due to the
speaker. When it is determined to be noise, the update of the direction detection circuit 6 is
stopped, so that the detected direction does not point toward the position considered to be the
noise source. FIG. 2 is a block diagram showing an embodiment of a speaker direction detection
apparatus according to the present invention. This speaker direction detecting device includes
three memories 14, 15 and 16, four linear prediction circuits 17, 18, 19 and 20, four subtractors
21, 22, 23 and 24, two evaluation function operation circuits 29 and 30, and a direction
detection circuit 31. The memories 14, 15 and 16 hold the input signals 11S, 12S and 13S from
the microphones 11, 12 and 13, respectively.
The microphones 11 and 12 are placed side by side horizontally, as shown in FIG. 3, and are used
to find the horizontal position of the speaker; the microphones 12 and 13 are placed vertically
and are used to find the vertical position of the speaker. The linear prediction circuit 17
receives, through the memory 14, the input signal 11S from the microphone 11; the linear
prediction circuits 18 and 19 receive, through the memory 15, the input signal 12S from the
microphone 12; and the linear prediction circuit 20 receives, through the memory 16, the input
signal 13S from the microphone 13. Each performs linear prediction, generating the linear
prediction signals 17S, 18S, 19S and 20S. In the simplest example, the linear prediction circuits
17 to 20 can be realized by a low-order FIR (Finite Impulse Response) filter, of order several, or
the like. The subtractors 21, 22, 23 and 24 subtract the linear prediction signals 17S, 18S, 19S
and 20S from the signals obtained through the memories 14, 15, 15 and 16, respectively, and
generate the residual signals 21S, 22S, 23S and 24S.
The autocorrelation operation circuit 25 calculates the autocorrelation function value 25S for
the residual signal 22S, and the autocorrelation operation circuit 26 calculates the
autocorrelation function value 26S for the residual signal 23S. The cross-correlation operation
circuit 27 calculates the cross-correlation function value 27S for the residual signals 21S and
22S, which gives the cross-correlation function value for each time difference in the horizontal
direction. The cross-correlation operation circuit 28 calculates the cross-correlation function
value 28S for the residual signals 23S and 24S, which gives the cross-correlation function value
for each time difference in the vertical direction.
The evaluation function operation circuit 29 calculates an evaluation function value 29S
according to the evaluation function, based on the autocorrelation function value 25S and the
cross-correlation function value 27S, and passes it to the direction detection circuit 31. The
direction detection circuit 31 determines the delay at which the evaluation function value 29S is
maximum and takes the direction corresponding to that delay as the horizontal direction. Note
that the maximum value search is performed after the evaluation function value 29S has been
added up over a predetermined time.
Similarly, the evaluation function operation circuit 30 calculates an evaluation function value
30S according to the evaluation function, based on the autocorrelation function value 26S and
the cross-correlation function value 28S, and passes it to the direction detection circuit 31. The
direction detection circuit 31 determines the delay at which the evaluation function value 30S is
maximum and takes the direction corresponding to that delay as the vertical direction.
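The text does not say how "the direction corresponding to the delay" is computed. As an illustration only, the sketch below assumes a far-field source and a two-microphone pair, for which the angle from the broadside direction follows from the path difference. The function name, the speed of sound (343 m/s) and the microphone spacing in the usage note are all assumptions, not values from the source.

```python
import math

def delay_to_angle_deg(delay_samples, sample_rate_hz, mic_spacing_m,
                       speed_of_sound=343.0):
    """Far-field angle from broadside, in degrees, for one microphone pair.

    Assumed geometry: path difference = delay * speed of sound, and
    sin(angle) = path difference / microphone spacing.
    """
    path_difference = delay_samples / sample_rate_hz * speed_of_sound
    # Clamp so that a delay slightly too large for the spacing does not
    # push the ratio outside the domain of asin.
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing_m))
    return math.degrees(math.asin(ratio))
```

With the 16 kHz sampling mentioned later in the description and a hypothetical spacing of 0.2 m, a delay of 3 samples corresponds to a bit under 19 degrees. The same function would be applied once to the horizontal pair and once to the vertical pair.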
Also in this case, the maximum value search is performed after the evaluation function value 30S
has been added up over a predetermined time. The evaluation function operation circuits 29 and
30 detect the maximum of the value calculated by the evaluation function using, for example, an
evaluation function based on a relational expression of an autocorrelation function and a
cross-correlation function, and the direction of the speaker is detected based on the phase
difference obtained. In the above description, the linear prediction circuits 18 and 19, the
subtractors 22 and 23, and the autocorrelation operation circuits 25 and 26 produce the same
results, so a single result may be shared and input to both evaluation function operation circuits
29 and 30. FIG. 4 is a block diagram showing another embodiment of the present invention. In
this embodiment, a zero-crossing count detection circuit 32 and a speech / noise determination
circuit 33 are added to the embodiment shown in FIG. 2. The same reference numerals are
assigned to the same components in FIG. 4 and FIG. 2. The operation is the same as that of the
embodiment of FIG. 2 up to the point where the direction detection circuit 31 detects the
horizontal direction and the vertical direction. In this embodiment, the signals 14S-16S from the
microphones 11-13 stored in the memories 14-16 are input to the zero-crossing count detection
circuit 32, which calculates the number of zero crossings of the signals 14S-16S over a short
time. The speech / noise determination circuit 33 performs speech / noise determination based
on the result. Then, when a noise section is determined, the direction output from the direction
detection circuit 31 is stopped to prevent malfunction due to the influence of the noise source.
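The zero-crossing gate described above can be sketched in Python as follows. The threshold and the function names are hypothetical; the text states only that speech has relatively few zero crossings and noise relatively many, not where the boundary lies.

```python
import numpy as np

def zero_crossings(frame):
    """Count sign changes between consecutive samples of a short frame."""
    signs = np.signbit(frame)
    return int(np.count_nonzero(signs[1:] != signs[:-1]))

def is_speech(frame, max_crossings):
    """Classify a frame as speech when it has few zero crossings.
    max_crossings is a hypothetical tuning threshold."""
    return zero_crossings(frame) <= max_crossings

def update_direction(new_direction, last_direction, frame, max_crossings):
    """Output the new direction only for speech; otherwise hold the previous
    one, which mirrors stopping the direction output during noise."""
    if is_speech(frame, max_crossings):
        return new_direction
    return last_direction
```

Because the decision depends on the count of sign changes rather than on signal power, a quiet but voiced frame from a distant speaker can still pass the gate, which is the motivation given for the second object.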
Description of Operation Next, the operation of the embodiment of FIG. 4 will be described with
reference to the flowchart of FIG. 5. This flowchart shows processing executed independently in
each of the horizontal direction and the vertical direction. First, the count value CNT, which
measures the time over which the evaluation results of the evaluation function operation circuits
29 and 30 are added, is initialized (step S1 in FIG. 5). When voice data is input from the
corresponding microphone (step S2), linear prediction processing (step S3), autocorrelation
calculation (step S4) and cross-correlation calculation (step S5) are performed every 32 to 40
samples at 16 kHz sampling. In these steps (S3, S4 and S5), the audio data stored in the
memories 14 to 16 may be processed all at once in frame units, or the calculation may be divided
into parts and performed sample by sample.
FIG. 5 shows the latter case. In this case, it is determined whether the correlation function
calculation has been completed (step S6), and if it has not (NO in step S6), the process returns
to voice data input (step S2). When the correlation function calculation is completed (YES in
step S6), the evaluation function value is calculated (step S7) based on the results obtained by
the linear prediction processing including generation of the residual signal (step S3), the
autocorrelation calculation (step S4) and the cross-correlation calculation (step S5). This
evaluation function
value is calculated using (square of cross correlation / autocorrelation) as the evaluation
function. Next, in order to statistically average the obtained evaluation function values, the
evaluation result is added to the previous results and accumulated (step S8). The count value
CNT, which measures the accumulation time of the evaluation results, is then incremented by one
(step S9). The updated count value CNT is compared with the preset value MAX (step S10). The
set value MAX may be any value corresponding to about 200 ms to 1 s. When the count value CNT
has not reached MAX (NO in step S10), the process returns to voice data input (step S2). When it
is MAX or more (YES in step S10), the count value CNT is initialized to 0 (step S11), the
maximum value search of the accumulated evaluation result is performed, and the time
difference (delay) at which it is maximum is detected (step S12). Finally, the accumulated
evaluation function value is initialized (step S13) in preparation for the next accumulation. In
addition, since the number of zero crossings of the signals from the microphones 11 to 13 is
detected during this detection period, the speech / noise determination is made based on that
result (step S14), and the direction is calculated from the time difference (delay) (step S16) only
when speech is determined (YES in step S15). For the speech / noise determination (steps S14
and S15), possible methods include treating a section as a speech section only when the input
signals from all of the microphones 11 to 13 are determined to be speech, or treating it as a
speech section when the input signal from any of the microphones 11 to 13 is determined to be
speech. The speaker direction detection method described above can also be performed by
executing a program on a computer that constitutes the speaker direction detection apparatus.
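As a rough illustration of what such a program might look like, the Python sketch below follows steps S1 and S7 to S13 for one microphone pair. It assumes the frames passed in are already the linear prediction residuals from step S3. The function names are hypothetical, and dividing by the zero-lag autocorrelation is an assumption: the text gives the evaluation function only as (square of cross correlation / autocorrelation) without naming the lag.

```python
import numpy as np

def evaluation_values(a, b, max_lag):
    """E[k] = (cross-correlation of a and b at lag k)^2 / autocorrelation of a,
    for k = -max_lag .. max_lag. The zero-lag autocorrelation is assumed."""
    n = len(a)
    auto = np.dot(a, a)
    values = np.zeros(2 * max_lag + 1)
    if auto == 0.0:
        return values  # silent frame: contributes nothing
    for i, k in enumerate(range(-max_lag, max_lag + 1)):
        if k >= 0:
            cross = np.dot(a[:n - k], b[k:])
        else:
            cross = np.dot(a[-k:], b[:n + k])
        values[i] = cross * cross / auto
    return values

def detect_delays(frames_a, frames_b, max_lag, max_count):
    """Yield one delay (in samples) each time max_count frames are accumulated."""
    total = np.zeros(2 * max_lag + 1)   # accumulated evaluation values
    cnt = 0                             # S1: initialise CNT
    for a, b in zip(frames_a, frames_b):            # S2: next block of data
        total += evaluation_values(a, b, max_lag)   # S4-S8: evaluate, accumulate
        cnt += 1                                    # S9
        if cnt >= max_count:                        # S10
            cnt = 0                                 # S11
            yield int(np.argmax(total)) - max_lag   # S12: maximum value search
            total[:] = 0.0                          # S13: reset the accumulator
```

Each yielded delay would then be passed on to the direction calculation (step S16) only if the speech / noise determination of steps S14 and S15 reports speech. With blocks of 32 to 40 samples at 16 kHz, an accumulation time of 200 ms corresponds to roughly 80 to 100 blocks for `max_count`.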
The program controls the computer to perform, for example, processing similar to that shown in
FIG. 5. The present invention is not limited to the above-described embodiments, and it is
apparent that the respective embodiments can be appropriately modified within the scope of the
technical idea of the present invention. As described above, according to the present invention,
the influence of the vocal tract is removed from the input signals from the microphones using the
linear prediction circuits. This has the first effect of improving the accuracy with which the
evaluation function operation circuit detects the phase difference between the input signals
from the plurality of microphones, and thus of reducing the detection error of the speaker
direction. In addition, speech / noise determination based on short-time zero-crossing count
detection of the input signals from the microphones is performed to eliminate the influence of
ambient noise. This has the second effect of reducing malfunction of the speaker direction
detection device and improving its stability.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram showing an embodiment of a speaker direction detection device
according to the present invention.
FIG. 2 is a block diagram showing an embodiment of a speaker direction detection device
according to the present invention.
FIG. 3 shows the arrangement of microphones applied to the embodiments shown in FIGS. 2 and 4.
FIG. 4 is a block diagram showing another embodiment of the speaker direction detecting apparatus
according to the present invention.
FIG. 5 is a flowchart showing an example of the operation of the embodiment shown in FIG. 4.
[Description of reference numerals]
1, 2, 11 to 13: microphones
14 to 16: memories
3, 4, 17 to 20: linear prediction circuits
5, 29, 30: evaluation function calculation circuits
6, 31: direction detection circuits
7, 32: zero-crossing count detection circuits
8, 33: speech / noise determination circuits
21 to 24: subtractors
25, 26: autocorrelation operation circuits
27, 28: cross-correlation operation circuits