JP2009302983
Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2009302983
An echo canceller for preventing echo and howling is intended to suppress echo with less computation and in less time than methods based on adaptive processing, and without restrictions on microphone placement. According to the present invention, the apparatus comprises: a voice input unit 10 in which a plurality of microphones M1 and M2 with directivities in opposite directions are built into one casing; a signal processing unit 20 that transmits the voice captured by the microphone M1 with one directivity of the voice input unit 10 to the other party, and that receives the voice sent from the other party and outputs it to the speaker SP; and a sound source separation unit that separates, based on the difference in signal amount, the voice captured by the microphone M1 with one directivity of the voice input unit 10 from the other party's voice output from the speaker SP and captured by the microphone M2 with the other directivity, and transmits only the former. [Selected
figure] Figure 1
Speech processing apparatus and speech processing method
[0001]
The present invention relates to an audio processing apparatus and an audio processing method applied to loudspeaking call systems such as hands-free telephones and videoconferencing.
[0002]
04-05-2019
1
In a loudspeaking voice communication system such as a teleconference system, the sound collected by the microphone of the far-end device is sent to the near-end device over a given line and emitted by the speaker of the near-end device.
The near-end device is likewise equipped with a microphone and sends the voice of the near-end talker to the far-end device over the line. The sound emitted from the speaker at each of the far end and the near end is therefore also picked up by the local microphone. If no processing is performed, this sound is sent back to the other device, causing a phenomenon called "echo," in which users hear their own voice from the speaker slightly delayed. When the echo grows large, it is picked up by the microphone again, the system loops, and howling results.
[0003]
An echo canceller is known as a device for preventing such echo and howling. In an echo canceller, an adaptive filter is generally used to estimate the impulse response between the speaker and the microphone, and a pseudo echo is generated by convolving that impulse response with the reference signal emitted from the speaker. The echo is then removed by subtracting the pseudo echo from the sound input to the microphone.
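The adaptive-filter approach just described can be sketched in a few lines. The following is a minimal normalized-LMS (NLMS) illustration, not the patent's implementation; the filter length, step size, and signal names are illustrative assumptions.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, taps=64, mu=0.5, eps=1e-8):
    """Echo cancellation with an NLMS adaptive filter (sketch).

    mic : microphone signal containing the echo of `ref`.
    ref : far-end reference signal driving the loudspeaker.
    Returns the error signal, i.e. mic with the pseudo echo subtracted.
    """
    w = np.zeros(taps)                      # estimated impulse response
    out = np.zeros(len(mic))
    for n in range(taps - 1, len(mic)):
        x = ref[n - taps + 1:n + 1][::-1]   # newest reference sample first
        y = w @ x                           # pseudo echo (filter output)
        e = mic[n] - y                      # subtract pseudo echo
        w += mu * e * x / (x @ x + eps)     # normalized LMS update
        out[n] = e
    return out
```

As the text notes, the filter only cancels well once `w` has converged to the true impulse response; any change in the acoustic path restarts that convergence.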
[0004]
Here, the impulse response between the speaker and the microphone changes whenever the acoustic reflection conditions change, for example when a teleconference attendee moves, and it takes some time for the adaptive filter to follow the change and converge. Because an accurate pseudo echo cannot be generated between the change in the system and the convergence of the adaptive filter, a large echo returns during that interval and, in severe cases, howling occurs.
[0005]
Also, the amount of computation of an adaptive filter is generally larger than that of a fast Fourier transform (FFT) or a filter bank, which is a burden in a low-cost system. When used in a large venue such as a gymnasium, the distance from the speaker to the microphone grows and the reverberation time lengthens, so the adaptive filter is known to require a long tap length, and the amount of computation increases further.
[0006]
Further, as a method other than adaptive processing, it is conceivable to apply a sound source separation method such as "SAFIA," described in Non-Patent Documents 1 and 2. "SAFIA" is a method that collects the voices of a plurality of speakers with a plurality of microphones and separates them based on the power difference between the channels.
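The band-wise power comparison underlying this kind of separation can be illustrated with a short STFT sketch. This is a simplified sketch with hypothetical parameter choices, not the SAFIA method of Non-Patent Documents 1 and 2: each frequency bin is assigned to whichever input is stronger there, and only the bins won by the first input are kept.

```python
import numpy as np

def power_mask_separate(x1, x2, nfft=512, hop=256):
    """Band-wise separation by power comparison (simplified sketch).

    x1 : signal in which the target source is dominant.
    x2 : signal in which the interfering source is dominant.
    Keeps each STFT bin of x1 only where x1 is stronger than x2.
    """
    win = np.hanning(nfft)
    y = np.zeros(len(x1))
    for i in range(0, min(len(x1), len(x2)) - nfft, hop):
        X1 = np.fft.rfft(win * x1[i:i + nfft])
        X2 = np.fft.rfft(win * x2[i:i + nfft])
        mask = np.abs(X1) > np.abs(X2)          # bin-wise dominance decision
        y[i:i + nfft] += win * np.fft.irfft(X1 * mask)  # overlap-add
    return y
```

Because the processing is just an FFT, a comparison, and an inverse FFT per frame, the cost argument in the next paragraph follows directly.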
[0007]
When "SAFIA" is used, there is no echo leakage during a convergence period when the system changes, as there is with an adaptive filter. Furthermore, it can be implemented with a filter bank or a Fourier transform, for which computation-reduction methods have already been proposed, so it can be realized with a smaller amount of computation than an adaptive filter.
[0008]
Also, "SAFIA" does not depend on the reverberation time of the room in which it is used, so there is no increase in computation to cope with long reverberation times, as there is with an adaptive filter.
[0009]
Mariko Aoki et al., "Improvement of performance of source separation method SAFIA under reverberation," Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J87-A, No. 9, September 2004, pp. 1171-1186.
Mariko Aoki et al., "Separation and Extraction of Proximity Sound Source under High Noise Using Sound Source Separation Method SAFIA," Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J88-A, No. 4, April 2005, pp. 468-479.
[0010]
However, the main problem in applying "SAFIA" is the installation of the microphones. That is, only one microphone is required when adaptive processing is used, whereas two must be prepared for "SAFIA." In addition, one must be placed near the target talker and the other near the speaker. Therefore, although sound source separation by "SAFIA" has high performance, these restrictions limit the places where it can be used as an echo canceller.
[0011]
The present invention comprises: a voice input unit in which a plurality of microphones with directivities in opposite directions are built into one casing; a signal processing unit that transmits the voice captured by the microphone with one directivity of the voice input unit to the other party, and that receives the voice sent from the other party and outputs it to a speaker; and a sound source separation unit that, from among the voices captured by the voice input unit, separates the voice captured by the microphone with one directivity from the other party's voice output from the speaker and captured by the microphone with the other directivity, based on the difference between their signal amounts, and transmits only the former.
[0012]
According to the present invention, the voice emitted from the speaker and the voice of the talker are captured by the voice input unit in which the plurality of microphones are built into one casing, and are separated according to the directivities of the plurality of microphones. Echo and howling therefore do not occur, and simultaneous two-way calling can be realized in a loudspeaking call system.
In particular, since the voices are captured with a voice input unit in which a plurality of microphones are built into one casing, installation of the microphones is easy.
[0013]
Further, the present invention comprises: an audio input unit that captures audio with a plurality of microphones; a signal processing unit that transmits the audio captured by the audio input unit to the other party, and that receives the audio sent from the other party and outputs it to the speaker; a beamforming unit that sets directivities for the audio captured from the talker side and the audio captured from the speaker side; and a sound source separation unit that, based on the directivities set by the beamforming unit, separates the audio captured from the talker side from the audio captured from the speaker side using the difference between their signal amounts, and transmits only the separated talker-side audio to the other party.
[0014]
According to the present invention, the voice emitted from the speaker and the voice of the talker are captured by the voice input unit and separated according to the directivities set by the beamforming unit, so echo and howling do not occur, and simultaneous two-way communication is realized in the loudspeaking communication system.
In particular, since the directivities of the plurality of microphones are set by the beamforming unit, installation of the microphones is easy.
[0015]
The present invention also includes the steps of: capturing voice with a plurality of microphones; transmitting the captured voice to the other party, and receiving the voice sent from the other party and outputting it to the speaker; setting directivities for the voice captured from the talker side and the voice captured from the speaker side; and separating, using the set directivities, the voice captured from the talker side from the voice captured from the speaker side based on the difference between their signal amounts, and transmitting only the separated talker-side voice to the other party.
[0016]
According to the present invention as described above, the voice emitted from the speaker and the voice of the talker are captured by the voice input unit and separated according to the set directivities, so echo and howling do not occur, and simultaneous two-way calling is realized in the voice communication system.
In particular, since directivities are set for the plurality of microphones, installation of the microphones is easy.
[0017]
According to the present invention, in an echo canceller for preventing echo and howling, echo can be suppressed with less computation and in less time than with methods based on adaptive processing, and the restriction on microphone placement can be eliminated. This makes it possible to solve, easily and accurately, the echo and howling problems that occur in full-duplex calls.
[0018]
Hereinafter, an embodiment of the present invention will be described based on the drawings.
[0019]
<Configuration of Speech Processing Device> FIG. 1 is a block diagram for explaining the configuration of the speech processing device according to the present embodiment.
The voice processing apparatus shown in FIG. 1 is used as a near-end apparatus or a far-end apparatus in a videoconference system.
[0020]
The audio processing apparatus according to the present embodiment includes an audio input unit 10, a signal processing unit 20, a speaker SP, A/D converters 11a and 11b, a D/A converter 12, an audio codec unit 30, and a communication unit 40.
[0021]
The voice input unit 10 has a configuration in which a plurality of microphones with directivities in opposite directions are built into one casing.
In this embodiment, two microphones M1 and M2 are built into one housing; the directivity of one microphone, M1, is directed toward the talker, and the directivity of the other microphone, M2, is directed toward the speaker SP.
[0022]
The signal processing unit 20 is constituted by a digital signal processor (DSP) and converts input and output audio data into the desired form. In particular, in the present embodiment, the signal processing unit 20 performs processing to transmit the voice captured by the microphone M1 with one directivity of the voice input unit 10 to the other party, and to receive the voice sent from the other party and output it to the speaker SP.
[0023]
The signal processing unit 20 is provided with a sound source separation unit, described later.
Of the voices captured by the voice input unit 10, the sound source separation unit separates the voice captured by the microphone M1 with one directivity from the other party's voice output from the speaker SP and captured by the microphone M2 with the other directivity, based on the difference between their signal amounts.
[0024]
The A/D converters 11a and 11b take in the signals from the microphones M1 and M2 and convert the analog audio signals, amplified by amplifiers (not shown), into digital signals at a predetermined sampling rate. The digital audio signals converted by the A/D converters 11a and 11b are sent to the signal processing unit 20.
[0025]
The D/A converter 12 converts the digital signal of the voice from the other party, output from the signal processing unit 20, into an analog signal. The analog audio signal converted by the D/A converter 12 is sent to the speaker SP and output as sound.
[0026]
The voice codec unit 30 performs processing of encoding a digital signal of voice to be sent to
the other party and processing of decoding a digital signal of voice sent from the other party.
[0027]
The communication unit 40 performs signal input/output with the far-end device via a communication line N such as the Internet or a LAN (Local Area Network), and transmits and receives digital signals of encoded voice.
[0028]
Next, the sound source separation unit will be described.
The sound source separation unit uses a method based on SAFIA, one of the sound source separation methods, to separate and extract the voice of only the target talker, and sends only that separated voice to the voice codec unit 30.
[0029]
The basic processing of the sound source separation method SAFIA is described in Non-Patent Documents 1 and 2.
Here, a specific example in which the sound source separation method SAFIA is applied to an echo canceller in the present embodiment will be described.
[0030]
The sound source separation method SAFIA requires as input one signal in which the desired sound (the voice of the target talker) is dominant and one signal in which the sound to be removed (the sound from the speaker) is dominant. This imposes an installation restriction: at least two microphones must be provided, one placed near the talker and the other near the speaker.
[0031]
In the present embodiment, when the sound source separation method SAFIA is applied to an echo canceller, the microphones M1 and M2 are built into one housing in order to eliminate restrictions on microphone installation, and their directivities are set in opposite directions. As a result, the sound source separation method SAFIA can be applied to the echo canceller simply by placing the housing containing the microphones M1 and M2 between the talker and the speaker SP.
[0032]
Also, by placing marks such as arrows indicating the direction of directivity of each of the microphones M1 and M2 on the housing in which they are built, installation of the microphones (housing) can be made more reliable.
[0033]
<Other Speech Processing Device> FIG. 2 is a block diagram for explaining another speech processing device according to the present embodiment.
This audio processing apparatus also includes an audio input unit 10, a signal processing unit 20, a speaker SP, A/D converters 11a and 11b, a D/A converter 12, an audio codec unit 30, and a communication unit 40.
[0034]
The voice input unit 10 captures voice with a plurality of microphones M1 and M2. In the audio input unit 10 of the present embodiment, the plurality of microphones M1 and M2 may be built into one housing as described above, or may be provided independently.
[0035]
The signal processing unit 20 is constituted by a digital signal processor (DSP) and converts input and output audio data into the desired form. In particular, in the present embodiment, it is provided with a beamforming unit (described later) that sets, for the voice captured by the voice input unit 10, directivities for the voice captured from the talker side and the voice captured from the speaker SP side.
[0036]
In addition, the signal processing unit 20 is provided with a sound source separation unit (described later). According to the directivities set by the beamforming unit, the sound source separation unit separates the voice captured from the talker side from the voice captured from the speaker side using the difference between their signal amounts. It then transmits only the separated talker-side voice to the other party.
[0037]
The A/D converters 11a and 11b take in the signals from the microphones M1 and M2 and convert the analog audio signals, amplified by amplifiers (not shown), into digital signals at a predetermined sampling rate. The digital audio signals converted by the A/D converters 11a and 11b are sent to the signal processing unit 20.
[0038]
The D/A converter 12 converts the digital signal of the voice from the other party, output from the signal processing unit 20, into an analog signal. The analog audio signal converted by the D/A converter 12 is sent to the speaker SP and output as sound.
[0039]
The voice codec unit 30 performs processing of encoding a digital signal of voice to be sent to
the other party and processing of decoding a digital signal of voice sent from the other party.
[0040]
The communication unit 40 performs signal input/output with the far-end device via a communication line N such as the Internet or a LAN (Local Area Network), and transmits and receives digital signals of encoded voice.
[0041]
Next, the sound source separation unit will be described.
The sound source separation unit separates and extracts the voice of only the target talker by a method based on SAFIA, one of the sound source separation methods, and sends only that separated voice to the voice codec unit 30.
The basic processing of the sound source separation method SAFIA is described in Non-Patent Documents 1 and 2.
[0042]
In the present embodiment, based on the directivities set by the beamforming unit, the voice of only the target talker is separated and extracted from the captured voice signals by the sound source separation method SAFIA. Therefore, by setting directivities in the beamforming unit, a state equivalent to a desired microphone arrangement can be created without imposing restrictions on the actual arrangement of the plurality of microphones M1 and M2.
[0043]
In addition, since the directivities of the microphones M1 and M2 are set by the beamforming unit, a desired microphone input can be obtained not only when using the voice input unit 10 in which the microphones M1 and M2 are built into one housing, as described above, but also when using a plurality of independently installed microphones M1 and M2.
[0044]
FIG. 3 is a block diagram for explaining the configuration of a signal processing unit provided with beamforming units.
The signal processing unit 20 is provided with a beamforming unit A21a, a beamforming unit B21b, and a sound source separation unit 22. The signal from the A/D converter 11a, which converts the sound captured by the microphone M1 shown in FIG. 2 into digital data, is input to the beamforming unit A21a. Likewise, the signal from the A/D converter 11b, which converts the sound captured by the microphone M2 shown in FIG. 2 into digital data, is input to the beamforming unit B21b.
[0045]
Here, the beamforming unit A21a is set so that directivity toward the voice captured from the talker side is strong, and the beamforming unit B21b is set so that directivity toward the voice captured from the speaker SP is strong. As a result, a signal in which the talker-side voice is dominant is input from the beamforming unit A21a to the sound source separation unit 22, and a signal in which the voice from the speaker SP is dominant is input from the beamforming unit B21b to the sound source separation unit 22.
[0046]
The sound source separation unit 22 separates out only the talker-side voice by the sound source separation method SAFIA, using the difference between the signal amounts of the signals sent from the beamforming unit A21a and the beamforming unit B21b, and transmits only the separated talker-side voice to the voice codec unit 30.
[0047]
<Audio Processing Method> Next, an audio processing method using this audio processing device will be described.
First, the signals from the two microphones M1 and M2 shown in FIG. 2 are sent via the A/D converters 11a and 11b to the signal processing unit 20 as digital data, for example 16-bit PCM sampled at 48 kHz.
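As a minimal illustration of this front end, raw 16-bit PCM of the kind mentioned above could be unpacked into overlapping analysis frames as follows; the frame and hop sizes are illustrative assumptions, not values from the patent.

```python
import numpy as np

def pcm16_to_frames(raw, frame=512, hop=256):
    """Unpack little-endian 16-bit PCM bytes (e.g. 48 kHz mono) into
    overlapping float frames scaled to [-1, 1) for block processing."""
    x = np.frombuffer(raw, dtype='<i2').astype(np.float64) / 32768.0
    starts = range(0, len(x) - frame + 1, hop)
    return np.stack([x[s:s + frame] for s in starts])
```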
[0048]
These two microphone-input audio signals are then input to the two beamforming units A21a and B21b shown in FIG. 3, respectively.
[0049]
The beamforming unit A21a outputs the voice obtained when directivity is directed toward the target talker, and the beamforming unit B21b outputs the voice obtained when directivity is directed toward the speaker SP.
[0050]
Many methods have been proposed for beamforming.
Commonly used methods include the delay-and-sum method and the adaptive beamformer; any beamforming method may be used.
These methods can also be extended to more microphones, and two or more microphones may be used to direct directivity toward the target talker and the speaker. In practice, using more microphones forms sharper directivity.
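The delay-and-sum method mentioned above amounts to time-aligning the channels for the desired direction and averaging them. The following is a minimal sketch with integer sample delays; a practical unit would derive fractional delays from the array geometry.

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Steer toward a direction by delaying each channel and averaging.

    signals        : list of equal-length 1-D arrays, one per microphone.
    delays_samples : integer per-channel delays that time-align the
                     target direction across the channels.
    """
    out = np.zeros(len(signals[0]))
    for sig, d in zip(signals, delays_samples):
        # np.roll wraps around at the edges; acceptable for a sketch
        out += np.roll(sig, -d)
    return out / len(signals)
```

Signals arriving from the steered direction add coherently, while sounds from other directions are attenuated by the averaging, which is how each beamforming unit makes its assigned source dominant.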
[0051]
Next, these two output voices are sent to the sound source separation unit 22, where the voice of only the target talker is separated and extracted by the method based on the sound source separation method SAFIA and sent to the voice codec unit 30. The basic processing of the sound source separation method SAFIA is described in Non-Patent Documents 1 and 2.
[0052]
When the sound source separation method SAFIA is applied to an echo canceller, the sound source separation unit 22 must be given one input in which the sound to be kept (the voice of the target talker) is dominant and one input in which the sound to be canceled (the sound from the speaker) is dominant. In the present embodiment, by forming directivities with the beamforming unit A21a and the beamforming unit B21b, an input in which the voice of the target talker is dominant and an input in which the sound from the speaker SP is dominant are obtained. Thus, the arrangement of the microphones M1 and M2 can be set equivalently by adjusting the directivities, without imposing strict restrictions on their physical arrangement.
[0053]
In particular, by setting directivities through beamforming with an audio input unit in which the plurality of microphones M1 and M2 are built into one casing, the user can handle the unit as if it were a single microphone.
[0054]
By applying the sound source separation method SAFIA to the echo canceller in this way, there is no echo leakage during a convergence period when the system changes, as there is with an adaptive filter. Furthermore, it can be implemented with a filter bank or a Fourier transform, for which computation-reduction methods have already been proposed, so it can be realized with a smaller amount of computation than an adaptive filter.
[0055]
Further, in the present embodiment, there is no increase in computation to cope with long reverberation times, as there is with an adaptive filter, regardless of the reverberation time of the room in use. Moreover, it becomes possible to remove the restriction on microphone arrangement that is a problem when the sound source separation method SAFIA is applied to an echo canceller as-is.
[0056]
<Method of Learning Beamforming> Next, a method of learning the directivities of beamforming will be described. FIG. 4 is a flowchart for explaining this method.
First, as shown in step S1, the S/N ratio of the speaker output is calculated.
[0057]
Next, as shown in step S2, the S/N ratio of the microphone input is calculated. At this time, the S/N ratio is calculated after averaging the amplitude values of the plurality of microphones.
[0058]
Next, as shown in step S3, the S/N ratio of the speaker output is compared with a predetermined threshold. If it is equal to or greater than the threshold, it is determined that sound is being emitted from the speaker, and, as shown in step S4, the beamforming unit A21a is trained so as to minimize its output. That is, the directivity of the beamforming unit A21a is learned so as to suppress the sound of the speaker.
[0059]
On the other hand, if it is determined in step S3 that the S/N ratio of the speaker output is below the threshold, the process proceeds to step S5, where the S/N ratio of the microphone input is compared with a predetermined threshold. If the S/N ratio of the microphone input is equal to or greater than the threshold, it is determined that only the target voice is present, and, as shown in step S6, the beamforming unit B21b is trained so as to minimize its output. That is, the directivity of the beamforming unit B21b is learned so as to suppress the target voice.
[0060]
On the other hand, if it is determined in step S5 that the S/N ratio of the microphone input is below the threshold, it is determined that neither the target voice nor the speaker output is present, and neither the beamforming unit A21a nor the beamforming unit B21b is trained.
[0061]
Next, the output voice of the beamforming unit A21a is calculated in step S7, the output voice of the beamforming unit B21b is calculated in step S8, and both are sent to the sound source separation unit as shown in step S9.
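The decision logic of steps S1 through S9 can be summarized in code. The threshold values and the return labels below are illustrative assumptions; the actual adaptation of the beamformer weights (steps S4 and S6) is omitted.

```python
def choose_learning_target(snr_speaker_out, snr_mic_in,
                           thr_speaker=10.0, thr_mic=10.0):
    """Decide which beamforming unit to adapt (sketch of steps S1-S9).

    Returns 'A' to adapt unit A21a (suppress the loudspeaker sound),
    'B' to adapt unit B21b (suppress the target talker's voice),
    or None when neither source is judged active.
    """
    if snr_speaker_out >= thr_speaker:  # step S3: loudspeaker is active
        return 'A'                      # step S4: minimize A21a output
    if snr_mic_in >= thr_mic:           # step S5: only target voice present
        return 'B'                      # step S6: minimize B21b output
    return None                         # neither active: skip learning
```

In either active case, minimizing the chosen unit's output steers a null toward the currently dominant source, which is exactly the suppression the flowchart describes.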
[0062]
FIG. 1 is a block diagram explaining the configuration of the speech processing device according to this embodiment.
FIG. 2 is a block diagram explaining the configuration of the other speech processing device according to this embodiment.
FIG. 3 is a block diagram explaining the configuration of a signal processing unit provided with beamforming units. FIG. 4 is a flowchart explaining the method of learning the directivities of beamforming.
Explanation of Reference Signs
[0063]
DESCRIPTION OF SYMBOLS 10: Voice input unit, 20: Signal processing unit, 30: Voice codec unit, 40: Communication unit, N: Communication line, M1: Microphone, M2: Microphone, SP: Speaker