DESCRIPTION JP2007006353
The present invention provides a microphone array capable of performing delay addition
processing without requiring interpolation calculation. A microphone array according to the
present invention arranges adjacent microphones at intervals of an integral multiple of the
distance the sound travels in one sampling interval. That is, when the sampling interval of the
sound is T, the sound velocity is v, and k is an integer, the microphones are arranged at the
interval d = kvT. The microphone array configured in this way is then directed toward the
target sound source. [Selected figure] Figure 4
Microphone Array
[0001]
The present invention relates to the structure of a microphone array, and more particularly, to
the structure of a microphone array capable of eliminating interpolation processing required
when performing delay addition processing.
[0002]
It can be said that the current speech recognition system has reached a level that can be
practically used if a close-talking microphone is used in an ideal environment.
However, when receiving voice uttered away from the microphone in a real environment, the
S/N ratio deteriorates due to attenuation of the sound energy, background noise, and room
reverberation, and the recognition rate drops rapidly. For this reason, a practical speech
recognition system is required to be robust against noise and sound reflected from walls.
[0003]
In systems that perform speech recognition, a microphone array is widely used (Non-Patent
Document 1, Patent Document 1, etc.). In such a microphone array, the microphones are
arranged in a straight line or in a plane at regular intervals, and only the sound arriving from
the target direction can be emphasized. The principle of this microphone array is shown in
FIG. 3. In FIG. 3, Ma and Mb are microphones disposed on the main axis at an interval d. When
a sound wave arrives at the array from infinity at an angle θ, the difference in arrival time at
adjacent microphones is d cos θ / v (where v is the speed of sound). Therefore, if the signals
received by the microphones are added while being shifted by this arrival time difference, only
the signal from the designated angle θ is emphasized. J. L. Flanagan, J. D. Johnston, R. Zahn
and G. W. Elko, "Computer-Steered Microphone Arrays for Sound Transduction in Large
Rooms", J. Acoust. Soc. Am., Vol. 78, No. 5, pp. 1508-1518, 1985.
[0004]
When performing signal processing with such a microphone array, the processing is generally
digital. However, when signals are acquired digitally, only discrete samples taken at the
sampling period can be obtained, and the points at which the signals received by the respective
microphones should be added do not necessarily coincide with the sampling points. This state
is shown in FIG. 8: the signal received by the microphone Ma (FIG. 8(a)) is received after the
microphone Mb by the delay time d cos θ / v. At this time, the values at the sampling times of
the waveform at the microphone Ma correspond to the circles for the microphone Mb in the
lower diagram (b). However, since the sampling times at the microphone Mb do not in general
coincide with the times at which the addition should be performed, the values to be added
must be interpolated from the sampled values. Methods exist for performing such interpolation
at high speed, but the processing nevertheless takes time and effort, and the calculation
inevitably introduces errors, which can lower the speech recognition rate.
[0005]
Therefore, the present invention has been made in view of the above problems, and an object of
the present invention is to provide a microphone array that can perform delay addition
processing without requiring interpolation calculation.
[0006]
That is, in a microphone array in which a plurality of microphones are arranged in a straight
line, the present invention solves the above-mentioned problems by making the spacing
between adjacent microphones an integer multiple of the distance the sound travels in one
sampling interval; that is, the microphones are placed at the spacing d = kvT, where k is an
integer, v is the speed of sound, and T is the sampling interval.
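As a concrete illustration (our own numbers, chosen to match the experimental values reported
later in the text): at a 16 kHz sampling rate and with an assumed sound speed of 340 m/s, the
spacing for k = 2 works out to the d = 4.25 cm used in the experiments.

```python
v = 340.0        # speed of sound in m/s (assumed value)
fs = 16000.0     # sampling rate in Hz, so T = 1/fs
T = 1.0 / fs
for k in (1, 2, 3):
    d = k * v * T                      # microphone spacing d = kvT
    print(f"k={k}: d = {100 * d:.3f} cm")
# k=2 gives d = 4.250 cm, the 2vT spacing used in the experiments below.
```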
[0007]
By arranging the microphones at such an interval and directing the microphone array toward
the target sound source, all of the sampling points of the microphones can be made to coincide,
as shown in FIGS. 8(c) and 8(d), and the interpolation calculation otherwise required when
performing the addition processing can be eliminated.
As a result, the robustness of the sound processing can be secured and, ultimately, the
recognition rate of the target voice can be improved.
[0008]
Also, in order to detect the arrival direction instantaneously or in a short time in the initial state,
a plurality of microphones are provided in directions orthogonal to each other with respect to the
linear direction in which the microphones are arranged.
[0009]
In general, when microphones are arranged only along the main axis, even if the incident angle
θ of the sound wave with respect to the axial direction of the microphone array is known, it
cannot be determined from which direction on the cone of apex angle θ about the main axis
the sound arrives.
On the other hand, as described above, if microphones are also provided in directions
orthogonal to each other, three cones indicating the direction of the sound source, one for each
axis, can be obtained, and the direction of the sound source can be uniquely identified from the
intersection of these cones.
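As a side note in our own notation (not the patent's): writing θx, θy, θz for the angles measured
from the three orthogonal axes, each measured angle constrains the source direction to a cone,
and the direction cosines tie the three angles together, which is why the cones intersect along a
well-defined ray:

$$ \mathbf{u} = (\cos\theta_x,\ \cos\theta_y,\ \cos\theta_z), \qquad
\cos^2\theta_x + \cos^2\theta_y + \cos^2\theta_z = 1 $$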
[0010]
The microphone array of the present invention arranges the microphones at the interval
d = kvT, where T is the sound sampling interval, v is the sound velocity, and k is an integer.
Therefore, by directing the main axis of this microphone array toward the target sound source,
it is possible to eliminate the interpolation calculation that was previously necessary when
performing the delay addition processing.
As a result, the robustness of the sound processing can be secured and, ultimately, the
recognition rate of the target voice can be improved.
[0011]
Hereinafter, a microphone array according to an embodiment of the present invention will be
described with reference to the drawings.
[0012]
FIG. 1a shows a perspective view of the microphone array 2 in the present embodiment, FIG. 1b
shows a side view thereof, and FIG. 1c shows a plan view.
In FIG. 1, at least two microphones are provided on the main axis (taken as the X-axis), and at
least one microphone is also provided in each of the Y-axis and Z-axis directions orthogonal to
the main axis. The microphones are provided at equal intervals satisfying Equation 1 described
later. As a whole, the direction of the main axis can be directed toward the sound source.
[0013]
FIG. 2 shows a functional block diagram of the microphone device 1 to which the microphone
array is connected. In FIG. 2, 1 is the microphone device, 2 is the microphone array, 3 is an A/D
converter, and 4 is a delay addition processing unit. A signal received by the microphone array 2
is converted into a digital signal by the A/D converter 3, subjected to delay addition processing
by the delay addition processing unit 4, and then output to the speech recognition device 5,
which performs the speech recognition.
[0014]
Next, the structure of the microphone array 2 will be described. When sound waves arrive at
the microphone array 2 configured as described above from infinity at an angle θ, each sound
wave is received by each microphone with a time difference corresponding to the path length
L (= d cos θ), as shown in FIG. 3. If the correct sound source direction is known and the adjacent
microphones are arranged at intervals of an integral multiple of the distance the sound travels
in one sampling interval, the interpolation calculation required by conventional delay addition
processing can be eliminated. The following two conditions are required to eliminate the
interpolation processing: (1) the microphones are linearly arranged at intervals of an integer
multiple of the distance the sound travels in one sampling interval (d = kvT, where k is an
integer, v is the sound velocity, and T is the sampling interval); (2) the microphone array 2 is
correctly directed at the target sound source.
[0015]
If these conditions are satisfied, the difference between the times at which the delay addition
should be performed and the sampling times disappears, so the interpolation calculation
becomes unnecessary. That is, if the microphones are arranged at the above intervals, the delay
time of the signal received by each microphone becomes an integral multiple of the sampling
interval, as shown in the lower diagrams (c) and (d) of FIG. 8; the sampling times and the times
to be added coincide, and no interpolation calculation is needed.
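A minimal sketch of the resulting interpolation-free delay addition, assuming the spacing is
exactly d = kvT and the main axis points straight at the source, so microphone m lags the
reference by exactly m·k samples (function and variable names are ours):

```python
import numpy as np

def integer_delay_sum(signals, k):
    """Delay addition when d = k*v*T and the array is aimed at the source:
    microphone m lags the reference (row 0) by exactly m*k samples, so
    alignment is a pure integer shift with no interpolation."""
    num_mics, length = signals.shape
    out = np.zeros(length)
    for m in range(num_mics):
        n = m * k                                 # integer delay in samples
        shifted = np.zeros(length)
        shifted[:length - n] = signals[m, n:]     # advance row m by n samples
        out += shifted
    return out / num_mics
```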
[0016]
Next, a method of orienting the microphone array 2 in the direction of the target sound source
will be described. In order to direct the main axis toward the target sound source, the sound
source direction must be estimated. In this embodiment, the CSP (Cross Spectrum Phase
analysis) method is first used to estimate the sound source direction. This method can be
realized with only two microphones and requires little computation. The CSP method computes
the following CSP function from the signals sa(n) and sb(n) received by the microphones Ma
and Mb shown in FIG. 3, finds the time lag k1 with the strongest correlation (delay τ1 = k1T),
and from it obtains the sound source direction θ1. Specifically, Ca,b(k), k1, and θ1 are
calculated by the following equations.
[0017]
$$ C_{a,b}(k) = \mathrm{IDFT}\!\left[ \frac{S_a(\omega)\, S_b^{*}(\omega)}{|S_a(\omega)|\,|S_b(\omega)|} \right] \qquad \text{(Equation 1)} $$
[0018]
$$ k_1 = \operatorname*{arg\,max}_{k}\, C_{a,b}(k) \qquad \text{(Equation 2)} $$
[0019]
$$ \theta_1 = \cos^{-1}\!\left( \frac{v\, k_1 T}{d} \right) \qquad \text{(Equation 3)} $$
where Sa(ω) and Sb(ω) are the Fourier transforms of sa(n) and sb(n) and * denotes the complex
conjugate. (The equation images did not survive extraction; these are the standard CSP-method
formulas implied by the descriptions in paragraphs [0016] and [0020].)
[0020]
That is, the signals received by the two microphones Ma and Mb are Fourier-transformed, the
CSP function is obtained as a cross-correlation of the phases, the time lag k1 at which the CSP
function is largest (the lag with the strongest correlation) is found, and from it the direction of
arrival is estimated.
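A compact sketch of this CSP computation (equivalently, GCC-PHAT), under the same notation
as above; the implementation and names are ours, not the patent's:

```python
import numpy as np

def csp_direction(sa, sb, d, fs, v=340.0):
    """Estimate the direction of arrival with the CSP method.

    sa, sb: equal-length signals from microphones Ma and Mb;
    d: spacing in metres; fs: sampling rate in Hz; v: speed of sound."""
    n = len(sa)
    Sa, Sb = np.fft.rfft(sa, 2 * n), np.fft.rfft(sb, 2 * n)
    cross = Sa * np.conj(Sb)
    cross /= np.abs(cross) + 1e-12           # phase-only normalisation
    csp = np.fft.irfft(cross)                # CSP function C_ab(k)
    max_lag = int(d / v * fs)                # physically possible lags only
    lags = np.r_[np.arange(0, max_lag + 1), np.arange(-max_lag, 0)]
    k1 = lags[np.argmax(csp[lags])]          # lag with strongest correlation
    theta1 = np.degrees(np.arccos(np.clip(v * k1 / fs / d, -1.0, 1.0)))
    return k1, theta1
```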
[0021]
Further, as shown in FIG. 4, when the microphones are arranged in a straight line, even if the
angle θ1 is found by Equation 3, it is not clear from which direction on the circumference
around the microphone array 2 the sound is coming.
[0022]
Therefore, microphones are also provided in the Y-axis and Z-axis directions orthogonal to the
main axis, and the angles θ, θy, and θz are estimated in the same way (θ is the angle formed
with the X-axis, and θy and θz are the angles formed with the Y-axis and Z-axis, respectively).
Then, three cones indicating the direction of the sound source, as shown in FIG. 4, are obtained,
and the straight line through their intersection can be determined.
The sound source lies in the direction of this straight line, and the direction of arrival of sound
coming from any direction can be obtained with a single data acquisition.
[0023]
When there is a single sound source, the CSP method is a very good way to estimate the sound
source direction.
However, when one target sound source must be found among a plurality of sound sources, it is
very difficult to determine by sound processing alone whether the estimated sound source is
the target sound or noise.
Therefore, the discrimination of the sound source is performed by image processing. Since the
direction θ1 obtained from the lag k1 at which the CSP function Ca,b(k) is maximum is not
necessarily the target sound source direction, the lag km giving the m-th largest correlation
(m ≤ the number of sound sources) is also determined.
When the number of sound sources is unknown, all time lags for which the CSP exceeds a
certain value (empirically 0.2) are obtained. Here, the sample points k1 ± 1 immediately before
and after the sample point k1 with the largest correlation may also show a large correlation;
therefore, a sample point giving the next largest correlation is ignored unless it is separated
from k1 by more than one sample. In general, the sound source direction is determined by the
delay time and the microphone interval; assuming that the spacing of the three-dimensionally
arranged microphones is d = 10 cm, the maximum error of the sound source direction
determined from the CSP function Ca,b(k) under the above conditions is about 10 degrees.
[0024]
When the microphone spacing is fixed at kvT, the sound coming from a sound source on the
main axis of the microphone array 2 can be delay-added without interpolation processing.
However, although the above method gives the approximate direction of the target sound
source, errors occur in the estimation of θ, θy, and θz. Therefore, in order to perform delay
addition processing without interpolating the sound received by the microphones, the sound
source direction must be estimated more accurately.
[0025]
Therefore, in the present embodiment, a camera is used, and the direction of the target sound
source is accurately captured from the difference information between images captured at
short time intervals. It is assumed that the microphone array 2 already roughly faces the
direction of the target sound source as a result of the sound processing described above, that a
person is in the field of view of the camera, and that this person is the speaker of the target
voice. Since the speaker is speaking, at least his or her mouth is moving, so difference
information can be obtained. It is also assumed that the speaker is within several meters of the
microphone and that the sound environment has little reverberation. As shown in FIG. 5, the
size of the image captured by the camera is taken to be I pixels horizontally and J pixels
vertically (i ∈ I, j ∈ J).
[0026]
Let the image G taken by the camera at time t be expressed as a function G(i, j, c, t) of the
coordinates (i, j), the color c (c = r, g, b), and the time t. The difference image D(i, j) is then
defined by the following equation (the time t is omitted):
[0027]
$$ D(i, j) = \sum_{c \in \{r, g, b\}} \bigl|\, G(i, j, c, t) - G(i, j, c, t - \Delta t) \,\bigr| \qquad \text{(Equation 4)} $$
(The equation image did not survive extraction; this is the difference-image definition implied
by paragraph [0026].)
[0028]
When D(i, j) exceeds a predetermined threshold value α, a difference is said to be present at
the coordinates (i, j).
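A minimal sketch of this thresholded difference image and of the centre of gravity used in the
following steps, assuming RGB frames as numpy arrays (the function names are ours):

```python
import numpy as np

def threshold_difference(frame_now, frame_prev, alpha):
    """Per-pixel colour difference D(i, j) between two RGB frames of shape
    (J, I, 3), summed over the colour channels, thresholded at alpha.

    Returns D and a boolean mask marking pixels where D(i, j) >= alpha."""
    D = np.abs(frame_now.astype(float) - frame_prev.astype(float)).sum(axis=2)
    return D, D >= alpha

def centroid(mask):
    """Centre of gravity (ic, jc) of the threshold difference image."""
    js, is_ = np.nonzero(mask)      # row indices are j, column indices are i
    return is_.mean(), js.mean()
```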
[0029]
Based on the difference image D, the direction of the speaker can be estimated as follows.
(1) From the common area of the three cones, roughly estimate the direction of the target sound
source, and turn the direction of the main axis toward the target sound source.
(2) Using a camera mounted in the axial direction of the microphone array 2, a difference D (i, j)
between two still images captured at an interval of Δt is obtained.
(3) Fine-adjust the orientation of the microphone array 2 so that the center of gravity of the
image (referred to as a threshold difference image) composed of pixels having differences equal
to or greater than a threshold is at the center of the camera field of view.
[0030]
When a human moves, taking the difference between the images, the uppermost position at
which D exceeds the threshold (Equations 5 and 6) is considered to be the crown of the
person's head. Let the center of gravity of the threshold difference image, given by Equation 7,
be (ic, jc). Since the person and the camera are several meters apart, the head of the person is
considered to fall within a range of 10 degrees downward and 5 degrees left and right of
(i0, j0).
[0031]
$$ j_0 = \min \{\, j \mid D(i, j) \ge \alpha \ \text{for some } i \,\} \qquad \text{(Equation 5)} $$
[0032]
$$ i_0 \in \{\, i \mid D(i, j_0) \ge \alpha \,\} \qquad \text{(Equation 6)} $$
[0033]
$$ (i_c, j_c) = \frac{1}{N} \sum_{D(i, j) \ge \alpha} (i, j), \qquad N = \#\{ (i, j) \mid D(i, j) \ge \alpha \} \qquad \text{(Equation 7)} $$
(The equation images did not survive extraction; these are reconstructions of the crown
position and centre of gravity implied by paragraph [0030].)
[0034]
When the horizontal viewing angle of the camera is, for example, 45 degrees, dividing the
image captured by the camera into three equal parts means that each part covers a horizontal
angle of about 15 degrees.
The value of 15 degrees is used because the direction estimation error of the CSP method is
within 4.6 degrees on average, so when the microphone array 2 is turned toward the sound
source direction determined by the acoustic processing, the speaker captured by the camera is
likely to appear near the center of the image.
Therefore, the contour of the face (obtained by connecting the points (i, j) such that
D(i, j) ≥ α) can be expected to appear as a threshold difference in the central portion of the
difference image. Further, it is assumed that a person is present in the direction in which the
threshold difference appears. This person is not necessarily the speaker. That is, when the
microphone array 2 is directed toward a sound source direction θp (p is an integer) obtained
by the acoustic processing and the difference is taken, if no threshold difference of pixels
exceeding the threshold appears in the central portion of the image (the 15-degree region) and
the threshold difference image instead appears at both ends, it can be said that the person
shown on the screen is not the speaker.
[0035]
The microphone array 2 is then pointed accurately at the speaker's mouth by changing its
orientation so as to close the gap between the mouth direction determined by the above
method and the center (I/2, J/2) of the image taken by the camera. An operation screen of a
microphone array direction correction system using image processing is shown in FIG. 6.
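A sketch of the implied correction step, using a simple linear mapping from pixel offset to
pan/tilt angle; the formula and the vertical field of view are our assumptions (an undistorted
camera with the stated 45-degree horizontal view):

```python
def correction_angles(ic, jc, I, J, h_fov_deg=45.0, v_fov_deg=34.0):
    """Pan/tilt corrections (degrees) that move the centre of gravity
    (ic, jc) of the threshold difference image to the image centre
    (I/2, J/2), assuming angle varies linearly across the field of view."""
    pan = (ic - I / 2.0) / I * h_fov_deg
    tilt = (jc - J / 2.0) / J * v_fov_deg
    return pan, tilt
```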
[0036]
Screen A in FIG. 6 is a real-time image captured by a camera installed in the same direction as
the microphone array 2; screens B and C are images captured a few seconds apart (screen C
being the later one); and screen D is the threshold difference image of screens B and C, showing
the pixels at or above the threshold α. Finally, the difference between the center of gravity
determined from the difference image and the center of the image taken by the camera is
indicated at E as an angle.
[0037]
The noise source is assumed to generate noise without moving (for example, a PC or an air
conditioner). Also, people who make no loud sounds and do not speak are not regarded as
speakers. First, the speaker is asked to say something (a short word such as "here"), the
direction of arrival of the sound is estimated using the three-dimensionally arranged
microphones, and the microphone array 2 is oriented in the estimated direction. However, as
described above, there is a possibility that the microphone array 2 faces in the direction of a
noise source. It is therefore determined by image processing (difference information) whether
a person is present in the direction of the main axis. If it is found that there is no person, the
direction of the main axis is changed again. If the main axis is directed toward the speaker, the
image processing (difference information) is used to point the main axis at the speaker
accurately. Finally, the speaker is asked to utter the word to be recognized again. Each step of
the procedure (steps 1 to 10) will be briefly described with reference to the flowchart.
[0038]
Steps 1-2: In order to grasp the approximate position of the speaker, the speaker's voice is
received by the microphones, and the time delay k1T at which the CSP is maximized is detected.
However, the maximum time delay k1T obtained by the CSP method does not necessarily
correspond to the signal from the target sound source (of which there is one); it may
correspond to a noise signal, so further candidate delays are also obtained. The number of
candidates p determined in step 2 is as described above.
[0039]
Step 3: Determine the sound source direction θp. Initially, the time difference for the signal
from the sound source to reach each microphone is taken to be k1T, the time difference at
which the CSP is maximum (p = 1). The sound source direction θ1 is determined from this delay
by Equation 3, and the microphone array 2 is directed toward that direction.
[0040]
Steps 4 to 7: If no difference information is obtained in the image D of FIG. 6, the direction
currently faced is that of a noise source, not the target sound source. Therefore, another
estimated sound source direction θ2 is calculated from the time difference Δτ2 giving the
second largest correlation (p ← p + 1), and the main axis is directed in the θ2 direction. The
process is repeated in this way until a threshold difference image appears between the
captured time-series images.
[0041]
Step 8: If a set of points with D(i, j) ≥ α between images temporally offset by Δt appears near
the center of the image, the direction of the main axis is fine-tuned so that the center of gravity
(ic, jc) of those points comes to the center of the image. As a result, the microphone array 2 is
correctly directed at the target sound source.
[0042]
Steps 9-10: The speaker is asked to speak again, and the voice received by the microphones is
delay-added. Once the microphone array 2 has been oriented in the direction of the speaker,
the target does not move much afterwards, so the target can be tracked by image processing.
Here, since the microphone interval is fixed at 2vT (the distance for two sampling intervals),
the signals are simply shifted by 2 × n sampling points relative to the microphone closest to
the sound source as reference and then added, so no interpolation calculation is necessary.
Here, n is the number of intervals counted from the last microphone (n = 1 to 7). Finally, the
speech recognition system is used to calculate the recognition rate.
[0043]
As described above, according to the present embodiment, since the distance between the
microphones is set to an integral multiple of the distance that the sound travels in one sampling
interval and the microphone array 2 is oriented in the direction of the sound source, the
interpolation calculation previously required during delay addition processing can be
eliminated.
[0044]
The present invention is not limited to the above embodiment, and can be implemented in
various aspects.
[0045]
That is, for example, in the above embodiment microphones are also provided in the Y-axis and
Z-axis directions, but these serve only to estimate the sound source direction; if the direction of
the target sound source can be accurately identified by other means, the microphones need
only be arranged on the main axis.
[0046]
In order to investigate the effect of eliminating the interpolation processing, the case in which
the interpolation processing normally required for delay addition is performed is compared
with the case in which no interpolation is needed.
Specifically, as shown in FIG. 9, a method requiring no interpolation (the microphone array is
correctly directed at the target sound source, and the microphones are arranged at the spacing
kvT for an integer k) is used to investigate the effect of excluding the interpolation processing.
[0047]
In order to examine how much the interpolation processing affects the recognition accuracy,
the recognition rate is compared between the case where interpolation is required (A) and the
case where interpolation is excluded (B).
(A) The microphones are linearly arranged at equal intervals, and the sound source direction is
set variously. (B) The microphones are linearly arranged at equal intervals (intervals that are
an integral multiple of the distance the sound travels in one sampling interval), and the
microphone array is correctly directed at the target sound source.
[0048]
Two cases, A1 and A2, are assumed as cases where interpolation processing is required. In A1,
the target sound source is forcibly assumed to lie in the direction of the time difference at
which the CSP is maximized; in A2, the noise source and the target sound source are
distinguished using differences between images.
[0049]
A1: Delay addition processing is performed assuming that the direction in which the CSP
function is maximum is that of the target signal (the addition is forced even if the correlation is
maximum in the noise direction).
[0050]
A2: Using the CSP function and the image (difference), confirm that it is the target sound source,
and then perform normal delay addition processing.
[0051]
The voice data used in the experiment was recorded with a close-talking microphone in a
soundproof room, and its recognition rate at line input is 100%.
The data comprises a total of 150 utterances, 50 each by two male speakers and one female
speaker.
The utterance content is television operation commands such as "TV ON" and "TV Asahi". As
the target sound, this data was played back through a loudspeaker in the soundproof room. As
noise sources, white noise and music were played from another loudspeaker. The S/N ratio at
the sound source is 10 dB, and the recording conditions are 16 ksamples/s at 16 bits. The
arrangement in FIG. 9 was L = 100 cm, θ = 60°, d = 4.25 cm (= 2vT). The size of the dictionary
used for speech recognition is 99 words, and the number of grammar rules is 13. The speech
recognition decoder is "Julian".
[0052]
FIG. 10 shows the recognition rates for the cases where interpolation processing is required
and for the method excluding interpolation. In methods A2 and B, delay addition processing
with a plurality of microphones always gives better results than recognition with a single
microphone, but in method A1 the single-microphone result is the best. The reason is that in A1
the sound from the direction in which the CSP function is maximized is forcibly emphasized as
if it were the target signal, so sound from the noise source may be emphasized. From these
results, it can be said that the delay addition method excluding interpolation processing gives
the best recognition rate, and that excluding the interpolation processing is effective.
[0053]
Table 1 shows the recognition rate of voice recorded by a single microphone and the
recognition rates averaged over the two-, four-, and eight-microphone configurations used in
the above three methods (A1, A2, and B). The difference column indicates the increase or
decrease in the number of recognized words. A sign test was conducted on these results, and
the recognition rate with white noise in method B was significantly different at the 5% level.
[0054]
[0055]
The effect of the direction estimation error on the recognition rate is examined by comparing
the estimation error of the sound source direction determined by acoustic processing alone
with that determined by combining acoustic processing and image processing.
By shifting the direction in which the microphone array is pointed away from the target sound
source by an angle θ, as shown in FIG. 11, and examining the resulting difference in recognition
rate, the relationship between the estimation error of the sound source direction and its effect
on the recognition rate is examined.
[0056]
The sound data used was the same as above ("TV ON", "TV Asahi", etc.), and white noise was
used as the noise. In these experiments, the arrangement in FIG. 11 was L = 100 cm, ψ = 60°,
d = 4.25 cm (= 2vT), and the angle θ by which the microphone array was moved was varied in
5° steps.
[0057]
FIG. 12 shows the estimation error from the target sound source when sound processing alone
is used and when image processing is used in combination. Compared with the error of sound
processing alone (4.6 degrees), the error when image processing was used in combination was
reduced to about one quarter, 1.18 degrees.
[0058]
FIG. 13 shows the relationship between the estimation error of the sound source direction and
the speech recognition rate. From this result, it can be seen that the smaller the error between
the target sound source direction and the direction of the main axis, the higher the recognition
rate. That is, in order to obtain a high recognition rate, the sound source direction must be
estimated accurately.
[0059]
Next, the improvement in recognition rate when the sound source direction is estimated using
both sound processing and image processing is examined. A general method using
interpolation processing (method 1) is used as the baseline, and it is compared with directing
the microphone array toward the sound source using acoustic processing alone (method 2) and
with using image processing in combination (method 3):
1. After confirming the target sound source using the CSP function and the image difference,
delay addition processing with interpolation is performed.
2. Delay addition without interpolation (sound processing only).
3. Delay addition without interpolation (sound processing and image processing combined).
[0060]
As the audio data, the same data as above ("TV ON", "TV Asahi", etc.) was used, and white noise
was used as the noise.
[0061]
As shown in FIG. 14, a higher recognition rate was obtained with the proposed method
(combined use of sound processing and image processing). Here too, when the above three
methods were compared by a sign test, the results showed a difference in the average values
but no significant difference.
[0062]
Also, regarding the relationship between the number of microphones used and the speech
recognition rate (FIGS. 10, 13, and 14), comparing delay addition with four microphones and
with eight gives similar recognition rates; if anything, the rate is lower when delay addition is
performed with eight. This may be due to sound attenuation, interference between
microphones, errors in the microphone spacing, and so on.
[0063]
In the present embodiment, a delay addition method has been proposed in which acoustic
processing and image processing are used in combination, a plurality of microphones are used,
and the interpolation processing is eliminated. Recognition experiments show that, in a noise
environment in which speech recognized at a 100% rate at line input drops to about 60%, delay
addition using four microphones improves the speech recognition rate by about 8%.
[0064]
Brief description of the drawings:
- Functional block diagram of the microphone device in the present embodiment.
- Diagram showing the relationship between the microphones and the input direction of sound.
- Diagram showing the arrival direction of sound.
- Example operation screen of the array direction correction system in the same embodiment.
- Flowchart showing the processing procedure in the same embodiment.
- Diagram showing the relationship between the audio signal and the sampling times.
- Diagram showing the arrangement for emphasizing the target sound source in the present embodiment.
- Diagram showing the effect of excluding the interpolation processing.
- Diagram showing the arrangement for examining the change in recognition rate due to the direction estimation error in the same embodiment.
- Diagram comparing the direction estimation errors in the same embodiment.
- Diagram showing the relationship between the error in the direction of the target sound source and the speech recognition rate in the same embodiment.
- Experimental results showing the speech recognition rate in this example.
Explanation of Reference Numerals
[0065]
1 ... Microphone device, 2 ... Microphone array, 3 ... A/D converter, 4 ... Delay addition
processing unit, 5 ... Speech recognition device, Ma, Mb, My, Mz ... Microphones