JP2013181899

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2013181899
Abstract: The present invention provides a voice analysis device and the like that can grasp the distance between a speaker and another person located where the speaker's voice can reach. The device includes: a voice information receiving unit that receives information on voice signals of voices acquired by a plurality of microphones which are worn by wearers and, for at least one of the wearers, are disposed at positions differing in distance from the wearer's mouth; a synchrony determination unit that determines the synchrony of the voices from the information on the voice signals acquired from the plurality of wearers; and a distance deriving unit that, when it is determined that the synchrony of the voices exists and a voice is identified as the wearer's own speech based on a comparison of the voice signals acquired by the plurality of microphones disposed at positions differing in distance from the wearer's mouth, derives the distance between the speaker and the other person from the sound pressure of the voice acquired by the microphone disposed on the speaker and the sound pressure of the voice acquired by the microphone disposed on the other person. [Selected figure] Figure 1
Voice analysis device, voice analysis system, and program
[0001]
The present invention relates to a voice analysis device, a voice analysis system, and a program.
[0002]
Patent Document 1 discloses a room sound environment evaluation system comprising: means for inputting data on a building shape, partitions, and noise sources; means for calculating sound pressure levels and intelligibility at a large number of sound receiving points in a room based on these data; and means for creating a sound pressure level distribution map and an intelligibility distribution map at the sound receiving points.
[0003]
Japanese Patent Application Laid-Open No. 7-28481
[0004]
An object of the present invention is to make it possible to grasp, from the attenuation of sound, the distance between a speaker and another person located where the speaker's voice can reach, even if the position of the speaker is not known in advance.
[0005]
According to the first aspect of the present invention, there is provided a voice analysis device comprising: a voice information receiving unit that receives information on voice signals of voices acquired by voice acquisition means which are worn by wearers and, for at least one of the wearers, are disposed at a plurality of positions differing in distance from the wearer's mouth; a synchrony determination unit that determines the synchrony of the voices from the information on the voice signals acquired from the plurality of wearers by the voice information receiving unit; and a distance deriving unit that, when the synchrony determination unit determines that the synchrony of the voices exists and a voice is identified as the wearer's own speech based on a comparison of the voice signals acquired by the plurality of voice acquisition means disposed at different distances from the wearer's mouth, derives the distance between the speaker and the other person from the sound pressure of the voice acquired by the voice acquisition means disposed on the speaker and the sound pressure of the voice acquired by the voice acquisition means disposed on the other person.
[0006]
The invention according to claim 2 is the voice analysis device according to claim 1, further comprising a self-other identification unit that identifies whether a voice acquired by the voice acquisition means is the voice of the wearer or the voice of a person other than the wearer, based on the sound pressure difference between the voices acquired by two of the voice acquisition means having different distances from the wearer's mouth.
The invention according to claim 3 is the voice analysis device according to claim 1 or 2, further comprising: initial value selection means for selecting an initial value of a numerical calculation for deriving the three-dimensional position of the other person, from a facing angle which is the angle at which the wearer and the other person face each other and which is derived from the time differences with which the voice of the other person reaches the voice acquisition means and the distances separating the voice acquisition means; and a position deriving unit that derives the three-dimensional position of the other person by the numerical calculation using the initial value selected by the initial value selection means, wherein the initial value selection means selects the initial value by also using the distance between the speaker and the other person derived by the distance deriving unit.
[0007]
According to a fourth aspect of the present invention, there is provided a voice analysis system comprising: voice acquisition means which are worn by wearers and acquire voices, and, for at least one of the wearers, are disposed at a plurality of positions differing in distance from the wearer's mouth; a voice information transmitting unit that transmits information on the voice signals of the voices acquired by the voice acquisition means; a voice information receiving unit that receives the information on the voice signals transmitted by the voice information transmitting unit; a synchrony determination unit that determines the synchrony of the voices from the information on the voice signals acquired from the plurality of wearers by the voice information receiving unit; and a distance deriving unit that, when the synchrony determination unit determines that the synchrony of the voices exists and a voice is identified as the wearer's own speech based on a comparison of the voice signals acquired by the plurality of voice acquisition means disposed at different distances from the wearer's mouth, derives the distance between the speaker and the other person from the sound pressure of the voice acquired by the voice acquisition means disposed on the speaker and the sound pressure of the voice acquired by the voice acquisition means disposed on the other person.
[0008]
The invention according to claim 5 is a program that causes a computer to realize: a function of receiving information on voice signals of voices acquired by voice acquisition means which are worn by wearers and, for at least one of the wearers, are disposed at a plurality of positions differing in distance from the wearer's mouth; a function of determining the synchrony of the voices from the information on the voice signals acquired from the plurality of wearers; and a function of, when it is determined that the synchrony of the voices exists and a voice is identified as the wearer's own speech based on a comparison of the voice signals acquired by the plurality of voice acquisition means disposed at different distances from the wearer's mouth, deriving the distance between the speaker and the other person from the sound pressure of the voice acquired by the voice acquisition means disposed on the speaker and the sound pressure of the voice acquired by the voice acquisition means disposed on the other person.
[0009]
According to the invention of claim 1, it is possible to provide a voice analysis device that can grasp, from the attenuation of sound, the distance between the speaker and another person located where the speaker's voice can reach, even if the position of the speaker is not known in advance.
According to the invention of claim 2, the voice analysis device can be provided with a function of identifying whether a voice acquired by the voice acquisition means is the voice of the wearer or the voice of a person other than the wearer.
According to the invention of claim 3, the positional accuracy when obtaining the three-dimensional position of the speaker is improved compared with the case where the present invention is not adopted.
According to the invention of claim 4, it is possible to build a voice analysis system that can grasp, from the attenuation of sound, the distance between the speaker and another person located where the speaker's voice can reach, even if the position of the speaker is not known in advance.
According to the invention of claim 5, the functions that make it possible to grasp, from the attenuation of sound, the distance between the speaker and another person located where the speaker's voice can reach, even if the position of the speaker is not known in advance, can be realized by a computer.
[0010]
FIG. 1 is a diagram showing a configuration example of the voice analysis system according to the present embodiment.
FIG. 2 is a diagram showing a configuration example of the terminal device in the present embodiment.
FIG. 3 is a diagram showing the positional relationship between the mouths (speaking parts) of the wearer and the other person and the microphones.
FIG. 4 is a diagram showing the relationship between the distance from a microphone to a sound source and the sound pressure (input volume).
FIG. 5 is a diagram showing the method of discriminating between the wearer's own utterance voice and another person's utterance voice.
FIG. 6 is a flowchart showing the operation of the terminal device in the present embodiment.
FIG. 7 is a diagram explaining the data analysis unit in more detail.
FIG. 8 is a flowchart explaining the operation of the data analysis unit.
FIGS. 9(a) to 9(c) are diagrams explaining the method of obtaining the time difference Δt12 in the present embodiment.
FIG. 10 is a diagram explaining the facing angle in the present embodiment.
FIG. 11 is a diagram explaining the method of obtaining the facing angle α using the first microphone and the second microphone.
FIG. 12 is a conceptual diagram showing the positional relationship between the points M1, M2, M3, and S.
FIG. 13 is a diagram extracting and explaining the cone whose apex is the midpoint C12 of the points M1 and M2 and whose half apex angle is α.
FIG. 14 is a conceptual diagram showing the three-dimensional positions of the set candidate points.
FIG. 15 is a diagram explaining the initial value selection unit in more detail.
FIG. 16 is a flowchart showing the operation of the initial value selection unit in the present embodiment.
FIG. 17 is a diagram showing a situation in which a plurality of wearers each wearing the terminal device of the present embodiment are having a conversation.
FIG. 18 is a diagram showing an example of the utterance information of each terminal device in the conversation situation of FIG. 17.
[0011]
Hereinafter, embodiments of the present invention will be described in detail with reference to
the accompanying drawings. <System Configuration Example> FIG. 1 is a view showing a
configuration example of a speech analysis system according to the present embodiment. As
shown in FIG. 1, the voice analysis system 1 of the present embodiment is configured to include a
terminal device 10 and a host device 20 which is an example of a voice analysis device. The
terminal device 10 and the host device 20 are connected via a wireless communication line. As a
type of wireless communication line, a line according to an existing system such as Wi-Fi
(Wireless Fidelity), Bluetooth (registered trademark), ZigBee, UWB (Ultra Wideband) may be
used. Further, in the illustrated example, only one terminal device 10 is shown, but as will be described in detail later, the terminal device 10 is worn and used by each user, and in practice terminal devices 10 equal in number to the users are prepared. Hereinafter, a user wearing the terminal device 10 is referred to as a wearer.
[0012]
The terminal device 10 includes, as voice acquisition means for acquiring the voice of a speaker, a plurality of microphones (a first microphone 11a, a second microphone 11b, and a third microphone 11c) and amplifiers (a first amplifier 13a, a second amplifier 13b, and a third amplifier 13c). The terminal device 10 further includes a voice analysis unit 15 that analyzes the recorded voice, a data transmission unit 16 that transmits the analysis result to the host device 20, and a power supply unit 17.
[0013]
In the present embodiment, the first microphone 11a, the second microphone 11b, and the third
microphone 11c are spaced apart at a predetermined distance. As types of microphones used as
the first microphone 11a, the second microphone 11b, and the third microphone 11c in the
present embodiment, various existing microphones such as a dynamic type and a capacitor type
may be used. In particular, a nondirectional MEMS (Micro Electro Mechanical Systems)
microphone is preferable.
[0014]
The first amplifier 13a, the second amplifier 13b, and the third amplifier 13c amplify the electric
signal (audio signal) output by the first microphone 11a, the second microphone 11b, and the
third microphone 11c in accordance with the acquired voice. As the amplifiers used as the first
amplifier 13a, the second amplifier 13b, and the third amplifier 13c of the present embodiment,
an existing operational amplifier or the like may be used.
[0015]
The voice analysis unit 15 analyzes the voice signals output from the first amplifier 13a and the second amplifier 13b, and determines whether the voice acquired by the first microphone 11a and the second microphone 11b is a voice uttered by the wearer of the terminal device 10 or a voice uttered by another person. That is, the voice analysis unit 15 functions as a self-other identification unit that identifies the speaker of a voice based on the voices acquired by the first microphone 11a and the second microphone 11b. The specific processing for this speaker identification will be described later.
[0016]
The data transmission unit 16 transmits the acquired data including the analysis result by the
voice analysis unit 15 and the ID of the terminal to the host device 20 via the above-described
wireless communication line. The information transmitted to the host device 20 may include, in addition to the above analysis result and according to the contents of the processing performed in the host device 20, information such as the acquisition times of the voices at the first microphone 11a, the second microphone 11b, and the third microphone 11c and the sound pressures of the acquired voices. The terminal device 10 may also be provided with a data storage unit that stores the analysis results of the voice analysis unit 15, and the stored data for a fixed period may be transmitted in a batch. The transmission may also be performed via a wired line. In the present embodiment, the data transmission unit 16 functions as a voice information transmitting unit that transmits information on the voice signals of the voices acquired by the first microphone 11a, the second microphone 11b, and the third microphone 11c.
[0017]
The power supply unit 17 supplies power to the first microphone 11a, the second microphone 11b, and the third microphone 11c, the first amplifier 13a, the second amplifier 13b, and the third amplifier 13c, the voice analysis unit 15, and the data transmission unit 16 described above. As the power supply, for example, an existing power supply such as a dry battery or a rechargeable battery is used. Further, the power supply unit 17 includes known circuits such as a voltage conversion circuit and a charge control circuit, as necessary.
[0018]
The host device 20 includes a data receiving unit 21 that receives data transmitted from the terminal device 10, a data storage unit 22 that stores the received data, a data analysis unit 23 that analyzes the stored data, and an output unit 24 that outputs the analysis result. The host device 20 is
realized by, for example, an information processing device such as a personal computer. Further,
as described above, in the present embodiment, a plurality of terminal devices 10 are used, and
the host device 20 receives data from each of the plurality of terminal devices 10.
[0019]
The data reception unit 21 is compatible with the above-described wireless communication line, receives data from each terminal device 10, and passes the data to the data storage unit 22. In the present embodiment, the data reception unit 21 functions as a voice information receiving unit that receives the information on the voice signals transmitted by the data transmission unit 16. The data
storage unit 22 is realized by, for example, a storage device such as a magnetic disk device of a
personal computer, and stores the reception data acquired from the data reception unit 21 for
each speaker. Here, the identification of the speaker is performed by collating the terminal ID
transmitted from the terminal device 10 with the speaker name and the terminal ID registered in
advance in the host device 20. Further, instead of the terminal ID, the wearer's name may be
transmitted from the terminal device 10.
[0020]
The data analysis unit 23 is realized by, for example, a program-controlled CPU of a personal
computer, and analyzes data stored in the data storage unit 22. The specific analysis content and
analysis method can take various contents and methods according to the usage purpose and
usage mode of the system of the present embodiment. For example, the frequency of conversation between wearers of the terminal devices 10 and the tendency of each wearer's conversation partners may be analyzed, or the relationship between interlocutors may be inferred from information such as the length and sound pressure of each utterance in a dialogue. Further, in the present embodiment, the data analysis unit 23 derives the three-dimensional position of the speaker, which will be described in detail later.
[0021]
The output unit 24 outputs the analysis result of the data analysis unit 23 or performs output based on the analysis result. The means for outputting the analysis result can take various forms, such as display on a screen, printed output by a printer, and voice output, depending on the purpose and mode of use of the system and on the contents and format of the analysis result.
[0022]
<Example of Configuration of Terminal Device> FIG. 2 is a view showing an example of the configuration of the terminal device 10. As described above, the terminal device 10 is worn and used by each user. So that the user can wear it, as shown in FIG. 2, the terminal device 10 of the present embodiment comprises an apparatus main body 30 and a strap 40 connected to the apparatus main body 30. In the illustrated configuration, the user puts the strap 40 around the neck and hangs the apparatus main body 30 from the neck.
[0023]
The apparatus main body 30 houses, in a thin rectangular parallelepiped case 31 made of metal or resin, at least the circuits that realize the first amplifier 13a, the second amplifier 13b, the third amplifier 13c, the voice analysis unit 15, the data transmission unit 16, and the power supply unit 17, as well as the power supply (battery) of the power supply unit 17. Further, in the present embodiment, the case 31 is provided with the third microphone 11c. Furthermore, the case 31 may be provided with a pocket into which an ID card or the like displaying ID information such as the name or affiliation of the wearer is inserted. Such ID information or the like may also be written on the surface of the case 31 itself.
[0024]
The strap 40 is provided with the first microphone 11a and the second microphone 11b (hereinafter, when the first microphone 11a, the second microphone 11b, and the third microphone 11c need not be distinguished from one another, they are simply referred to as microphones 11a, 11b, and 11c). As materials of the strap 40, various existing materials such as leather, synthetic leather, natural fibers such as cotton, synthetic fibers such as resin, and metals may be used. A coating process using a silicone resin, a fluorine resin, or the like may also be applied.
[0025]
The strap 40 has a tubular structure, and the microphones 11a and 11b are housed inside the strap 40. By providing the microphones 11a and 11b inside the strap 40, it is possible to prevent the microphones 11a and 11b from being damaged or soiled, and to keep the conversation partner from being conscious of the presence of the microphones 11a and 11b. In the present embodiment, the apparatus main body 30 is worn by the user by hanging the ring-shaped strap 40 around the neck.
[0026]
<Identification of Speaker (Self or Other) Based on Non-verbal Information of Recorded Voice> Next, a method of identifying a speaker in this embodiment will be described. The system according to the present embodiment uses the information of the voices recorded by the two microphones 11a and 11c provided in the terminal device 10 to discriminate between the voice of the wearer of the terminal device 10 and the voice of another person. In other words, the present embodiment identifies self and other with respect to the speaker of the recorded voice. Further, in the present embodiment, the speaker is identified based on non-verbal information such as sound pressure (input volume at the microphones 11a and 11c), not on linguistic information obtained by morphological analysis or dictionary information. In other words, the speaker of the voice is identified from the utterance situation specified by the non-linguistic information, not from the utterance content specified by the linguistic information.
[0027]
As described with reference to FIGS. 1 and 2, in the present embodiment, the third microphone 11c of the terminal device 10 is disposed at a position far from the mouth (speaking part) of the wearer, and the first microphone 11a is disposed at a position close to the mouth (speaking part) of the wearer. That is, when the wearer's mouth (speaking part) is regarded as a sound source, the distance between the third microphone 11c and the sound source and the distance between the first microphone 11a and the sound source differ greatly. Specifically, the distance between the third microphone 11c and the sound source is about 1.5 to 4 times the distance between the first microphone 11a and the sound source. Here, the sound pressure of the sound recorded at the microphones 11a and 11c attenuates (distance attenuation) as the distance between the microphones 11a and 11c and the sound source increases. Therefore, for the speech voice of the wearer, the sound pressure of the recorded voice at the third microphone 11c and the sound pressure of the recorded voice at the first microphone 11a differ greatly.
[0028]
On the other hand, when the mouth (speaking part) of a person other than the wearer (another person) is regarded as the sound source, the distance between the third microphone 11c and the sound source and the distance between the first microphone 11a and the sound source do not differ greatly, because the other person is separated from the wearer. Although there may be some difference between the two depending on the position of the other person with respect to the wearer, the distance between the third microphone 11c and the sound source will not be several times the distance between the first microphone 11a and the sound source, as it is when the wearer's mouth (speaking part) is the sound source. Therefore, for the speech voice of the other person, the sound pressure of the recorded voice at the third microphone 11c and the sound pressure of the recorded voice at the first microphone 11a do not differ greatly, unlike the case of the wearer's own speech.
[0029]
FIG. 3 is a diagram showing the positional relationship between the mouths of the wearer and the
other person (speaking parts) and the microphones 11a and 11c. In the relationship shown in
FIG. 3, the distance between the third microphone 11c and the sound source a, which is the
mouth of the wearer (a vocal region), is La3, and the distance between the sound source a and
the first microphone 11a is La1. Further, the distance between the sound source b which is the
other person's mouth (speaking part) and the third microphone 11c is Lb3, and the distance
between the sound source b and the first microphone 11a is Lb1. In this case, the following
relationship holds: La3 > La1 (La3 ≒ 1.5 × La1 to 4 × La1), Lb3 ≒ Lb1
[0030]
FIG. 4 is a diagram showing the relationship between the distance between the microphones 11a
and 11c and the sound source and the sound pressure (input volume). As described above, the
sound pressure attenuates in accordance with the distance between the microphones 11a and
11c and the sound source. In FIG. 4, when the sound pressure Ga3 at the distance La3 and the
sound pressure Ga1 at the distance La1 are compared, the sound pressure Ga1 is about four
times the sound pressure Ga3. On the other hand, since the distance Lb3 and the distance Lb1
approximate each other, the sound pressure Gb3 at the distance Lb3 and the sound pressure Gb1
at the distance Lb1 are substantially equal. Therefore, in the present embodiment, this difference in sound pressure ratio is used to discriminate between the wearer's own utterance voice and the other person's utterance voice in the recorded voice. In the example shown in FIG. 4, the distances Lb3 and Lb1 are 60 cm, but this merely illustrates that the sound pressure Gb3 and the sound pressure Gb1 are almost equal; the distances Lb3 and Lb1 are not limited to the values shown in the figure.
[0031]
FIG. 5 is a diagram showing a method of discriminating between the wearer's own utterance voice and another person's utterance voice. As described with reference to FIG. 4, for the voice of the wearer, the sound pressure Ga1 at the first microphone 11a is several times (for example, about 4 times) the sound pressure Ga3 at the third microphone 11c. For the utterance voice of the other person, the sound pressure Gb1 at the first microphone 11a is approximately equal to (about 1 time) the sound pressure Gb3 at the third microphone 11c. Therefore, in this embodiment, a threshold is set for the ratio of the sound pressure at the first microphone 11a to the sound pressure at the third microphone 11c. A voice whose sound pressure ratio is larger than the threshold is judged to be the wearer's own utterance, and a voice whose sound pressure ratio is smaller than the threshold is judged to be another person's utterance. In the example shown in FIG. 5, the threshold is 2; since the sound pressure ratio Ga1/Ga3 exceeds the threshold 2, the voice is judged to be the wearer's own utterance, and since the sound pressure ratio Gb1/Gb3 is smaller than the threshold 2, the voice is judged to be another person's utterance.
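As an illustration of this threshold-based judgment, the following is a minimal Python sketch assuming the average sound pressures per time unit are already available; the gain floor value and the threshold of 2 (taken from the example of FIG. 5) are illustrative assumptions.

    def classify_frame(avg_p_near, avg_p_far, ratio_threshold=2.0, gain_floor=1e-6):
        """Classify one time unit as 'self', 'other', or 'no speech'.

        avg_p_near: average sound pressure at the microphone close to the mouth (e.g. 11a)
        avg_p_far:  average sound pressure at the microphone far from the mouth (e.g. 11c)
        """
        # No gain in the average sound pressure -> no utterance voice
        if avg_p_near < gain_floor or avg_p_far < gain_floor:
            return "no speech"
        # Compare the sound pressure ratio with the threshold
        ratio = avg_p_near / avg_p_far
        if ratio > ratio_threshold:
            return "self"      # judged as the wearer's own utterance
        return "other"         # judged as another person's utterance

    # Example following FIG. 5: Ga1/Ga3 = 4 -> 'self', Gb1/Gb3 = 1 -> 'other'
    print(classify_frame(4.0, 1.0))
    print(classify_frame(1.0, 1.0))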
[0032]
The voices recorded by the microphones 11a and 11c include so-called noise, such as environmental sound, in addition to utterance voices. The relationship of the distances between a noise source and the microphones 11a and 11c is similar to that for another person's utterance. That is, in the examples shown in FIGS. 4 and 5, the distance between the noise source c and the first microphone 11a is Lc1, and the distance between the noise source c and the third microphone 11c is Lc3; the distance Lc1 and the distance Lc3 approximate each other. The sound pressure ratio Gc1/Gc3 in the recordings of the microphones 11a and 11c therefore becomes smaller than the threshold 2. However, such noise is separated and removed from the utterance voices by performing filtering processing using an existing technique such as a band-pass filter or a gain filter.
[0033]
<Operation Example of Terminal Device> FIG. 6 is a flowchart showing the operation of the terminal device 10 in the present embodiment. As shown in FIG. 6, when the microphones 11a and 11c of the terminal device 10 acquire voices, electrical signals (voice signals) corresponding to the acquired voices are sent from the microphones 11a and 11c to the first amplifier 13a and the third amplifier 13c (step 101). When the first amplifier 13a and the third amplifier 13c acquire the voice signals from the microphones 11a and 11c, they amplify the signals and send them to the voice analysis unit 15 (step 102).
[0034]
The voice analysis unit 15 performs filtering processing on the signals amplified by the first amplifier 13a and the third amplifier 13c to remove noise components such as environmental sound from the signals (step 103). Next, for the signals from which the noise components have been removed, the voice analysis unit 15 obtains the average sound pressure of the voice recorded at each of the microphones 11a and 11c in predetermined time units (for example, several tenths of a second to several hundredths of a second) (step 104). Then, the ratio (sound pressure ratio) between the average sound pressure at the first microphone 11a and the average sound pressure at the third microphone 11c is obtained (step 105).
[0035]
Next, when there is a gain in the average sound pressure at each of the microphones 11a and 11c obtained in step 104 (Yes in step 106), the voice analysis unit 15 determines that an utterance voice is present (an utterance has been made). Then, if the sound pressure ratio obtained in step 105 is larger than the threshold (Yes in step 107), the voice analysis unit 15 determines that the utterance voice is the wearer's own utterance (step 108). If the sound pressure ratio obtained in step 105 is smaller than the threshold (No in step 107), the voice analysis unit 15 determines that the utterance voice is another person's utterance (step 109). On the other hand, when there is no gain in the average sound pressure at each of the microphones 11a and 11c obtained in step 104 (No in step 106), the voice analysis unit 15 determines that there is no utterance voice (no utterance has been made) (step 110). Note that, considering the case where noise that could not be removed by the filtering process of step 103 remains in the signal, the determination in step 106 may judge that there is a gain only if the value of the average sound pressure gain is a certain value or more. The order of step 105 of obtaining the sound pressure ratio and step 106 of determining the presence or absence of the gain may be reversed, and the sound pressure ratio may be obtained only when there is a gain.
[0036]
Thereafter, the voice analysis unit 15 causes the data transmission unit 16 to transmit the information obtained in the processing of steps 104 to 110 (the presence or absence of an utterance and information on the speaker) to the host device 20 as an analysis result (step 111). At this time, the length of the utterance time of each speaker (the wearer or another person), the value of the average sound pressure gain, and other additional information may be transmitted to the host device 20 together with the analysis result. Further, in the above-described example, the pair of microphones 11a and 11c is used, but the same determination can be made using the pair of microphones 11b and 11c.
[0037]
Next, a method of obtaining the three-dimensional position of the other person from the information on the voice signals, including the self-other identification result, will be described. In the present embodiment, this is derived by the data analysis unit 23 of the host device 20. FIG. 7 is a diagram illustrating the data analysis unit 23 in more detail. As shown in FIG. 7, the data analysis unit 23 includes: a time difference deriving unit 231 that obtains the time differences with which the voice of the other person reaches the microphones 11a, 11b, and 11c; a facing angle deriving unit 232 that derives, from these time differences and the distances separating the microphones 11a, 11b, and 11c, the facing angles, which are the angles at which the wearer and the other person face each other; an initial value selection unit 233, as an example of initial value selection means, that selects an initial value of the numerical calculation for deriving the three-dimensional position of the other person from the facing angles; a LUT storage unit 234 that stores a look-up table (LUT) used by the initial value selection unit 233 to select the initial value; and a numerical calculation unit 235, as an example of a position deriving unit, that derives the three-dimensional position of the other person by numerical calculation using the initial value selected by the initial value selection unit 233.
[0038]
FIG. 8 is a flowchart for explaining the operation of the data analysis unit 23. The operation of
the data analysis unit 23 will be described below using FIG. 7 and FIG.
[0039]
First, the data analysis unit 23 acquires the information on the voice signals received by the data reception unit 21 via the data storage unit 22 (step 201). Next, from the information on the voice signals received by the data reception unit 21, the time difference deriving unit 231 obtains the time differences with which the voice of the other person reaches the microphones 11a, 11b, and 11c, by a method described later in detail (step 202). Specifically, the time difference Δt12 between the first microphone 11a and the second microphone 11b, the time difference Δt23 between the second microphone 11b and the third microphone 11c, and the time difference Δt31 between the third microphone 11c and the first microphone 11a are obtained.
[0040]
Furthermore, the facing angle deriving unit 232 obtains the facing angles, which are the angles at which the wearer and the other person face each other, from the time differences Δt12, Δt23, and Δt31 and the distances between the microphones 11a, 11b, and 11c, by a method described later in detail (step 203). The facing angle will also be described in detail later. Specifically, the facing angle α is obtained from the time difference Δt12 and the distance D12 between the first microphone 11a and the second microphone 11b. Similarly, the facing angle β is obtained from the time difference Δt23 and the distance D23 between the second microphone 11b and the third microphone 11c, and the facing angle γ is obtained from the time difference Δt31 and the distance D31 between the third microphone 11c and the first microphone 11a.
[0041]
Next, the initial value selection unit 233 refers to the LUT storage unit 234 to select an initial
value for deriving the three-dimensional position of the other person (step 204). The method of
selecting this initial value will also be described in detail later. Then, the numerical calculation
unit 235 performs numerical calculation starting from the selected initial value, and derives the
other's three-dimensional position (step 205). The information on the other's three-dimensional
position is output to the output unit 24 (step 206).
[0042]
<Description of Method of Finding Time Difference of Arrival of Voice of Others to Microphones>
The time differences Δt12, Δt23, and Δt31 with which the voice of the other person reaches the microphones 11a, 11b, and 11c can be obtained as follows. The following description takes the case of obtaining the time difference Δt12 as an example, but the time differences Δt23 and Δt31 can be obtained by the same method.
[0043]
FIGS. 9A to 9C are diagrams for explaining the method of obtaining the time difference Δt12 in
the present embodiment. Among these, FIG. 9A is a diagram in which the voice of the other
person reaching the first microphone 11a and the second microphone 11b is sampled at a
sampling frequency of 1 MHz, and 5000 consecutive points are extracted from the data. Here,
the horizontal axis represents the data number assigned to each of the 5000 points of data, and
the vertical axis represents the amplitude of the other person's voice. The solid line is the waveform signal of the other person's voice that has reached the first microphone 11a, and the dotted line is the waveform signal of the other person's voice that has reached the second microphone 11b.
[0044]
In the present embodiment, the cross-correlation function of these two waveform signals is obtained. That is, one waveform signal is fixed, the other is shifted, and the sum of products is calculated. FIGS. 9(b) to 9(c) are diagrams showing the cross-correlation function of these two waveform signals. Of these, FIG. 9(b) is the cross-correlation function over the entire 5000 sampled points, and FIG. 9(c) is an enlarged view of the vicinity of the peak of the cross-correlation function shown in FIG. 9(b). FIGS. 9(b) to 9(c) show the case where the cross-correlation function is obtained by fixing the waveform signal of the other person's voice reaching the first microphone 11a and shifting the waveform signal of the other person's voice reaching the second microphone 11b. As shown in FIG. 9(c), the peak position is shifted by −227 points with respect to data number 0. This means that the other person's voice arriving at the second microphone 11b is delayed by this amount with respect to the first microphone 11a. In the present embodiment, since the sampling frequency is 1 MHz as described above, the time between sampled data points is 1 × 10^−6 (s). Therefore, the delay time is 227 × 1 × 10^−6 (s) = 227 (μs). That is, in this case, the time difference Δt12 is 227 (μs).
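As an illustration, the cross-correlation peak search described above can be sketched in Python with NumPy as follows; the zero-mean normalization is an added assumption for robustness, and the 1 MHz sampling frequency follows the example of FIG. 9.

    import numpy as np

    def arrival_time_difference(sig_mic1, sig_mic2, fs=1_000_000):
        """Estimate the arrival delay (in seconds) of sig_mic2 relative to sig_mic1.

        sig_mic1, sig_mic2: equal-length waveform arrays of the other person's voice
        fs: sampling frequency in Hz (1 MHz in the example of FIG. 9)
        """
        a = sig_mic1 - np.mean(sig_mic1)
        b = sig_mic2 - np.mean(sig_mic2)
        # Full cross-correlation; index (len(b) - 1) corresponds to zero lag
        corr = np.correlate(a, b, mode="full")
        lag = np.argmax(corr) - (len(b) - 1)   # e.g. -227 points in FIG. 9(c)
        return lag / fs                        # magnitude gives dt12, e.g. 227 us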
[0045]
<Description of Facing Angle> FIG. 10 is a diagram for explaining the facing angle in the present embodiment. In the present embodiment, the facing angle is, as described above, the angle at which the wearer of the terminal device 10 and the other person face each other. FIG. 10 shows the facing angle defined in the present embodiment. Here, as an example of the facing angle according to the present embodiment, the facing angle α based on the first microphone 11a and the second microphone 11b is illustrated. In the present embodiment, the facing angle α is defined as the angle between the line connecting the first microphone 11a and the second microphone 11b, which are two voice acquisition means, and the line connecting the midpoint of that line and the other person. This makes mathematical handling of the facing angle α easier. With this definition, for example, when the wearer and the other person face each other directly from the front, the facing angle α between the two is 90°. The facing angle β based on the second microphone 11b and the third microphone 11c and the facing angle γ based on the third microphone 11c and the first microphone 11a can be defined similarly.
[0046]
<Description of Method of Obtaining Facing Angle> FIG. 11 is a diagram for explaining a method of obtaining the facing angle α using the first microphone 11a and the second microphone 11b. Here, the point M1 is the position of the first microphone 11a, and the point M2 is the position of the second microphone 11b. The point S is the position of the other person. More precisely, the position of the other person is the position of the utterance point that is the sound source of the other person's voice. The sound emitted from the point S, which is the utterance point, spreads concentrically. Since the voice propagates at the speed of sound, which is finite, the time at which the voice reaches the point M1 at the position of the first microphone 11a and the time at which it reaches the point M2 at the position of the second microphone 11b differ, and a time difference Δt12 corresponding to the path difference δ12 occurs. Assuming that the distance between the point M1 and the point M2 is D12 and the distance between the midpoint C12 of the points M1 and M2 and the point S is L12, the following equation (1) holds.
[0047]
δ12 = (L12^2 + L12·D12·cos α + D12^2/4)^0.5 − (L12^2 − L12·D12·cos α + D12^2/4)^0.5 …(1)
[0048]
In the case of L12 >> D12, the influence of the D12^2/4 terms is small, so equation (1) can be approximated by the following equation (2).
[0049]
δ12 ≒ D12·cos α …(2)
[0050]
Further, using the sound velocity c and the time difference Δt12, the following equation (3) is
established.
[0051]
δ12 = c·Δt12 …(3)
[0052]
That is, the facing angle α can be obtained by using equations (2) and (3).
In other words, the facing angle α, which is the angle at which the wearer and the other person face each other, can be derived from the time difference Δt12 with which the other person's voice reaches the first microphone 11a and the second microphone 11b, which are two voice acquisition means, and the distance D12 separating the first microphone 11a and the second microphone 11b.
The facing angles β and γ can be derived similarly.
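Combining equations (2) and (3) gives cos α ≒ c·Δt12/D12, so the facing angle can be computed directly. The short Python sketch below illustrates this; the sound speed of 340 m/s, the clamping of the arccos argument against measurement noise, and the microphone spacing used in the example are assumptions added for illustration.

    import math

    def facing_angle(delta_t, mic_distance, sound_speed=340.0):
        """Facing angle (radians) from the arrival-time difference and microphone spacing.

        delta_t: arrival-time difference of the other person's voice (s), e.g. dt12
        mic_distance: distance between the two microphones (m), e.g. D12
        """
        # From eqs. (2) and (3): delta12 = c * delta_t ~= D12 * cos(alpha)
        cos_alpha = sound_speed * delta_t / mic_distance
        cos_alpha = max(-1.0, min(1.0, cos_alpha))   # guard against noise
        return math.acos(cos_alpha)

    # Example with dt12 = 227 us and an assumed spacing D12 = 0.15 m:
    # cos(alpha) ~= 340 * 227e-6 / 0.15 ~= 0.51, so alpha ~= 59 degrees
    print(math.degrees(facing_angle(227e-6, 0.15)))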
[0053]
<Description of Method of Deriving Three-Dimensional Position of Another Person> Next, a
method of deriving the three-dimensional position of another person will be described using the
facing angles α, β, and γ obtained as described above.
First, assuming that the positions of the microphones 11a, 11b, and 11c are point M1, point M2,
and point M3, respectively, a triangle having the points M1, M2, and M3 as vertices is
ΔM1M2M3.
The three-dimensional coordinates of the vertices M1, M2 and M3 are here (xM1, yM1, zM1),
(xM2, yM2, zM2) and (xM3, yM3, zM3), respectively.
Further, the three-dimensional coordinates of the midpoint C12 of the line segment M1M2, the midpoint C23 of the line segment M2M3, and the midpoint C31 of the line segment M3M1 are (xC12, yC12, zC12), (xC23, yC23, zC23), and (xC31, yC31, zC31), respectively. Also, let the three-dimensional coordinates of the point S, which is the position of the other person, be (x, y, z).
[0054]
FIG. 12 is a conceptual diagram showing the positional relationship between the points M1, M2, M3, and S. In FIG. 12, the cone whose apex is the midpoint C12 of the points M1 and M2 and whose half apex angle is α, and the cone whose apex is the midpoint C31 of the points M3 and M1 and whose half apex angle is γ, are shown by solid lines. Based on the definition of the facing angle described above, the point S lies somewhere on these conical surfaces (the sides of the cones). The same holds for the cone whose apex is the midpoint C23 of the points M2 and M3 and whose half apex angle is β. That is, the point S is at the intersection of the conical surfaces of these three cones. In FIG. 12, for ease of explanation, the cone whose apex is the midpoint C23 of the points M2 and M3 and whose half apex angle is β is not shown.
[0055]
FIG. 13 is a diagram extracting and explaining the cone whose apex is the midpoint C12 of the points M1 and M2 and whose half apex angle is α. In FIG. 13, a perpendicular is drawn from the midpoint C12 to the base of the cone, and the a vector is taken to extend from the midpoint C12 in the direction of this perpendicular. Also, the r vector is taken to extend from the midpoint C12 in a direction along the conical surface. At this time, the point M2 lies on this perpendicular, and the angle between the a vector and the r vector is the facing angle α. When the inner product of the a vector and the r vector is used to express the relationship between the a vector, the r vector, and the facing angle α, the following equations (4) and (5) hold.
[0056]
[0057]
[0058]
Further, using three-dimensional coordinates of the point M1, the middle point C12, and the
point S, the following equations (6), (7), and (8) are established.
[0059]
[0060]
[0061]
[0062]
When the relationship of the equation (8) is applied to the facing angles β and γ, the following
equations (9) and (10) are established.
[0063]
[0064]
[0065]
That is, the three-dimensional coordinates (x, y, z) of the point S, which is the position of the other person, can be obtained by solving the simultaneous ternary quadratic equations given by the three equations (8) to (10).
As described above, in the present embodiment, the sound source of the other person's voice lies on the side of a cone whose apex is the midpoint of the line connecting two microphones and whose half apex angle is the facing angle, and the three-dimensional position of the other person is derived by using the fact that the sound source lies at the intersection of the sides of these cones.
[0066]
However, it is difficult to analytically solve the simultaneous ternary quadratic equations (8) to (10).
Therefore, in the present embodiment, the three-dimensional coordinates (x, y, z) of the point S
are obtained by a numerical solution method.
However, there is a problem that the solution is apt to diverge when solving nonlinear equations
such as the equations (8) to (10) even by the numerical solution method.
Whether the solution diverges or converges depends on how to give the initial value.
Therefore, in order to obtain a convergent solution, selection of an initial value is important.
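Equations (8) to (10) themselves are not reproduced in this translation, so the following Python sketch is only illustrative: it writes each cone constraint from the geometric description of FIG. 13 (the cosine of the angle between the cone axis, taken from the midpoint toward one of the microphones, and the vector from the midpoint to S equals the cosine of the facing angle) and uses scipy.optimize.fsolve as one possible numerical solver. The microphone coordinates, the choice of axis point for each cone, and the sample values are assumptions added for illustration; the embodiment only requires some numerical method such as the Newton method.

    import numpy as np
    from scipy.optimize import fsolve

    def cone_residual(s, apex, axis_point, half_angle):
        """cos(angle between apex->axis_point and apex->s) minus cos(half_angle)."""
        a = np.asarray(axis_point, dtype=float) - np.asarray(apex, dtype=float)
        r = np.asarray(s, dtype=float) - np.asarray(apex, dtype=float)
        return np.dot(a, r) / (np.linalg.norm(a) * np.linalg.norm(r)) - np.cos(half_angle)

    def solve_position(m1, m2, m3, alpha, beta, gamma, initial_value):
        """Solve for the other person's position S from the three cone constraints."""
        c12 = (np.asarray(m1, dtype=float) + np.asarray(m2, dtype=float)) / 2.0
        c23 = (np.asarray(m2, dtype=float) + np.asarray(m3, dtype=float)) / 2.0
        c31 = (np.asarray(m3, dtype=float) + np.asarray(m1, dtype=float)) / 2.0
        def equations(s):
            return [cone_residual(s, c12, m2, alpha),   # cone with apex C12, half angle alpha
                    cone_residual(s, c23, m3, beta),    # cone with apex C23, half angle beta
                    cone_residual(s, c31, m1, gamma)]   # cone with apex C31, half angle gamma
        return fsolve(equations, initial_value)

    # Illustrative call with assumed microphone coordinates (m), angles (rad), and initial value:
    # s = solve_position((0, 0, 0), (0.15, 0, 0), (0.07, -0.3, 0),
    #                    1.03, 1.2, 1.1, initial_value=(1.0, 1.0, 0.0))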
[0067]
<Description of Method of Selecting Initial Value> Therefore, in the present embodiment, candidate points that serve as candidates for the initial value are prepared in advance in the three-dimensional space in which the terminal device 10 and the other person are placed. By selecting as the initial value a candidate point closer to the three-dimensional coordinates (x, y, z) of the point S, a convergent solution is more easily obtained.
More specifically, a predetermined origin is set, and candidate points are set at predetermined intervals in the x-axis, y-axis, and z-axis directions from the origin.
[0068]
FIG. 14 is a conceptual diagram showing three-dimensional positions of candidate points to be
set.
Here, candidate points are set at equal intervals in the x-axis, y-axis, and z-axis directions from
the origin O.
These candidate points can be set, for example, at intervals of 1 m over a range of 10 m in each of the x-axis, y-axis, and z-axis directions.
In the present embodiment, the facing angles α, β, and γ with respect to the candidate points set in this way are obtained in advance, and the relationship between the three-dimensional coordinates of the candidate points and the facing angles α, β, and γ is held as a LUT (look-up table). In the present embodiment, this LUT is stored in the LUT storage unit 234. The initial value selection unit 233 then refers to this LUT and compares the values of the actually derived facing angles α, β, and γ with the values of the facing angles α, β, and γ stored in the LUT. Then, a candidate point whose facing angles α, β, and γ each differ relatively little from the derived values is selected, and the three-dimensional coordinates corresponding to it are selected as the initial value. In this way, in the present embodiment, the initial value is selected from among the candidate points by comparing the facing angles for the candidate points, which are set in the three-dimensional space in which the other person is placed and serve as candidates for the initial value, with the facing angles for the other person.
[0069]
Further, in the present embodiment, when selecting the initial value, the distance between the speaker and the other person is also obtained and used together. FIG. 15 is a diagram illustrating the initial value selection unit 233 in more detail. As shown in FIG. 15, the initial value selection unit 233 includes: a synchrony determination unit 233-1 that determines the synchrony of voices from the information on the voice signals of the voices acquired from a plurality of wearers; a distance deriving unit 233-2 that, when it is determined that the synchrony of the voices exists and a voice is identified as the wearer's own speech, derives the distance between the speaker and the other person from the sound pressure of the voice acquired by the microphone disposed on the speaker and the sound pressure of the voice acquired by the microphone disposed on the other person; and an initial value determination unit 233-3 that selects the initial value by using the distance between the speaker and the other person derived by the distance deriving unit 233-2 and by comparing the values of the derived facing angles α, β, and γ with the values of the facing angles α, β, and γ stored in the LUT.
[0070]
FIG. 16 is a flowchart showing the operation of the initial value selection unit 233 in the present embodiment. Hereinafter, the operation of the initial value selection unit 233 of the present embodiment will be described with reference to FIGS. 15 and 16. First, the synchrony determination unit 233-1 determines the synchrony of the voices sent from the plurality of terminal devices 10 (step 301).
[0071]
Hereinafter, a method of determining the synchrony of voice information will be described. FIG.
17 is a view showing a state in which a plurality of wearers wearing the terminal device 10 of the
present embodiment are in conversation. FIG. 18 is a diagram showing an example of utterance
information of each of the terminal devices 10A and 10B in the conversation situation of FIG. As
shown in FIG. 17, it is assumed that two wearers A and B who are wearing the terminal device 10
are in conversation. At this time, voices uttered by the wearer A and the wearer B are captured by
both the terminal device 10A of the wearer A and the terminal device 10B of the wearer B.
[0072]
The terminal device 10A and the terminal device 10B independently transmit speech information
to the host device 20 (see FIG. 1). At this time, as shown in FIG. 18, the utterance information acquired from the terminal device 10A and the utterance information acquired from the terminal device 10B approximate each other in the features indicating the utterance situation, such as the length of the utterance times and the timing at which the speaker switches. Therefore, the host device 20 in this application example compares the information acquired from the terminal device 10A with the information acquired from the terminal device 10B, determines that these pieces of information indicate the same utterance situation, and recognizes that the wearer A and the wearer B are in conversation. That is, the synchrony of the utterance states of the wearer A and the wearer B can be determined. Here, as the information indicating the utterance situation, time information on the utterances is used, such as at least the length of the utterance time of each utterance of each speaker mentioned above, the start time and end time of each utterance, and the time (timing) at which the speaker switches. Only part of the time information on these utterances may be used to determine the utterance situation of a particular conversation, or other information may be used additionally.
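As an illustration of this comparison, the following Python sketch judges synchrony from utterance intervals reported by two terminal devices; the interval representation and the 0.5 overlap threshold are assumptions added for illustration, since the embodiment only requires that the timing information from the two terminals indicate the same utterance situation.

    def overlap_ratio(intervals_a, intervals_b):
        """Fraction of speech time in intervals_a that overlaps intervals_b.

        intervals_a, intervals_b: lists of (start, end) utterance times in seconds,
        as reported by two terminal devices for the same period.
        """
        total = sum(e - s for s, e in intervals_a)
        if total == 0:
            return 0.0
        shared = 0.0
        for s1, e1 in intervals_a:
            for s2, e2 in intervals_b:
                shared += max(0.0, min(e1, e2) - max(s1, s2))
        return shared / total

    def in_conversation(intervals_a, intervals_b, threshold=0.5):
        """Judge synchrony: both terminals report largely the same utterance timing."""
        return (overlap_ratio(intervals_a, intervals_b) >= threshold and
                overlap_ratio(intervals_b, intervals_a) >= threshold)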
[0073]
If it is determined that there is no synchrony with any of the plurality of wearers (No in step
301), the process returns to step 301. On the other hand, when it is determined that there is
synchrony for any of a plurality of wearers (Yes in step 301), next, the distance between the
wearer A and the wearer B is derived by the distance deriving unit 233-2. (Step 302).
[0074]
In the present embodiment, information on the sound pressure of the voice is used to derive the distance between the wearer A and the wearer B. Referring back to FIG. 17, consider the case where the wearer A is speaking, that is, the wearer A is the speaker (and the wearer B is the other person). Consider the sound pressure Lp1 of the voice of the wearer A (speaker) acquired by the microphone 11Aa of the wearer A, and the sound pressure Lp2 of the voice of the wearer A (speaker) acquired by the microphone 11Ba of the wearer B (the other person). When the distance between the mouth of the wearer A (speaker) and the microphone 11Aa is r1 and the distance between the mouth of the wearer A (speaker) and the microphone 11Ba is r2, the following equation (11) holds. Lp1 − Lp2 = 20 log(r2/r1) …(11)
[0075]
At this time, since Lp1 and Lp2 are included in the information transmitted from the terminal devices 10 and r1 is known in advance, r2 can be obtained from the above equation (11). That is, the distance r2 between the mouth of the wearer A (speaker) and the microphone 11Ba can be regarded as the distance between the wearer A (speaker) and the wearer B (the other person).
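Rearranging equation (11) gives r2 = r1 × 10^((Lp1 − Lp2)/20), so r2 follows directly once the two sound pressure levels and the known mouth-to-microphone distance r1 are available. A minimal sketch, assuming Lp1 and Lp2 are sound pressure levels in decibels and r1 is, say, 0.1 m:

    def speaker_to_other_distance(lp1_db, lp2_db, r1=0.1):
        """Distance r2 (m) from the speaker's mouth to the other person's microphone.

        lp1_db: sound pressure level of the speaker's voice at the speaker's own microphone (dB)
        lp2_db: sound pressure level of the same voice at the other person's microphone (dB)
        r1: known distance from the speaker's mouth to the speaker's microphone (m), assumed 0.1 m
        """
        # From eq. (11): Lp1 - Lp2 = 20 * log10(r2 / r1)
        return r1 * 10 ** ((lp1_db - lp2_db) / 20.0)

    # Example: a 20 dB drop corresponds to a tenfold distance, so r2 = 1.0 m when r1 = 0.1 m
    print(speaker_to_other_distance(74.0, 54.0))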
[0076]
The distance between the two can be derived by a method using a LUT in addition to the method
of calculation using a mathematical expression such as equation (11). That is, the relationship
between Lp1, Lp2 and r2 is put together as a LUT, and r2 can be obtained by referring to this
LUT.
[0077]
In the present embodiment, the self-other identification result is used to determine whether the voice acquired at this time is the voice of the wearer himself or herself or the voice of another person other than the wearer. That is, when a voice is identified as the wearer's own uttered voice (self-speech), the distance to the other person with the speaker as the reference is derived. By using this self-other identification result, the reference for deriving the distance becomes clear, and the distance between the speaker and the other person can be derived more accurately. When this self-other identification result is not used, it becomes unclear which person uttered the voice acquired by the microphone, and it becomes difficult to derive the distance accurately. Furthermore, in the present embodiment, when the sound acquired by the microphones is noise such as the sound of an air conditioner or construction noise, no terminal device 10 identifies it as the wearer's own utterance voice, so it can be judged as noise and eliminated. In this case, of course, the derivation of the distance is not performed.
[0078]
The initial value determination unit 233-3 then selects the initial value by using the distance between the speaker and the other person derived in this manner and by comparing the values of the derived facing angles α, β, and γ with the values of the facing angles α, β, and γ stored in the LUT (step 303). Specifically, for example, the values of the derived facing angles α, β, and γ are compared with the values stored in the LUT, and several candidate points whose facing angles α, β, and γ each differ relatively little from the derived values are selected as candidates. Then, from among the selected candidates, the candidate point whose distance from the speaker is closest to the distance between the speaker and the other person derived as described above is determined as the initial value.
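As an illustration, the two-stage selection of step 303 can be written as follows; the tolerance used to shortlist candidates by facing angle and the layout of the LUT entries are assumptions added for illustration.

    def select_initial_value(lut, derived_angles, derived_distance, angle_tol=0.1):
        """Pick an initial value (x, y, z) from the candidate-point LUT.

        lut: list of entries ((x, y, z), (alpha, beta, gamma), distance_from_speaker)
        derived_angles: facing angles (alpha, beta, gamma) derived from the measurements
        derived_distance: speaker-to-other distance derived from the sound pressures
        angle_tol: maximum per-angle difference (rad) for a candidate to be shortlisted
        """
        candidates = [
            entry for entry in lut
            if all(abs(a - b) <= angle_tol for a, b in zip(entry[1], derived_angles))
        ]
        if not candidates:
            candidates = lut   # fall back to all candidate points
        # Among the shortlisted points, take the one whose distance from the speaker
        # is closest to the derived speaker-to-other distance.
        best = min(candidates, key=lambda e: abs(e[2] - derived_distance))
        return best[0]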
[0079]
After the initial value is selected by the above method, the numerical calculation unit 235 performs numerical calculation using this initial value and solves the simultaneous ternary quadratic equations (8) to (10). The convergent solution obtained is the three-dimensional coordinates (x, y, z) of the point S, which is the position of the other person. In the present embodiment, the method of numerical calculation is not particularly limited, and a general method such as the Newton method or the bisection method can be used.
[0080]
According to the voice analysis system 1 of the present embodiment described in detail above, the three-dimensional position of the other person can be measured with a simpler configuration. Moreover, the three-dimensional position of the other person can be measured more accurately while the amount of calculation is kept from becoming huge.
[0081]
In the example described with reference to FIGS. 17 and 18, both the wearer A and the wearer B use the terminal device 10 described with reference to FIG. 2. However, the plurality of microphones 11a and 11c disposed at positions differing in distance from the wearer's mouth need not be disposed on the other person; the other person may have only one microphone, for example. That is, in the terminal device 10 in which the plurality of microphones 11a and 11c are arranged at positions differing in distance from the mouth, it can first be determined whether or not the wearer is speaking. Therefore, the wearer who uses this terminal device 10 can be defined as the reference for deriving the distance to the other person. And in order to derive the distance to the other person, it is not necessary for the other person to use the terminal device 10; it is sufficient if there is information on the sound pressure from a microphone disposed on the other person. That is, only one microphone need be disposed on the other person.
[0082]
In the above-described example, the self-other identification is performed by the terminal device 10. However, the present invention is not limited to this, and the host device 20 may perform the determination. In the voice analysis system 1 of this embodiment, for example, the self-other identification function performed by the voice analysis unit 15 in the configuration of FIG. 1 may instead be performed by the data analysis unit 23 of the host device 20. In this case, the data analysis unit 23 functions as a self-other identification unit that identifies whether a voice acquired by the microphones 11a and 11c is an utterance of the wearer or an utterance of a person other than the wearer.
[0083]
<Description of Program> The processing performed by the host device 20 in the present
embodiment is realized by cooperation of software and hardware resources. That is, a CPU (not
shown) in the control computer provided in the host device 20 executes a program for realizing
each function of the host device 20 to realize each of these functions.
[0084]
Therefore, the processing performed by the host device 20 can be regarded as a program that causes a computer to realize: a function of receiving information on voice signals of voices acquired by microphones which are worn by wearers and, for at least one of the wearers, are disposed at a plurality of positions differing in distance from the wearer's mouth; a function of determining the synchrony of the voices from the information on the voice signals acquired from the plurality of wearers; and a function of, when it is determined that the synchrony of the voices exists and a voice is identified as the wearer's own speech based on a comparison of the voice signals acquired by the plurality of microphones disposed at different distances from the wearer's mouth, deriving the distance to the other person with the speaker as the reference from the sound pressure of the voice acquired by the microphone disposed on the speaker and the sound pressure of the voice acquired by the microphone disposed on the other person.
[0085]
DESCRIPTION OF SYMBOLS 1... voice analysis system, 10... terminal device, 11a... first microphone, 11b... second microphone, 11c... third microphone, 15... voice analysis unit, 16... data transmission unit, 20... host device, 21... data reception unit, 23... data analysis unit, 30... apparatus main body, 40... strap, 231... time difference deriving unit, 232... facing angle deriving unit, 233... initial value selection unit, 233-1... synchrony determination unit, 233-2... distance deriving unit, 233-3... initial value determination unit