JP2015119248
Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2015119248
The present invention relates to a next-generation three-dimensional acoustic telephone system that transmits three-dimensional (stereophonic) sound over the Internet and provides a sense of presence not found in conventional telephone systems. A user of the stereophonic telephone system wears earphones with built-in two-channel microphones in the left and right ears and connects them to a portable terminal. The on-site sound collected by the microphones at one user's left and right ears is recorded and encoded by that user's portable terminal, and the voice data of each channel is sent to the other user's portable terminal over two lines of an existing IP telephone system. The other user's portable terminal decodes the data, and the earphones in that user's ears reproduce the left and right audio respectively. The next-generation stereophonic telephone system is thus realized as a two-way IP telephone system. [Selected figure] Figure 1
Stereophonic IP Phone with Binaural Recording
[0001]
The present invention relates to a next-generation Internet telephone system (hereinafter referred to as the stereophonic IP telephone system) that uses binaural recording to transmit three-dimensional sound (hereinafter, stereophonic sound) over the Internet. In other words, the stereophonic IP telephone system proposes a telephone service that uses stereophonic sound over Internet lines (hereinafter referred to as the stereophonic IP telephone service). As described in Non-Patent Document 1, stereophonic sound is a method in which sound existing in a given space is recorded and, at playback, information perceived by humans, such as the direction of the sound source and the distance to it, is reproduced three-dimensionally. When stereophonic sound is used in an IP telephone system, the person who listens to the reproduced sound (hereinafter, human 2) hears the same recorded sound that the person at the site where the sound is generated (hereinafter, human 1) hears, that is, the sound of that site (hereinafter, on-site voice). Human 2 is therefore expected to obtain a sense of presence, as if in the space of the site where the sound is generated. In particular, because the present invention provides the service by extending the IP telephone service of existing IP telephone systems, it considers a stereophonic IP telephone service that uses the transmission coding schemes currently used for IP telephony. When stereophonic sound is transmitted over an IP network, which is a best-effort network, service quality is degraded by the delay and loss of IP packets. The invention therefore clarifies how degradation of IP network communication quality affects the stereophonic IP telephone service and proposes improvements to its quality.
10-05-2019
[0002]
As prior art concerning stereophonic sound but unrelated to IP telephone systems, Non-Patent Document 2, for example, synthesizes speech from multiple points to construct a pseudo-stereophonic acoustic environment; it is not related to the object of the present invention. There is much research on stereophonic sound, but none that deals with transmitting stereophonic sound over IP networks. Meanwhile, Non-Patent Documents 3 and 4 describe conventional methods for evaluating the quality of IP telephone services, but these evaluation methods do not take stereophonic sound into account.
[0003]
Non-Patent Document 1: The Acoustical Society of Japan. New Dictionary of Acoustical Terms. Corona Publishing, July 2003.
Non-Patent Document 2: Shinya Iizuka, Kei Kikuri, Nobuhiko Naka. Surround voice transmission technology for mobile multipoint voice chat. NTT DOCOMO Technical Journal, Vol. 17, No. 2, pp. 25-29, July 2009.
Non-Patent Document 3: Nobuhiko Kitawaki. Mobile phone coding: voice coding, speech environment characteristics, speech quality. Journal of the Acoustical Society of Japan, Vol. 58, No. 12, pp. 780-785, 2002.
Non-Patent Document 4: Nobuhiko Kitawaki. Voice quality evaluation of IP phones. Journal of the Acoustical Society of Japan, Vol. 63, No. 11, pp. 680-685, 2007.
Non-Patent Document 5: ITU-T. G.711: Pulse Code Modulation (PCM) of Voice Frequencies. Nov. 1988.
Non-Patent Document 6: J. M. Valin. Speex: A Free Codec For Free Speech. Xiph.Org Foundation, 2002.
Non-Patent Document 7: Shizuo Nishiyama, Kazuo Iketani, Yoji Yamaguchi, Motoyoshi Okushima. Acoustic Vibration Engineering. Corona Publishing, 1979.
Non-Patent Document 8: Masaaki Nishimaki. Electroacoustic Vibrational. Corona Publishing, 1978.
[0004]
As described above, the prior art includes no next-generation Internet telephone system (hereinafter referred to as the stereophonic IP telephone system) that transmits three-dimensional sound (hereinafter, stereophonic sound) over the Internet with a sense of presence.
[0005]
The present invention proposes a stereophonic IP telephone system as a next-generation Internet telephone system that uses binaural stereophonic sound.
The stereophonic sound delivered by the stereophonic IP telephone service of the stereophonic IP telephone system can convey information that cannot be perceived from visual information such as 3D images, for example the positions of sound sources to the left, right, and rear. Realizing the stereophonic IP telephone system of the present invention therefore yields a next-generation Internet telephone service with a high sense of presence that has not been available before. Accordingly, the present invention aims to greatly improve quality of life (QOL) by providing new services on the Internet.
[0006]
To achieve the above object, the invention according to claim 1 is a two-way IP telephone system in which binaural recording is performed with the microphones of two-channel microphone-equipped earphones worn in the left and right ears of one person; that person's portable terminal encodes each channel into an encoded signal and transmits the encoded signals to the other person's portable terminal over two telephone lines on the Internet; and the other portable terminal decodes the received encoded signals and reproduces a binaural output through the two-channel microphone-equipped earphones worn in the other person's left and right ears. According to this invention, the on-site sound that the sender hears with the left and right ears can be transmitted over two lines of an existing IP telephone system and reproduced as-is at the receiver's left and right ears, so that a stereophonic IP telephone system can be built. The invention according to claim 2 is the IP telephone system according to claim 1, wherein the one and the other portable terminals each include a two-channel AD/DA converter, an equalizer, and a packet generator/packet receiver. According to this invention, recorded sound can be divided by frequency band, transmitted, and reproduced. The invention according to claim 3 is the IP telephone system according to claim 1 or 2, wherein voice data having band components in the frequency band of 1,000 Hz to 3,000 Hz is handled preferentially. According to this invention, the effect of packet loss in the IP telephone system can be minimized and speech intelligibility comparable to that of the on-site voice can be maintained. The invention according to claim 4 is a system that emphasizes, on the receiver side, the sound in the frequency band of 2,000 Hz to 3,000 Hz among the sound recorded on the sender side. According to this invention, the ability to localize sound sources in the front-rear direction, which is difficult to identify even in the on-site sound, can be improved.
[0007]
FIG. 1: Overall configuration of the stereophonic IP telephone system using binaural recording according to Embodiment 1 of the present invention
FIG. 2: Experimental apparatus for evaluating the effectiveness of the present invention
FIG. 3: Experimental results of the evaluation of the effectiveness of the present invention
FIG. 4: Experimental results for comparison with the evaluation of the effectiveness of the present invention
FIG. 5: Overall configuration of the stereophonic IP telephone system using binaural recording according to Embodiments 2 and 3 of the present invention
[0008]
(Embodiment 1) The features of Embodiment 1 which is the basic configuration of the
stereophonic IP telephone system of the present invention will be described below with reference
to the drawings.
[0009]
FIG. 1 shows the overall configuration of a stereophonic IP telephone system using binaural recording according to Embodiment 1 of the present invention.
One user of the stereophonic IP telephone system is referred to as human 1 and the other user as human 2.
Human 1 and human 2 wear microphone-equipped earphones in their left and right ears. The stereophonic IP telephone system is described here with human 1 as the sender and human 2 as the receiver. The voice uttered by human 1 and the surrounding sounds heard by human 1 are collected through the microphones worn in human 1's left and right ears and input to a portable terminal such as a smartphone (hereinafter, portable terminal).
Portable terminal 1 of human 1 records the collected voice signal by the binaural method and performs signal correction and encoding (hereinafter, encoding). The encoded voice is transmitted to portable terminal 2 of human 2 via the Internet: the voice signal travels from portable terminal 1 of human 1 to base station 1 in human 1's area, from base station 1 to base station 2 in human 2's area, and from base station 2 to portable terminal 2 of human 2. Portable terminal 2 decodes the audio signal into binaural audio, and the decoded audio is reproduced as a binaural output by the microphone-equipped earphones worn by human 2.
[0010]
In the stereophonic IP telephone system, microphone-equipped earphones are worn in both the left and right ears, and the voice signals are generated and transmitted as separate channels. Two lines of the existing IP telephone system are therefore used: the sound recorded at human 1's right ear is transmitted over one line and reproduced at human 2's right ear, and the sound recorded at human 1's left ear is transmitted over the other line and reproduced at human 2's left ear. Each of the left and right signals may be a monaural signal carried on a line of a conventional IP telephone system currently in use. Handling the left and right signals separately in this way yields stereophonic sound.
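The per-ear channel mapping described above can be sketched as follows. This is a minimal illustration, assuming that a binaural frame arrives from the two-channel AD converter as a list of (left, right) 16-bit sample pairs; the names `split_to_lines` and `merge_from_lines` are hypothetical and not part of the original description. What it shows is that the left and right signals are never mixed: each travels as an ordinary monaural stream on its own line and returns to the same ear.

```python
from typing import List, Tuple

# Assumed representation: one binaural frame as (left, right) sample pairs.
StereoFrame = List[Tuple[int, int]]


def split_to_lines(frame: StereoFrame) -> Tuple[List[int], List[int]]:
    """Split one binaural frame into two monaural streams, one per IP telephone line."""
    left_line = [left for left, _ in frame]     # carried on the first line
    right_line = [right for _, right in frame]  # carried on the second line
    return left_line, right_line


def merge_from_lines(left_line: List[int], right_line: List[int]) -> StereoFrame:
    """Recombine the two received monaural streams for the left and right earphones."""
    if len(left_line) != len(right_line):
        raise ValueError("left and right lines must carry frames of equal length")
    return list(zip(left_line, right_line))


frame = [(100, -20), (90, -15), (0, 7)]
left, right = split_to_lines(frame)
restored = merge_from_lines(left, right)
```

Because each line carries an unmodified monaural stream, an existing IP telephone encoder can be reused per line; mixing the two channels into one would discard the interaural differences that localization depends on.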
[0011]
The encoding and transmission of the collected sound in portable terminal 1 takes only about 5 ms, and the decoding of the received encoded signal in portable terminal 2 likewise takes about 5 ms. As long as the Internet IP telephone line is available, the sound collected at human 1's ears is therefore reproduced at human 2's ears almost immediately. Further, in the stereophonic IP telephone system of Embodiment 1, both users, human 1 and human 2, wear earphones with built-in two-channel microphones, so human 2 can also transmit to human 1. Two-way transmission and reception are therefore possible, and the system functions as a telephone.
[0012]
Three points must be considered in order to provide the stereophonic IP telephone service: the listener's ability to localize the sound source (hereinafter referred to as sound source localization ability), the stereophonic recording and reproduction method, and the stereophonic transmission coding scheme. Sound source localization ability is the ability of the listener, human 2, to perceive a sound image from the sound delivered by the stereophonic IP telephone service and to judge the spatial properties of the sound source. When the sound source is properly localized, the spatial properties of the sound source that the listener (human 2) perceives from the audio signal, namely its direction and distance, match the direction of the sound source and the distance to it as seen from the sender (human 1). Because the service is an IP telephone service, the binaural method is adopted as the stereophonic recording and reproduction method. In the binaural method, left and right two-channel sound is recorded with an artificial head model, called a dummy head, in which microphones are embedded at both ears, and is presented to the listener's two ears through headphones. Specifically, with earphones containing two-channel microphones, a user of the IP telephone service can collect, through the microphones in the earphones, sound equivalent to the left and right channels recorded with a dummy head, so recording is possible, and the same earphones can also play the sound back. With the binaural method, the sound that the sender (human 1), a user of the IP telephone service, hears with the left and right ears can be reproduced as-is at the left and right ears of the listener (human 2). The interaural differences caused by the head can also be reproduced, so the information normally used as cues for sound source localization is reproduced as well. Moreover, binaural recording creates and restores stereophonic sound easily and at low cost with microphone-equipped earphones, without a large number of speakers and microphones, which makes it well suited to telephone service. The stereophonic transmission coding scheme is selected from among the coding schemes used for IP telephones; for example, ITU-T G.711 (Non-Patent Document 5) or Speex (Non-Patent Document 6) is adopted. ITU-T G.711 is a voice coding method widely used in telephone services such as ISDN and fixed telephone networks, and its coding bit rate is fixed at 64 kb/s. Speex, on the other hand, is a voice coding method intended for telephone service applications using VoIP on IP networks.
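As an illustration of the G.711 option, the sketch below encodes each channel separately, so that left and right each occupy their own 64 kb/s line. It assumes the μ-law variant of G.711 and signed 16-bit linear PCM input, and follows the widely used reference companding routine (bias 0x84, eight segments); the function names are hypothetical, and Speex is not shown.

```python
from typing import List

# Reference mu-law companding constants (G.711, 16-bit linear input).
BIAS = 0x84
SEG_END = [0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF, 0x1FFF, 0x3FFF, 0x7FFF]

SAMPLE_RATE_HZ = 8000   # G.711 sampling rate
BITS_PER_SAMPLE = 8     # one mu-law code per sample


def linear_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit PCM sample to an 8-bit mu-law code."""
    if sample < 0:
        magnitude, mask = BIAS - sample, 0x7F
    else:
        magnitude, mask = sample + BIAS, 0xFF
    # Segment (chord) index: first segment whose end is >= the biased magnitude.
    seg = next((i for i, end in enumerate(SEG_END) if magnitude <= end), 8)
    if seg >= 8:  # out of range: clip to the largest code
        return 0x7F ^ mask
    code = (seg << 4) | ((magnitude >> (seg + 3)) & 0x0F)
    return code ^ mask


def ulaw_to_linear(code: int) -> int:
    """Expand an 8-bit mu-law code back to a signed 16-bit PCM sample."""
    code = ~code & 0xFF
    t = (((code & 0x0F) << 3) + BIAS) << ((code & 0x70) >> 4)
    return BIAS - t if code & 0x80 else t - BIAS


def encode_line(samples: List[int]) -> bytes:
    """Encode one ear's channel as an independent G.711 mu-law stream."""
    return bytes(linear_to_ulaw(s) for s in samples)


per_line_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE   # 64,000 b/s per line
binaural_bps = 2 * per_line_bps                   # left + right lines
```

G.711 samples at 8 kHz with one 8-bit code per sample, which is where the fixed 64 kb/s comes from; carrying both ears therefore needs about 128 kb/s of payload before packet headers are added.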
[0013]
(Confirmation of Effectiveness) The inventors confirmed the effectiveness of the stereophonic IP telephone system in the following experiment, which used human subjects to evaluate how the stereophonic coding scheme and the communication quality of the IP network affect sound source localization.
[0014]
FIG. 2 shows the experimental apparatus. A dummy head (model: KU-100) was placed in room 1 in place of human 1, with microphones at its left and right ears. A female announcement voice was output as the evaluation sound from a speaker (model: AT-SPB30) placed in front of the dummy head. The evaluation sound was collected through the two microphones at the dummy head's left and right ears and amplified by a microphone amplifier. The amplified evaluation sound was sent to the voice recording/coding terminal as left and right analog channel signals. The voice recording/coding terminal encoded each channel in linear PCM format and treated it as the original sound before transmission. The sound was then encoded for transmission and sent to the audio decoding/playback terminal in room 2, passing through a network emulator over the Internet IP telephone system. The coding conditions used for transmission were ITU-T G.711 (64 kb/s) and Speex at several coding bit rates. The audio decoding/playback terminal in room 2 was placed where the speaker in room 1 could not be heard directly. In room 2, each subject listened to the sound through headphones (model: ATH-T300) worn on the left and right ears, corresponding to the left and right ears of the dummy head, and evaluated the sound source localization.
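The description does not say how the network emulator degrades the stream beyond the packet loss rate reported with FIG. 3. A common simplification, assumed here, is independent (Bernoulli) loss per packet, applied separately to the left and right lines; real IP loss is often bursty. The zero-fill concealment and all names below are hypothetical and stand in for whatever concealment the codecs under test actually perform.

```python
import random
from typing import List, Optional

Frame = List[int]  # one packet's worth of linear PCM samples


def emulate_loss(frames: List[Frame], loss_rate: float, seed: int) -> List[Optional[Frame]]:
    """Drop each packet independently with probability loss_rate (None = lost)."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so an experimental run can be repeated
    return [None if rng.random() < loss_rate else f for f in frames]


def conceal(received: List[Optional[Frame]], frame_len: int) -> Frame:
    """Naive concealment: replace each lost packet with silence."""
    out: Frame = []
    for frame in received:
        out.extend(frame if frame is not None else [0] * frame_len)
    return out


# Left and right lines get independent loss patterns (different seeds).
left_frames = [[t] * 160 for t in range(100)]
right_frames = [[-t] * 160 for t in range(100)]
left_rx = emulate_loss(left_frames, 0.03, seed=1)
right_rx = emulate_loss(right_frames, 0.03, seed=2)
```

With independent draws, a packet lost on one line is usually received on the other, so a loss momentarily silences only one ear rather than both.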
[0015]
FIG. 3 shows concrete results for sound source localization ability evaluated with the experimental apparatus of FIG. 2. The horizontal axis is the actual sound source direction, that is, the direction of the speaker (the sound source) relative to the dummy head in room 1; the vertical axis is the sound source direction judged by the subjects in room 2, that is, the direction from which each subject heard the sound. With the speaker placed in front of the dummy head defined as 0°, eight directions were evaluated counterclockwise: 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°. That is, the right ear is at 90°, the rear at 180°, and the left ear at 270°. FIG. 3 shows the results for 24 subjects at an IP packet loss rate of 3%. For the right-ear direction (90°), 20 subjects judged the direction to match the actual sound source direction, and for the left-ear direction (270°), 22 did. The left-right direction of the actual sound source can therefore be identified.
[0016]
To verify the results of FIG. 3 by comparison, an additional experiment was conducted, with the results shown in FIG. 4. In this experiment, 24 blindfolded subjects were placed in room 1 in place of the dummy head, and the sound source directions were evaluated in the same way as in FIG. 3. FIG. 4 therefore shows the sound localization ability for the on-site voice, that is, for a person at the site who hears the stereophonic sound directly and experiences its realism. Comparing FIG. 3 and FIG. 4: for the right-ear direction (90°), the number of subjects whose judged direction matched the actual sound source direction was 21 in FIG. 4 and 20 in FIG. 3; for the left-ear direction (270°), it was 23 in FIG. 4 and 22 in FIG. 3. In the other directions as well, although there is variation due to individual differences, FIG. 4 (the on-site voice) and FIG. 3 (the stereophonic IP telephone system of this embodiment) are substantially consistent. From the above, the subjects could accurately perceive the left and right sound source directions, confirming the effectiveness of the stereophonic IP telephone system of Embodiment 1. The initial concern was that transmitting stereophonic sound over an IP network, which is a best-effort network, would degrade service quality through IP packet delay and loss. It was found that at a packet loss rate of 3% this poses no problem at the level of ordinary human conversation, such as the female announcement voice used as the evaluation sound. Therefore, ordinary IP telephone systems currently in use can serve as the stereophonic IP telephone system.
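The agreement counts quoted above can be restated as rates. The sketch uses only the counts given in the text (24 subjects; FIG. 3 at 3% packet loss and FIG. 4 on-site, for the 90° and 270° directions); counts for the other directions are not stated numerically, so they are left out rather than guessed. The names are illustrative.

```python
SUBJECTS = 24

# Subjects whose judged direction matched the actual direction, as reported.
matches = {
    "FIG. 3 (IP telephone, 3% loss)": {90: 20, 270: 22},
    "FIG. 4 (on-site)": {90: 21, 270: 23},
}


def agreement_rate(count: int, total: int = SUBJECTS) -> float:
    """Fraction of subjects whose judged direction equals the actual direction."""
    return count / total


for condition, by_direction in matches.items():
    for direction, count in sorted(by_direction.items()):
        print(f"{condition} {direction:>3} deg: {agreement_rate(count):.1%}")
```

This gives 83.3% and 91.7% for FIG. 3 against 87.5% and 95.8% for FIG. 4; each one-subject difference corresponds to about 4.2 percentage points (1/24).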
[0017]
(Embodiment 2) Embodiment 2 of the present invention relates to a method of improving sound quality. Regarding the frequency bands important to speech quality, Non-Patent Document 7 discloses that the band contributing to speech intelligibility is 250 Hz to 7,000 Hz, and that the important band is 250 Hz to 3,400 Hz. Further, Non-Patent Document 8 discloses that intelligibility can be kept at about 90% by passing only the range of 1,000 Hz to 3,000 Hz. Thus, although cutting off low frequencies has little effect on intelligibility, cutting off high frequencies significantly reduces the intelligibility of consonants. Embodiment 2 prevents deterioration of sound quality by preferentially handling audio data having band components of 1,000 Hz to 3,000 Hz.
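A band split of the kind Embodiment 2 relies on can be sketched with a direct discrete Fourier transform. This is an illustration only, assuming 8 kHz sampling (as in G.711) and short frames; a real terminal would more likely use a filter bank or an FFT, and the function and band names are hypothetical. The 1,000 Hz to 3,000 Hz priority band follows the text.

```python
import cmath
import math
from typing import Dict, List, Tuple


def band_split(frame: List[float], fs: float,
               edges: Tuple[float, float] = (1000.0, 3000.0)) -> Dict[str, List[float]]:
    """Split a frame into low / priority / high bands by masking DFT bins.

    The priority band is edges[0] <= |f| <= edges[1]; the three outputs sum to the input.
    """
    n = len(frame)
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    bands = {"low": [0j] * n, "priority": [0j] * n, "high": [0j] * n}
    for k, value in enumerate(spectrum):
        freq = min(k, n - k) * fs / n  # bin frequency, folding negative frequencies
        if freq < edges[0]:
            name = "low"
        elif freq <= edges[1]:
            name = "priority"
        else:
            name = "high"
        bands[name][k] = value
    # Inverse DFT per band; masks are symmetric in k and n - k, so results are real.
    return {
        name: [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
               for t in range(n)]
        for name, spec in bands.items()
    }
```

Because the three masks partition the DFT bins, the bands add back up to the original frame, so a receiver that gets every band loses nothing; only when bandwidth is short does the choice of which band to favor matter.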
[0018]
FIG. 5 shows the configuration of the Internet telephone system of Embodiment 2. Embodiment 2 adds the following configuration to the encoder/decoder in the portable terminal of FIG. 1. Specifically, on the sender side, the sound collected by the microphone-equipped earphones is sent to the portable terminal. In the portable terminal, a two-channel AD converter converts the analog signal to a digital signal. The sound image correction equalizer divides this into sound data for each frequency band. The encoder converts the voice data into transmission data for the IP telephone system. The packet generator then performs priority control for sound quality improvement, and the data of the prioritized frequency band is transmitted preferentially over the network to the receiver's portable terminal. In the receiver's portable terminal, the packet receiver processes the received data according to the frequency priority set by the priority control. The data received with priority is decoded into voice data by the decoder. Next, the sound quality/sound image correction equalizer amplifies the prioritized frequency band. A two-channel DA converter converts the result from a digital signal to an analog signal, which is reproduced through the receiver's two-channel microphone-equipped earphones. The frequency range given priority here is 1,000 Hz to 3,000 Hz. By giving priority to the frequency band that contributes to speech intelligibility, the intelligibility of speech can be improved while the deterioration of its sound quality is minimized. Although the configuration has been described for transmission from one sender to the other, in practice this is a two-way IP telephone system, so the AD and DA converters, the sound image correction equalizer, the sound quality/sound image correction equalizer, the encoder and decoder, and the packet generator and packet receiver are provided in the portable terminals of both the sender and the receiver. In Embodiment 2, of the functions of the sound image correction equalizer and the sound quality/sound image correction equalizer, the sound image correction function is not used.
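The priority control in the packet generator can be pictured as a priority queue. In this sketch, which assumes one packet per frequency band per channel, the 1,000 Hz to 3,000 Hz band is always released first; placing the high band ahead of the low band is an assumption based on the observation in Embodiment 2 that cutting high frequencies hurts consonant intelligibility more. The class, field, and band names are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List

# Assumed priorities: lower number = sent first.
BAND_PRIORITY = {"priority": 0, "high": 1, "low": 2}


@dataclass(order=True)
class BandPacket:
    priority: int
    seq: int                              # tie-breaker keeps FIFO order within a priority
    band: str = field(compare=False)
    channel: str = field(compare=False)   # "L" or "R"
    payload: bytes = field(compare=False)


class PacketGenerator:
    """Queue band packets and release them highest-priority first."""

    def __init__(self) -> None:
        self._heap: List[BandPacket] = []
        self._seq = itertools.count()

    def enqueue(self, band: str, channel: str, payload: bytes) -> None:
        packet = BandPacket(BAND_PRIORITY[band], next(self._seq), band, channel, payload)
        heapq.heappush(self._heap, packet)

    def send(self, budget: int) -> List[BandPacket]:
        """Release at most `budget` packets, e.g. when the line can carry only that many."""
        out: List[BandPacket] = []
        while self._heap and len(out) < budget:
            out.append(heapq.heappop(self._heap))
        return out
```

When the budget is large enough, every band goes out and the receiver can rebuild the full signal; when it is not, the packets that are held back are the ones least important to intelligibility.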
[0019]
(Embodiment 3) In Embodiments 1 and 2, the stereophonic IP telephone system is built by faithfully transmitting the sender's on-site voice to the receiver and reproducing it. However, FIGS. 3 and 4 show that at 0° (front) and 180° (rear), fewer subjects could accurately perceive the direction of the sound source than at 90° and 270° (left and right). The invention of Embodiment 3 therefore emphasizes and reproduces, within the recorded sound, the frequency band by which the brain recognizes sound as coming from behind, and thereby provides a stereophonic IP telephone system that gives the receiver a greater sense of presence. This stereophonic sound makes the experience more realistic for the receiver even without visual information from the sender's side.
[0020]
In general, it is known that high-frequency sound is harder to hear when it comes from behind than when it comes from the front. Embodiment 3 therefore emphasizes and reproduces the sound in the 2,000 Hz to 3,000 Hz band. The configuration of Embodiment 3 is the same as that of FIG. 5, with the sound image correction equalizer operating in both the transmitting and receiving portable terminals.
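The emphasis step of Embodiment 3 can be sketched as a gain applied to the 2,000 Hz to 3,000 Hz DFT bins. The description does not give an amount of emphasis, so the 6 dB default below is an assumed value, as are the 8 kHz sampling rate used in the example and the function name; a product equalizer would normally use a smooth filter rather than a hard-edged spectral mask.

```python
import cmath
import math
from typing import List


def emphasize_band(frame: List[float], fs: float, lo: float = 2000.0,
                   hi: float = 3000.0, gain_db: float = 6.0) -> List[float]:
    """Boost components between lo and hi Hz by gain_db (assumed value) via the DFT."""
    n = len(frame)
    gain = 10 ** (gain_db / 20)  # dB to linear amplitude
    spectrum = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    for k in range(n):
        freq = min(k, n - k) * fs / n  # same gain for +f and -f keeps the output real
        if lo <= freq <= hi:
            spectrum[k] *= gain
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


# Example: a 2,500 Hz component is boosted, a 500 Hz component is left unchanged.
fs = 8000.0
x = [math.sin(2 * math.pi * 2500 * t / fs) + math.sin(2 * math.pi * 500 * t / fs)
     for t in range(160)]
y = emphasize_band(x, fs)
```

Applying the gain only inside the band leaves the rest of the binaural signal, and hence the left-right cues already shown to work, untouched.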
[0021]
By combining the service of the present invention with video transmission (especially stereoscopic video transmission), location information systems, and the like, an even more realistic IP telephone service can be provided.
[0022]
DESCRIPTION OF SYMBOLS
1 Human 1
1-1 Microphone-equipped earphones (two channels) of human 1
1-2 Portable terminal of human 1
1-3 Base station in the area of human 1
2 Human 2
2-1 Microphone-equipped earphones (two channels) of human 2
2-2 Portable terminal of human 2
2-3 Base station in the area of human 2
3 Portable terminal of Embodiments 2 and 3
3-1 AD/DA converter
3-2 Sound quality/sound image correction equalizer
3-3 Encoder/decoder
3-4 Packet generator/packet receiver