DESCRIPTION JP2012109643
A sound reproduction system 10a (10b, 10c) includes a computer 18 (26, 34). When the computer 18 receives, from another computer 26, 34, voice data corresponding to another user's voice and angle data corresponding to the direction of that user's face relative to the position of the computer 18's own user, it convolves the received voice data with a voice filter according to the direction (angle) of the face indicated by the angle data. Then, the convolved voice data is output using the speaker array 20, which has a plurality of loudspeakers and is connected to the computer 18. Therefore, voice reflecting the direction of the face of the user of the other computer 26, 34 is reproduced. [Effect] Since the direction of the person who generated the sound can be recognized from the reproduced sound, conversation can proceed smoothly. [Selected figure] Figure 1
Sound reproduction system, sound reproduction apparatus and sound reproduction method
[0001]
The present invention relates to a sound reproduction system, a sound reproduction apparatus, and a sound reproduction method, and more particularly to a sound reproduction system, a sound reproduction apparatus, and a sound reproduction method using, for example, a microphone unit having a plurality of microphones and a speaker unit having a plurality of loudspeakers.
[0002]
Non-Patent Document 1 discloses an example of a conventional sound reproduction system of this kind. In the three-dimensional sound field communication system disclosed in Non-Patent Document 1, a boundary surface control (BoSC) reproduction system is realized, which reproduces sound data recorded by a 70-ch (channel) microphone array through a 62-ch loudspeaker array. Using it, users located in remote places can talk while sharing an acoustic space. Specifically, 62-ch sound field data in which an inverse filter has been convolved in advance is stored in a server. Two client machines (PCs) located at different places are connected to this server via a network such as the Internet or a LAN, and a three-dimensional sound field reproduction system is connected to each client machine. The server simultaneously transmits the reproduction sound field selected by the user to both sound field reproduction systems (speaker array systems). Voice data corresponding to the voice of the user of each sound field reproduction system is transmitted to the other client machine via the network. In each client machine, the voice data (1 ch) corresponding to the other user's voice is processed in real time, superimposed on the sound field data (62 ch), and output. Therefore, users present in different places can share the sound field data output from the server and talk.
[0003]
"1. Numerical Analysis Technology and Visualization / Audification, 1.7 Three-Dimensional Sound Field Communication System", Satoshi Enomoto, Acoustics Technology, No. 148, Dec. 2009, pp. 37-42
[0004]
However, in the three-dimensional sound field communication system of Non-Patent Document 1, in each client machine the voice data (1 ch) corresponding to the other user's voice is convolved with a voice filter prepared in advance and merely superimposed on the sound field data (62 ch) and output; it therefore cannot be recognized from the reproduced speech which direction the other user is facing while talking. Consequently, when three or more users talk over such a background-art 3D sound field communication system with a client machine and a sound reproduction system connected, it is difficult to recognize who is talking to whom. Because of this, conversation cannot proceed smoothly.
[0005]
Therefore, the main object of the present invention is to provide a novel sound reproduction
system, sound reproduction device and sound reproduction method.
[0006]
Another object of the present invention is to provide a sound reproduction system, a sound reproduction apparatus, and a sound reproduction method that allow the direction of the person who generated a sound to be recognized from the reproduced sound.
[0007]
The present invention adopts the following configuration in order to solve the above-mentioned
problems.
The reference numerals in parentheses, the supplementary explanations, and the like indicate correspondences with the embodiments described later in order to aid understanding of the present invention, and do not limit the present invention in any way.
[0008]
A first invention is a sound reproduction system comprising a plurality of sound reproduction devices each provided with a speaker array having at least a plurality of first loudspeakers, wherein each sound reproduction device comprises: filter storage means for storing voice filter data corresponding to a voice filter provided for each angle; sound detection means for detecting sound data corresponding to a sound generated by its user; angle detection means for detecting angle data corresponding to the direction in which the user generates the sound, with the direction of another user as a reference; data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device; first data receiving means for receiving sound data and angle data from another sound reproduction device; sound processing means for reading, from the filter storage means, the voice filter data corresponding to the angle indicated by the angle data received by the first data receiving means, and performing convolution processing on the sound data received by the first data receiving means using the voice filter corresponding to the read voice filter data; and sound output means for outputting the sound data subjected to the convolution processing by the sound processing means to the speaker array.
[0009]
In the first invention, the sound reproduction system (10) is provided with sound reproduction devices (18, 20; 26, 28; 34, 36) each comprising at least a speaker array (20, 28, 36) having a plurality of first loudspeakers (230).
Each sound reproduction device includes filter storage means, sound detection means, angle
detection means, data transmission means, first data reception means, sound processing means,
and sound output means.
The filter storage means stores voice filter data corresponding to a voice filter provided for each angle. The sound detection means detects sound data corresponding to the sound generated by the user, for example the user's voice or the sound of an instrument played by the user. The angle detection means detects angle data corresponding to the direction in which the user generates the sound, with the direction of the other user as a reference. The data transmission means transmits the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device. The first data receiving means
receives sound data and angle data from another sound reproduction device. The sound processing means reads from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received by the first data receiving means, and performs convolution processing on the sound data received by the first data receiving means using the voice filter corresponding to the read voice filter data. The sound output means outputs the sound data subjected to the convolution processing by the sound processing means to the speaker array.
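As a concrete illustration of this receive-side chain (read the filter for the received angle, convolve, output one channel per loudspeaker), the following is a minimal sketch; the angular resolution, file names and function names are assumptions for illustration, not taken from the patent.

```python
import numpy as np
from scipy.signal import fftconvolve

ANGLE_STEP = 10  # assumed angular resolution (deg) of the stored voice filters

# filter storage means: angle (deg) -> filter bank of shape (num_speakers, taps)
voice_filters = {a: np.load(f"filter_{a}deg.npy") for a in range(0, 360, ANGLE_STEP)}

def process_received(voice_1ch: np.ndarray, angle_deg: float) -> np.ndarray:
    """Sound processing means: convolve the received 1-ch voice data with the
    voice filter nearest the received angle, one output row per loudspeaker."""
    nearest = (round(angle_deg / ANGLE_STEP) * ANGLE_STEP) % 360
    bank = voice_filters[nearest]
    return np.stack([fftconvolve(voice_1ch, f) for f in bank])
```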
[0010]
According to the first invention, a voice filter is stored for each angle, and the voice data from another sound reproduction device is convolved with the voice filter corresponding to the angle indicated by the angle data from that device, so the speaker array can reproduce the sound as oriented in the direction indicated by that angle. For this reason, the direction of the person who generated the sound can be recognized from the reproduced sound. Therefore, the user of the speaker array can, for example, recognize who is talking to whom from the reproduced sound and can converse smoothly.
[0011]
A second invention is according to the first invention, wherein the voice filter is generated on the basis of impulse responses measured by a microphone array having a plurality of microphones, the microphone array being arranged in a predetermined orientation at a certain position, a second loudspeaker being arranged to face the microphone array, and a stimulation sound being generated from the second loudspeaker while it is rotated by a predetermined angle at a time.
[0012]
In the second invention, a microphone array having a plurality of microphones is arranged in a predetermined orientation at a certain position, and a second loudspeaker is arranged to face the microphone array. That is, the microphone array is placed in the position of a listener and the second loudspeaker in the position of a speaker. Then, while the second loudspeaker is rotated by the predetermined angle at a time, a stimulation sound is generated from it and the impulse response is measured by the microphone array. The transfer characteristic is obtained from the impulse response measured by each microphone, and a voice filter is generated for each rotation angle of the second loudspeaker.
[0013]
According to the second invention, since the voice filter is generated on the basis of impulse responses measured in advance at a certain place using the loudspeaker and the microphone array, a user conversing through the sound reproduction device can get the feeling of actually conversing at that place.
[0014]
A third invention is according to the second invention, wherein the second loudspeaker is arranged at a predetermined distance in the direction of a predetermined angle from the front direction of the microphone array.
[0015]
In the third invention, the second loudspeaker is arranged at a predetermined distance in the direction of the predetermined angle from the front direction of the microphone array. For example, when three parties in remote places talk using this sound reproduction device, the assumed virtual positional relationship places each user at a vertex of an equilateral triangle whose sides have a predetermined length. The second loudspeaker and the microphone array are therefore arranged to reproduce this positional relationship.
[0016]
According to the third invention, since the loudspeaker and the microphone array are arranged to reproduce a virtual positional relationship, a user employing a voice filter generated on the basis of the impulse responses measured in this positional relationship can get a sense of realism, as if the other party were speaking from that position in a certain place.
[0017]
A fourth invention is according to any one of the first to third inventions, wherein the microphone array is disposed in a certain sound field, and the system further comprises a server that records sound field data detected by the microphone array, performs convolution processing on the sound field data, and transmits it to each sound reproduction device; each sound reproduction device further comprises second data receiving means for receiving the sound field data transmitted from the server, and the sound output means superimposes the sound field data received by the second data receiving means on the sound data subjected to the convolution processing by the sound processing means and outputs the result to the speaker array.
[0018]
In the fourth invention, the microphone array is arranged in a certain sound field.
The sound reproduction system further comprises a server (12).
The server records sound field data detected by the microphone array, performs convolution
processing on the sound field data, and transmits the data to each sound reproducing device.
Each sound reproduction device further comprises second data receiving means. The second data receiving means receives the sound field data transmitted from the server. The sound output means superimposes the sound field data received by the second data receiving means on the sound data subjected to the convolution processing by the sound processing means, and outputs the result to the speaker array. Therefore, the sound from other sound reproduction devices is reproduced while a certain sound field is being reproduced.
[0019]
According to the fourth aspect of the invention, for example, users who are in conversation using
the sound reproduction device can talk while sharing the sound field.
[0020]
A fifth invention is according to the fourth invention, wherein the speaker array has a first predetermined number of first loudspeakers and the microphone array has a second predetermined number of microphones, and the system further comprises speaker selection means for selecting a third predetermined number, less than the first predetermined number, of first loudspeakers having high linear independence, and microphone selection means for selecting a fourth predetermined number, less than the second predetermined number, of microphones having high linear independence; the server records the sound field data using the fourth predetermined number of microphones and performs the convolution processing, and the sound output means outputs the sound field data received by the second data receiving means using the third predetermined number of first loudspeakers.
[0021]
In a fifth aspect, the speaker array has a first predetermined number of first loudspeakers, and
the microphone array has a second predetermined number of microphones.
The speaker selection means selects a third predetermined number of first loudspeakers having a
high degree of linear independence and less than the first predetermined number.
Similarly, the microphone selection means selects a fourth predetermined number of
microphones having a high degree of linear independence, which is less than the second
predetermined number. Therefore, the server records sound field data using the fourth
predetermined number of microphones and performs convolution processing. Further, the sound
output means outputs the sound field data received by the second data receiving means using a
third predetermined number of first loudspeakers.
[0022]
According to the fifth invention, since the number of loudspeakers and microphones used is
reduced, the processing load of convolution can be reduced and the amount of data transmission
can be reduced. Therefore, it is possible to share the sound field and to talk in real time. In
addition, since the loudspeakers and the microphones, which have high linear independence, are
respectively selected, there is no loss of sense of presence even if their number is reduced.
[0023]
A sixth invention is a sound reproduction device comprising: a speaker array having a plurality of loudspeakers; filter storage means for storing voice filter data corresponding to a voice filter provided for each angle; sound detection means for detecting sound data corresponding to a sound generated by a user; angle detection means for detecting angle data corresponding to the direction in which the user generated the sound, with the direction of another user as a reference; data transmission means for transmitting the sound data detected by the sound detection means and the angle data detected by the angle detection means to another sound reproduction device; data receiving means for receiving sound data and angle data from another sound reproduction device; sound processing means for reading from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received by the data receiving means, and performing convolution processing on the sound data received by the data receiving means using the voice filter corresponding to the read voice filter data; and sound output means for outputting the sound data subjected to the convolution processing by the sound processing means to the speaker array.
[0024]
A seventh invention is a sound reproduction method for a sound reproduction system comprising a plurality of sound reproduction devices, each including a speaker array having a plurality of loudspeakers and filter storage means for storing voice filter data corresponding to a voice filter provided for each angle, in which each sound reproduction device: (a) detects sound data corresponding to a sound generated by its user; (b) detects angle data corresponding to the direction in which the user generates the sound, with the direction of another user as a reference; (c) transmits the sound data detected in step (a) and the angle data detected in step (b) to another sound reproduction device; (d) receives sound data and angle data from another sound reproduction device; (e) reads from the filter storage means the voice filter data corresponding to the angle indicated by the angle data received in step (d), and performs convolution processing on the sound data received in step (d) using the voice filter corresponding to the read voice filter data; and (f) outputs the sound data subjected to the convolution processing in step (e) to the speaker array.
[0025]
Also in the sixth and seventh inventions, the direction of the person who generated the sound can be recognized from the reproduced sound.
[0026]
According to the present invention, since a voice filter according to the angle of the sound generator is used, the direction of the sound generator can be recognized from the reproduced sound. Therefore, for example, when a plurality of people in different places talk using the sound reproduction devices, it is possible to recognize from the reproduced sound who is talking to whom and to converse smoothly.
[0027]
The above object, other objects, features and advantages of the present invention will become
more apparent from the detailed description of the following embodiments given with reference
to the drawings.
[0028]
FIG. 1 is an illustrative view showing one example of a sound field sharing system of the present invention.
FIG. 2 is an illustrative view showing an example of the microphone array shown in FIG. 1.
FIG. 3 is an illustrative view showing an example of a speaker array system used in the sound field sharing system shown in FIG. 1.
FIG. 4 is an illustrative view for explaining the principle of sound field reproduction.
FIG. 5 is an illustrative view for explaining the Gram-Schmidt orthogonalization method.
FIG. 6 is a graph showing changes in the average value and the minimum value of the evaluation index when 24 loudspeakers are selected from 62, for each loudspeaker selected first.
FIG. 7 is an illustrative view showing the arrangement positions of the 24 selected loudspeakers when the 60th loudspeaker is selected first.
FIG. 8 is an illustrative view showing the arrangement positions of the eight microphones selected for the 24 selected loudspeakers.
FIG. 9 is a schematic view, seen from above, of the usage state of a speaker array system used in the sound field sharing system of FIG. 1.
FIG. 10 is an illustrative view showing the virtual positional relationship in the case of a conversation among three parties using the sound field sharing system shown in FIG. 1.
FIG. 11 is a top view of the real environment in which the impulse responses of voice according to the direction of the speaker's face were measured in the virtual positional relationship shown in FIG. 10.
FIG. 12 is a graph showing an impulse response detected by a certain microphone of the microphone array, and the same impulse response attenuated using a Hanning window.
FIG. 13 is an illustrative view showing the position and orientation of a speaker and a listener, and an illustrative view showing the same arrangement realized with a loudspeaker and a microphone array.
FIG. 14 is a graph showing the average angle error according to the subjects' subjective evaluation in experiments in the real environment and in the reproduction environment.
FIG. 15 is a bar graph showing the average angle error according to the subjects' subjective evaluation for each angle at which the speaker faces, in experiments in the real environment and in the reproduction environment.
FIG. 16 is a view showing the correlation of the average angle error according to subjective evaluation between the real environment and the reproduction environment.
[0029]
Referring to FIG. 1, the sound field sharing system 10 of this embodiment also functions as a
sound reproduction system, and includes a server 12. The server 12 is a general-purpose server,
and the microphone array 14 is connected to the server 12. The server 12 is also connected to
the computer 18, the computer 26 and the computer 34 via a network 16, such as the Internet
and / or a LAN. The computers 18, 26, 34 are general purpose PCs or workstations. A speaker
array system 20, a microphone 22 and a camera 24 are connected to the computer 18. Further, a
speaker array system 28, a microphone 30 and a camera 32 are connected to the computer 26.
The speaker array system 36, the microphone 38 and the camera 40 are also connected to the
computer 34.
[0030]
The sound field sharing system 10 shown in FIG. 1 includes three BoSC reproduction systems 10a, 10b and 10c. As surrounded by a dotted-line frame in FIG. 1, the BoSC reproduction system 10a is configured by the server 12, the microphone array 14, the network 16, the computer 18, the speaker array system 20, the microphone 22 and the camera 24. Further, as surrounded by a dashed-dotted-line frame in FIG. 1, the BoSC reproduction system 10b is configured by the server 12, the microphone array 14, the network 16, the computer 26, the speaker array system 28, the microphone 30 and the camera 32. Furthermore, as surrounded by a two-dot chain-line frame in FIG. 1, the BoSC reproduction system 10c is configured by the server 12, the microphone array 14, the network 16, the computer 34, the speaker array system 36, the microphone 38 and the camera 40.
[0031]
Here, the computer 18 and the speaker array 20, the computer 26 and the speaker array 28, and the computer 34 and the speaker array 36 each function as a sound reproduction device that reproduces the sound field data detected by the microphone array 14, the audio data from the other BoSC reproduction systems 10a, 10b, 10c, or both.
[0032]
As shown in FIG. 2, the microphone array 14 includes a nearly spherical frame 14a and a stand 14b supporting the frame 14a. The frame 14a is based on the structure of a C80 fullerene and has 70 vertices, the bottom 10 vertices having been cut away. Although not shown, one omnidirectional microphone is attached to each of the 70 vertices on the surface (outer surface) of the frame 14a. For example, a DPA 4060-BM can be used as the microphone. The stand 14b is constituted by a support shaft 140 and a tripod 142, and the support shaft 140 supports the ceiling of the frame 14a from the inside, passing through the cut-away bottom of the frame 14a.
[0033]
Since the frame 14a is an open skeleton, its rear side is actually visible from the front except where it overlaps the front side; for ease of understanding, the portion corresponding to the rear side is drawn with dotted lines in FIG. 2.
[0034]
In addition, as shown in FIG. 3, each of the speaker array systems 20, 28, 36 includes an elliptical dome portion 220 and four pillar portions 222 supporting it. The elliptical dome portion 220 is configured of, for example, four layers of wooden mounts 220a, 220b, 220c, 220d. Note that FIG. 3 views the inside of the dome portion 220 from diagonally below, and shows only part of the mount 220d and the pillar portions 222. Although not illustrated, the interiors of the dome portion 220 and the pillar portions 222 are hollowed out, and the mounts (220a to 220d) themselves serve as closed-chamber type enclosures.
[0035]
Also, 70 loudspeakers 230 are installed in each of the speaker array systems 20, 28, 36. Specifically, six full-range units (Fostex FE83E), that is, loudspeakers 230, are installed on the mount 220a, sixteen loudspeakers 230 on the mount 220b, twenty-four loudspeakers 230 on the mount 220c, and sixteen loudspeakers 230 on the mount 220d. Furthermore, two subwoofer units (Fostex FW108N), also counted as loudspeakers 230, are installed in each of the four pillars 222 in order to compensate for the low range.
[0036]
Such speaker array systems 20, 28, 36 are each installed in a sound field reproduction room (not shown). The sound field reproduction room is a 1.5-size soundproof room; a YAMAHA Woody box (sound insulation performance Dr-30) is used. In addition, a chair with a lift (not shown) is provided in the sound field reproduction room. This chair is located inside the dome portion 220 of the speaker array system 20, 28, 36 and is used to set the position (height) of the ears of the user sitting in it at the height of the mount 220c, where the number of loudspeakers 230 is greatest.
[0037]
As to the sound field reproduction room (sound field reproduction system) including the microphone array 14, the computers (18, 26, 34) and the speaker array systems (20, 28, 36), it is disclosed in Non-Patent Document 1 ("1. Numerical Analysis Technology and Visualization / Audification, 1.7 Three-Dimensional Sound Field Communication System", Satoshi Enomoto, Acoustics Technology, No. 148, Dec. 2009, pp. 37-42), so further detailed explanation is omitted.
[0038]
For example, in the sound field sharing system 10 shown in FIG. 1, the microphone array 14 is
arranged in a sound field such as a performance room of an orchestra. The server 12 converts a
sound field signal input from the microphone array 14 through an amplifier (not shown) into
digital sound field data, and performs inverse system convolution processing on the sound field
data. The server 12 transmits the sound field data on which the convolution process has been
performed to the computers 18, 26 and 34 via the network 16.
[0039]
The computers 18, 26, 34 respectively convert the sound field data from the server 12 into analog sound field signals and output them to the speaker array systems 20, 28, 36. Therefore, the above-described sound field is reproduced in the speaker array systems 20, 28, 36. Thus, each user (not shown) of the speaker array systems 20, 28, 36 can, for example, enjoy a live orchestra performance recorded at the venue through the speaker array systems 20, 28, 36 even while present at a remote location.
[0040]
Also, each user can input voice through the microphones 22, 30, 38. The audio signal detected by the microphone 22 is converted into digital audio data by the computer 18 and transmitted to the computers 26, 34 via the network 16. The computer 26 convolves the received audio data with an audio filter, converts the result into an analog audio signal, and outputs it to the speaker array system 28. Similarly, the computer 34 convolves the received audio data with an audio filter, converts the result into an analog audio signal, and outputs it to the speaker array system 36. More precisely, each of the computers 26 and 34 superimposes the sound field data and the audio data, and converts the superimposed data (hereinafter referred to as "sound data") into an analog signal (hereinafter referred to as a "sound signal"); the same applies below. Therefore, the sound field is reproduced, and the voices of the other users are reproduced as well.
[0041]
Further, the audio signal detected by the microphone 30 is converted into digital audio data by the computer 26 and transmitted to the computers 18 and 34 via the network 16. The computer 18 convolves the received audio data with an audio filter, converts the result into an analog audio signal, and outputs it to the speaker array system 20. Similarly, the computer 34 convolves the received audio data with an audio filter, converts the result into an analog audio signal, and outputs it to the speaker array system 36. That is, each of the computers 18 and 34 converts sound data in which the sound field data and the audio data are superimposed into a sound signal.
[0042]
Further, the audio signal detected by the microphone 38 is converted into digital audio data by the computer 34 and transmitted to the computers 18 and 26 via the network 16. As described above, each of the computers 18 and 26 convolves the received audio data with an audio filter, converts the result into an analog audio signal, and outputs it to the speaker array system 20 or 28, respectively.
[0043]
Therefore, the user of the speaker array system 20, the user of the speaker array system 28, and
the user of the speaker array system 36 can share the sound field and talk among three parties.
[0044]
In addition, although a detailed description is omitted, a headset microphone can be used as each of the microphones 22, 30, 38, for example.
[0045]
Also, although a detailed description is omitted, each computer 18, 26, 34 convolves the audio data from each of the other computers 18, 26, 34 using a separate audio filter. For example, each computer 18, 26, 34 can identify the other computers 18, 26, 34 by the communication port and IP address used.
[0046]
Here, the principle of BoSC and a sound field reproduction system using BoSC will be briefly described. In boundary surface control, based on the Kirchhoff-Helmholtz integral equation (KHIE), the sound field in a region V of the original sound field, shown on the left side of FIG. 4, is reproduced in a region V′ of the reproduced sound field, shown on the right side of FIG. 4. Here the relative positions of a recording point r on the boundary S surrounding the region V and a control point r′ on the boundary S′ surrounding the region V′ are equal; that is, Equation 1 is assumed to hold, where the point s and the point s′ are arbitrary points inside the respective regions.
[0047]
[Equation 1] |r − s| = |r′ − s′|, s ∈ V, s′ ∈ V′
At this time, the sound pressures p(s) and p(s′) in regions containing no sound source are given, from the KHIE, by Equation 2 and Equation 3 respectively.
[0048]
[0049]
[0050]
Here, ω is the angular frequency, ρ0 is the density of the medium, p(r) and vn(r) are respectively the sound pressure at the point r on the boundary and the particle velocity in the direction of the normal n, and G(r|s) is the free-space Green's function.
[0051]
Here, according to Equation 1, the relationship shown in Equation 4 is established.
Furthermore, according to Equation 4, Equation 5 is established.
[0052]
[0053]
[0054]
It can be seen that if signals are output from the secondary sound sources such that the sound pressure and the particle velocity picked up on the boundary surface S in the original sound field are reproduced equally on the boundary S′ in the reproduced sound field, the sound field in the region V is reproduced in the region V′.
[0055]
The output of each secondary sound source is determined by convolving the signal observed at the recording points with an inverse filter that cancels the transfer characteristics from all the secondary sound sources to all the control points. Therefore, in order to realize a BoSC sound field reproduction system as shown in FIG. 4, it is important to design a stable and robust inverse filter (pinv(H)).
[0056]
The design method of the inverse filter is disclosed in detail in the literature (S. Enomoto et al., "Three-dimensional sound field reproduction and recording systems based on boundary surface control principle", Proc. of 14th ICAD, Presentation No. 16, June 2008), and will only be briefly described here.
[0057]
A method of designing, in the frequency domain, a multichannel multipoint-control inverse system (hereinafter simply referred to as an "inverse system") with M secondary sound sources and N control points, as shown in FIG. 4, will be briefly described. Here, the inverse system is a generic name for the M × N group of inverse filters.
[0058]
Let Hji(ω) be the transfer function from the secondary sound source i to the control point j, Xj(ω) the input signal, and Pj(ω) the observed signal; these relations are expressed by Equation 6, where i is the secondary sound source number (1, 2, ..., M), j is the control point number (1, 2, ..., N), and [W(ω)] is the inverse system.
[0059]
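The equation image for this paragraph is missing from the translation. Given the definitions above and the relation Y(ω) = [W(ω)]X(ω) that appears with Equation 10 below, Equation 6 is presumably the matrix relation (offered as an assumed reconstruction, not the patent's verbatim typography):

```latex
% Assumed reconstruction of Equation 6:
P(\omega) = [H(\omega)]\,[W(\omega)]\,X(\omega) \qquad (6)
```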
[0060]
At this time, in order to obtain P(ω) = X(ω), Equation 7 must be satisfied:
[Equation 7] [W(ω)] = [H(ω)]+
where + denotes the pseudo-inverse matrix. By this, [W(ω)] is defined as the inverse system of [H(ω)].
[0061]
Here, it is well known that regularization is a rational method for solving inverse problems, and it has already been applied to sound reproduction systems (Tokuno et al., "Inverse Filter of Sound Reproduction Systems Using Regularization", IEICE TRANS. FUNDAMENTALS, Vol. E80-A, No. 5, May 1997, etc.). The inverse matrix [Ŵ(ω)] for rank([H(ω)]) = N calculated by the regularization method is given by Equation 8, where # denotes the conjugate transpose, −1 the matrix inverse, β(ω) a regularization parameter, and IM an M × M identity matrix; the same applies below.
[0062]
[0063]
On the other hand, the inverse matrix [H(ω)]+ for rank([H(ω)]) = M, shown on the right side of Equation 7, is derived as Equation 9.
[0064]
[0065]
Equations 8 and 9 are interpreted as a least-squares solution and a least-norm solution (norm-minimal generalized inverse matrix), respectively. When rank([H(ω)]) = N = M, [H(ω)] is not a singular matrix, and [W(ω)] = [H(ω)]−1 is obtained. The time-domain inverse filter coefficients are obtained from the inverse discrete Fourier transform of [Ŵ(ω)].
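The equation images for Equations 8 and 9 are not reproduced either. Standard regularized least-squares and least-norm forms consistent with the symbol descriptions above (offered as an assumed reconstruction) are:

```latex
% Assumed reconstruction of Equations 8 and 9:
[\hat{W}(\omega)] = \bigl([H(\omega)]^{\#}[H(\omega)]
                    + \beta(\omega)\, I_M \bigr)^{-1} [H(\omega)]^{\#} \qquad (8)

[H(\omega)]^{+} = [H(\omega)]^{\#} \bigl([H(\omega)][H(\omega)]^{\#}\bigr)^{-1} \qquad (9)
```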
[0066]
In the BoSC reproduction system, the arrangement of the loudspeakers 230 of the speaker array
system (20, 28, 36) and the arrangement of the microphones of the microphone array 14 affect
spatial sampling.
[0067]
In Equations 8 and 9, the instability of the inverse system can be mitigated (removed) by selecting an appropriate regularization parameter β(ω). In this embodiment, the regularization parameter β(ω) is defined for each frequency band observed. Furthermore, the inverse filter was calculated using impulse responses measured in advance in a soundproof room between each pair consisting of a loudspeaker 230 and a microphone of the microphone array 14. Because pre-measured impulse responses were used, the filter does not track fluctuations caused by environmental changes; in a fluctuating real environment, however, a Multiple-Input Multiple-Output (MIMO) adaptive inverse filter can be applied to the BoSC reproduction system.
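As an illustration of the frequency-domain regularization just described, the following sketch computes time-domain inverse filters from measured impulse responses via Equation 8. It is a minimal sketch assuming a fixed regularization constant and illustrative array shapes, not the implementation used in the embodiment.

```python
import numpy as np

def inverse_filters(h: np.ndarray, beta: float = 1e-3) -> np.ndarray:
    """h: measured impulse responses, shape (N_mics, M_speakers, taps).
    Returns time-domain inverse filters, shape (M_speakers, N_mics, 2*taps)."""
    n_mic, m_spk, taps = h.shape
    H = np.fft.rfft(h, n=2 * taps, axis=-1)            # per-frequency (N, M)
    W = np.empty((m_spk, n_mic, H.shape[-1]), dtype=complex)
    I = np.eye(m_spk)
    for f in range(H.shape[-1]):
        Hf = H[:, :, f]
        # regularized pseudo-inverse: (H^# H + beta I_M)^-1 H^#  (Equation 8)
        W[:, :, f] = np.linalg.solve(Hf.conj().T @ Hf + beta * I, Hf.conj().T)
    # inverse DFT gives the time-domain inverse filter coefficients
    return np.fft.irfft(W, axis=-1)
```

In practice β(ω) would be set per frequency band, as the embodiment describes, rather than held constant.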
[0068]
Here, if the microphone array 14 and the speaker array systems 20, 28, 36 shown in FIGS. 1 to 3 are used as they are, the processing load on the server 12 is considerably large. Specifically, since the microphone array 14 has 70 channels and the speaker array systems 20, 28, 36 have 62 channels each, the server 12 must perform 62 × 70 convolution operations between the sound field signals (sound field data) of the individual microphones of the microphone array 14 and the inverse system, and each convolution must be performed over the number of taps of the inverse system (in this embodiment, 2048 points × 2 = 4096 taps).
[0069]
Further, since the amount of sound field data to be transmitted is enormous, each client (computers 18, 26, 34) needs a bandwidth of about 45 Mbps.
[0070]
Furthermore, even when the computers 18, 26, 34 perform the convolution of voice data corresponding to a user's voice with a voice filter, the processing load becomes relatively large if all 70 channels are used.
[0071]
Therefore, it is difficult to transmit the sound field data in real time from the server 12 to the computers 18, 26, 34, and, needless to say, it is also difficult for the users of the speaker array systems 20, 28, 36 to enjoy the orchestra and the like in real time. In other words, the sound field cannot be shared in real time, and conversation in real time is likewise impossible.
[0072]
To avoid this, it is conceivable to reduce the processing load of the convolution and the amount of data to be transmitted by, for example, reducing the number of microphones used in the microphone array 14 and the number of loudspeakers 230 used in the speaker array systems 20, 28, 36. However, the numbers of microphones and loudspeakers 230 must not merely be reduced; the sense of presence of the reproduced sound field must not be impaired.
[0073]
Therefore, in this embodiment, the numbers of microphones and loudspeakers 230 used are reduced without losing the sense of presence.
[0074]
In this embodiment, first, given the 70-ch microphone array 14, the loudspeakers 230 to be used in each speaker array system (20, 28, 36) are extracted (selected) using the Gram-Schmidt orthogonalization method. Then, given the selected loudspeakers 230, the microphones to be used in the microphone array 14 are extracted (selected), again using Gram-Schmidt orthogonalization.
[0075]
Although a detailed description is omitted, the extraction (selection) of the loudspeakers 230 and microphones to be used can be performed using the server 12, the computers 18, 26, 34, or another computer not shown.
[0076]
Here, the basic algorithm for selecting the loudspeakers 230 using the Gram-Schmidt orthogonalization method is described for a single frequency. If the linear independence of the N-dimensional column vectors contained in the N × M matrix is low, the matrix is said to be ill-conditioned. Deterioration of the linear independence in [H(ω)] causes instability of the BoSC reproduction systems 10a, 10b, 10c. Here, [H(ω)] in Equation 6 can be written as Equation 10.
[0077]
[Equation 10] P(ω) = [H(ω)]Y(ω) = {h1(ω), ..., hM(ω)} Y(ω)
Here, Y(ω) = [W(ω)]X(ω), and hi(ω) are the N-dimensional column vectors contained in [H(ω)]. Each column vector hi(ω) consists of the transfer functions between one loudspeaker 230 and each microphone of the microphone array 14 at the frequency ω. Therefore, selecting loudspeakers 230 by Gram-Schmidt orthogonalization means selecting from [H(ω)] a set of column vectors h(ω) with high linear independence. The algorithm of the Gram-Schmidt orthogonalization method is briefly described below.
[0078]
At the n-th step of selecting the loudspeakers 230, n − 1 loudspeakers 230 have already been selected. The set of column vectors contained in [H] is denoted τ = {h1, ..., hM}. Sn−1 denotes the subset of vectors selected up to the (n−1)-th step, and τn−1 denotes the subset of vectors unused up to the (n−1)-th step. Vn−1 = {v1, ..., vn−1} denotes the orthonormal basis of the plane spanned by the subset Sn−1.
[0079]
For example, in the first step, one of all the loudspeakers 230 is selected as a reference loudspeaker 230, and all the loudspeakers 230 other than the reference loudspeaker 230 are treated as evaluation target loudspeakers 230. As described later, according to the Gram-Schmidt orthogonalization method, one evaluation target loudspeaker 230 is then selected from the plurality of evaluation target loudspeakers 230 in relation to the reference loudspeaker 230. In the next step, again according to the Gram-Schmidt orthogonalization method, one evaluation target loudspeaker 230 is selected from the remaining evaluation target loudspeakers 230 in relation to the initially selected reference loudspeaker 230 and the evaluation target loudspeaker 230 selected in the previous step; that is, in this step the evaluation target loudspeaker 230 selected in the previous step can be regarded as a reference loudspeaker 230. This is repeated.
[0080]
However, the eight loudspeakers 230 that compensate for the low frequency range are excluded
from the reference loudspeaker 230 and the evaluation target loudspeaker 230.
[0081]
FIG. 5 is an example of a plane spanned by the subset Sn-1.
In the n-th step, ĥn (here "^" denotes a circumflex written above the symbol, as in Equation 11; the same applies below) is chosen from the subset τn−1 so that its component perpendicular to the plane spanned by the subset Sn−1 is maximized. The perpendicular component ri of an arbitrary vector hi contained in the subset τn−1 is expressed by Equation 11.
[0082]
Here, P represents the projection onto the plane spanned by the subset Sn−1. The n-th loudspeaker 230 is determined such that the norm of the perpendicular component ri is maximized, as shown in Equation 12.
[0083]
[0084]
Here J(hi), the value of the evaluation index, is defined by Equation 13.
[0085]
[Equation 13] J(hi) = ||ri||
When the perpendicular component of the selected vector is denoted r̂n (r with a circumflex; the same applies below), the n-th orthonormal vector vn is determined according to Equation 14.
[0086]
[0087]
The value Ĵn of the evaluation index maximized at the n-th step (J with a circumflex; the same applies below) is given by Equation 15.
[0088]
[0089]
The process according to Equations 11 to 15 is repeated until the value Ĵn of the evaluation index becomes smaller than a preset threshold Ĵthr. For a frequency band [ωl, ωh], the values of two evaluation indices are obtained according to Equation 16.
[0090]
[0091]
Here h̄i = {hi(ωl), ..., hi(ωh)} (h with an overbar, as in Equation 16), K is the number of discrete frequencies ωk, and ak denotes an arbitrary weighting factor for the discrete frequency ωk.
The perpendicular component ri(ωk) and the orthonormal vector vi(ωk) are obtained separately for each discrete frequency, as in the single-frequency case. In the optimization process, the evaluation value Javg is maximized, while the value Jmin of the evaluation index is used to determine the end of the optimization process; that is, the selection of loudspeakers 230 ends when Ĵmin < Ĵthr.
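Since the equation images for Equations 11 to 16 are not reproduced in this translation, the following rendering is offered as an assumed reconstruction from the surrounding prose (P denotes the projection onto the plane spanned by Sn−1):

```latex
% Assumed reconstruction of Equations 11-16:
r_i = h_i - \mathbf{P} h_i, \quad h_i \in \tau_{n-1}                     \qquad (11)
\hat{h}_n = \arg\max_{h_i \in \tau_{n-1}} \lVert r_i \rVert              \qquad (12)
J(h_i) = \lVert r_i \rVert                                               \qquad (13)
v_n = \hat{r}_n / \lVert \hat{r}_n \rVert                                \qquad (14)
\hat{J}_n = \max_{h_i \in \tau_{n-1}} J(h_i)                             \qquad (15)
J_{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} a_k\, J\!\left(h_i(\omega_k)\right), \quad
J_{\mathrm{min}} = \min_{k} J\!\left(h_i(\omega_k)\right)                \qquad (16)
```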
[0092]
The optimization process is disclosed in the literature (Asano, Suzuki, and Swanson, "Optimization of control source configuration in active control systems using Gram-Schmidt orthogonalization", IEEE Transactions on Speech and Audio Processing, Mar. 1999).
[0093]
In that document, the selection of loudspeakers 230 continues while the value of the evaluation index is equal to or above the threshold (Ĵmin ≥ Ĵthr); however, no method for determining an appropriate threshold is identified. Therefore, in this embodiment, the maximum numbers of loudspeakers 230 in the speaker array systems (20, 28, 36) and of microphones in the microphone array 14 with which the sound field sharing system 10 can share the sound field in real time were verified, and the numbers (arrangement positions) of the loudspeakers 230 up to that maximum number were then determined using the Gram-Schmidt orthogonalization method.
[0094]
Here, as described above, in the Gram-Schmidt orthogonalization method each speaker position is determined on the basis of the previously selected speaker positions, so the selection result is strongly influenced by the speaker position selected first.
[0095]
For example, cases were considered in which the number of loudspeakers 230 used was reduced to about one half (32), about one third (24), and about one quarter (16). FIG. 6 shows the changes in the values Javg and Jmin of the evaluation index when 24 loudspeakers 230 are selected (when the 24-step selection process is performed). In FIG. 6, the horizontal axis indicates the speaker position (see FIG. 7) of the loudspeaker 230 selected first (the reference loudspeaker 230), and the vertical axis indicates the evaluation value (dB). Of the two lines, one indicates the value Javg of the evaluation index and the other the change of the value Jmin of the evaluation index.
[0096]
Although a detailed description is omitted, the reference loudspeaker 230 selected first is changed sequentially from number 1 (see FIG. 7) through 2, 3, ..., 62, and for each case a set of 24 speaker positions (numbers of loudspeakers 230) is selected and the values Javg and Jmin of the evaluation index are calculated for that set. The selected sets of 24 speaker positions (numbers of the loudspeakers 230) and the values Javg and Jmin calculated for each set are stored in the memory (not shown; hard disk or RAM) of the computer described above. Then, as described later, among the plurality of sets, the one whose evaluation index values Javg and Jmin satisfy a predetermined condition is selected, and the sound field is reproduced using that selected set of 24 loudspeakers 230.
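The selection procedure of paragraphs [0078] to [0091], together with the outer loop over first-selected loudspeakers just described, can be sketched as follows. This is a minimal single-frequency sketch with assumed names and array shapes, not the implementation used in the embodiment.

```python
import numpy as np

def select_speakers(H: np.ndarray, first: int, n_select: int, j_thr: float):
    """Greedy Gram-Schmidt selection. H: (N_mics, M_speakers) transfer matrix;
    first: index of the reference loudspeaker selected first."""
    selected = [first]
    basis = [H[:, first] / np.linalg.norm(H[:, first])]
    remaining = [i for i in range(H.shape[1]) if i != first]
    while len(selected) < n_select and remaining:
        B = np.stack(basis, axis=1)                   # orthonormal basis so far
        # perpendicular component r_i of each unused column (Equation 11)
        R = H[:, remaining] - B @ (B.conj().T @ H[:, remaining])
        norms = np.linalg.norm(R, axis=0)             # J(h_i) = ||r_i||  (Eq. 13)
        best = int(np.argmax(norms))                  # Equation 12
        if norms[best] < j_thr:                       # stop when J_n < J_thr
            break
        basis.append(R[:, best] / norms[best])        # Equation 14
        selected.append(remaining.pop(best))
    return selected

# Outer loop of [0096]: try each loudspeaker as the first choice and keep the
# set whose Javg is largest, subject to the Jmin condition described in [0099].
```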
[0097]
Also, the free-space Green's function was used to obtain the transfer functions between each loudspeaker 230 of the speaker array system (20, 28, 36) and the microphones of the microphone array 14. The upper frequency limit of the stimulus described later was not restricted here; however, the configuration (selection) of the loudspeakers 230 was determined over the range of 20 Hz to 1 kHz at steps of 20 Hz. Although illustration is omitted, when the upper limit frequency is not restricted, many loudspeakers 230 located in the upper layers (the mounts 220a and 220b) are selected. In a three-dimensional sound field reproduction system it is difficult to synthesize wavefronts arriving from a direction in which there are no loudspeakers 230 at all; thus, loudspeakers 230 should be located in every possible direction surrounding the microphone array 14.
[0098]
As described above, FIG. 6 is a graph showing the values Javg and Jmin of the evaluation index when the 24-step selection process is performed for the loudspeakers 230. As can be understood from FIG. 6, the evaluation index values Javg and Jmin are largest when the loudspeaker 230 at speaker position 60 (see FIG. 7) is selected first and 24 loudspeakers 230 in all are selected.
[0099]
In this embodiment, among the plurality of sets (62 sets in this embodiment), the set of 24 loudspeakers 230 whose evaluation index values Javg and Jmin satisfy a predetermined condition is selected. Specifically, the set having the largest value Javg of the evaluation index is selected. However, when the value Jmin of the evaluation index for the set with the largest Javg is extremely low, there is a frequency with low linear independence, so it is not appropriate to choose that set even though its Javg is largest; it is considered that the sound field could not be reproduced correctly. In such a case, the set having the next largest value Javg is selected; if the value Jmin for that set is also extremely low, the set with the next largest Javg after that is selected, and so on. For example, the computer determines whether the evaluation index value Jmin is extremely low on the basis of a preset threshold. The threshold is a value set by the developer or user of the sound field sharing system 10. However, although illustration is omitted, the evaluation index values Javg and Jmin gradually decrease as the number of loudspeakers 230 to be selected increases, so the threshold also needs to be set variably according to the number of loudspeakers 230 to be selected.
[0100]
According to preliminary test results, the performance of the server 12, the performance of the computers 18, 26, 34, and the communication speed limits of the network 16 indicate that the number M of loudspeakers 230 in the speaker array system (20, 28, 36) and the number N of microphones of the microphone array 14 should be determined so that the number of elements in [W(ω)] stays within M × N = 192. Therefore, since the number M of loudspeakers 230 is determined to be 24 as described above, the number N of microphones selected is at most 8.
[0101]
However, in this embodiment, the server 12 and the computers 18, 26, 34 each use two quad-core Xeon (registered trademark) CPUs (not shown) and 4 GB of memory (not shown). Windows (registered trademark) XP 64-bit was adopted as the operating system of the server 12. As the network 16 connecting the server 12 and the computers 18, 26, 34, an ultra-high-speed high-performance research and development testbed network (JGN2plus: 1 Gbps) and a LAN (100 Mbps) were used.
[0102]
Although illustration is omitted, in the preliminary experiment the server 12 and the computer 18 were connected using the above-mentioned LAN, and the server 12 and the computers 26 and 34 were connected using the above-mentioned JGN2plus and LAN.
[0103]
FIGS. 7(A) and 7(B) show, as described above, the distribution of the positions of the 24 loudspeakers 230 selected when the loudspeaker 230 at speaker position 60 is selected first. FIG. 7(A) is a schematic view of the arrangement of the loudspeakers 230 seen from directly above, and FIG. 7(B) is a schematic view seen from the side; that is, FIG. 7(A) shows the horizontal distribution of the loudspeakers 230 and FIG. 7(B) shows their vertical distribution.
[0104]
As understood from FIG. 7(B), in the distribution shown in FIG. 7(A) the value in the height direction (Z direction) increases toward the center. That is, the speaker positions of the loudspeakers 230 provided on the mount 220a are 1 to 6, those on the mount 220b are 7 to 22, those on the mount 220c are 23 to 46, and those on the mount 220d are 47 to 62.
[0105]
The eight loudspeakers 230 provided in the four pillars 222 to compensate for the low range are not shown in FIGS. 7(A) and 7(B) because they are not targets of selection.
[0106]
Further, in FIGS. 7(A) and 7(B), the negative direction of the Y axis is the front, which the user's face is toward, and the positive direction of the Y axis is the rear, toward the back of the user's head. Furthermore, as shown in FIG. 7(A), the negative direction of the X axis is to the user's right and the positive direction of the X axis is to the user's left. Then, as shown in FIG. 7(B), the negative direction of the Z axis is below the position of the user's ears and the positive direction of the Z axis is above the position of the user's ears.
[0107]
In FIG. 7(A), a shaded pattern is applied to the circle (the circle labeled 60) indicating the speaker position of the loudspeaker 230 selected first. A hatched pattern is applied to the circles indicating the speaker positions of the loudspeakers 230 subsequently selected by the repetition based on the Gram-Schmidt orthogonalization method (here 1 to 6, 7, 9, 11, 13, 15, 17, 19, 21, 23, 31, 35, 48, 51, 54, 56, 58 and 62). The unmarked circles indicate the speaker positions of the loudspeakers 230 that were not selected.
[0108]
Further, in FIG. 7(B), different figures (circles, triangles, squares and rhombuses) are used depending on the Z-direction position of the loudspeakers 230. In FIG. 7(B), the speaker position of the loudspeaker 230 selected first is indicated by a black figure, and the speaker positions of the loudspeakers 230 selected second and later are indicated by gray figures.
[0109]
From FIGS. 7(A) and 7(B) it can be seen that the selected loudspeakers 230 are distributed regularly in every direction and height. As shown in FIG. 7(A), when the distribution of the loudspeakers 230 is viewed in plan from directly above, the selected loudspeakers 230 are distributed substantially symmetrically in both the longitudinal and the lateral direction. The same holds when the distribution of the loudspeakers 230 is viewed from the side as in FIG. 7(B).
[0110]
The microphones were selected by applying the Gram-Schmidt orthogonalization method described above with the roles of the loudspeakers 230 of the speaker array system (20, 28, 36) and the microphones of the microphone array 14 interchanged. Since the selection method using the Gram-Schmidt orthogonalization method has already been described, redundant description is omitted.
[0111]
FIG. 8 shows the arrangement of the eight microphones selected for the arrangement of 24 loudspeakers 230 shown in FIGS. 7(A) and 7(B). Although illustration is omitted, the positions of the microphones are numbered in the same manner as the speaker positions of the loudspeakers 230. Although it is somewhat difficult to see in FIG. 8, when the XY plane is viewed in plan from directly above, the selected microphones are distributed uniformly in all directions.
[0112]
Thus, the numbers of microphones and loudspeakers 230 were reduced using the Gram-Schmidt orthogonalization method, and in order to evaluate the influence of this reduction a sound source localization test in the horizontal plane was performed. The method and the results of this sound source localization test are disclosed in "Optimization of loudspeaker and microphone configurations for sound reproduction systems based on the boundary surface control principle - an optimizing approach using Gram-Schmidt orthogonalization and its evaluation", published by the inventors et al. in August 2010, so their description is omitted. As described above, as a result of the sound source localization test, and owing to the limits of the performance of the server 12 and the like and of the communication speed, the number of loudspeakers 230 was determined to be 24 and the number of microphones to be 8.
[0113]
Although a detailed description is omitted, the sound field signals detected by the selected microphones are provided from the microphone array 14 to the server 12. The microphones that were not selected are disabled; that is, the server 12 does not detect sound field signals from unselected microphones. Likewise, the computers 18, 26, 34 output sound field data and audio data only to the selected loudspeakers 230.
[0114]
As described above, in this embodiment, in each of the speaker array systems 20, 28, 36, audio data corresponding to the voices generated by the other users is output (reproduced) together with the sound field data. However, if the computers 18, 26, 34 simply convolve the audio data received from the other computers 18, 26, 34 with a voice filter that takes no account of the direction of the speaker's face, it is difficult to recognize who is talking to whom. For example, the speaker could conceivably utter his own name and the name of the listener (the other party) every time, but that would not be a natural conversation.
[0115]
Therefore, in this embodiment, a voice filter that takes the direction of the speaker's face (the direction of speech) into consideration is used. Briefly, a voice filter is used that takes into account the transfer characteristics of the voice according to that direction.
[0116]
Although they are omitted in FIG. 3, the BoSC reproduction systems 10a, 10b, 10c have cameras 24, 32, 40, respectively, as shown in FIG. 1. As shown in FIG. 9, the camera 24 is attached to the mount 220d of the speaker array system 20 so that its lens (shooting direction) faces the user when the user of the speaker array system 20 faces the front.
[0117]
In FIG. 9, the 24 loudspeakers 230 selected as described above are shown schematically, surrounding the user evenly.
[0118]
Also, similarly to the camera 24, the cameras 32 and 40 are attached to the mounts 220d of the speaker array systems 28 and 36, respectively.
[0119]
Furthermore, as mentioned above, the users wear the headset microphones 22, 30, 38. This is to prevent, as far as possible, the sound output from the loudspeakers 230 from being picked up by the microphones 22, 30, 38, so that only the voice produced by the user is detected.
[0120]
The computers 18, 26, and 34 analyze the images (face images) captured by the cameras 24, 32, and 40 connected to them to determine the direction of the user's face, that is, the angle of the face with respect to the front direction. Since methods for obtaining the direction of the face from a face image are already known, their description is omitted; for example, the technology disclosed in JP-A-10-274516 can be used.
[0121]
However, the angle data transmitted to the other computers 18, 26, 34 expresses the direction of the speaker's face relative to the position of the other user (the listener). Therefore, after the direction of the face is obtained from the face image, it is converted into an angle that takes the direction of the other user's position as the reference (0°).
[0122]
In order to reflect the angle detected in this way in the reproduced voice, the transfer characteristics of the voice are measured and, as described above, a voice filter that takes those transfer characteristics into account is used. In this embodiment, the transfer characteristics of the voice are measured under a simplifying assumption: the three parties using the sound reproduction system 10 are assumed to be located, in a certain space, at the vertices of an equilateral triangle whose sides have a predetermined length (2 m).
[0123]
That is, as shown in FIG. 10, the users A, B, and C are located at the vertices of an equilateral triangle with sides of 2 m, and the front direction of each user is set from that user's vertex toward the opposite side. In this virtual positional relationship, when user A speaks to user B, user A turns 30° to the right of the front direction; when user A speaks to user C, user A turns 30° to the left. Although the description is omitted, the same applies to users B and C.
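As a check on this geometry (an editorial illustration; the names are not from the patent), the following sketch places A, B, and C at the vertices of a 2 m equilateral triangle, points each front direction at the opposite side, and computes the turn angle from a speaker's front direction to a listener:

import numpy as np

L = 2.0  # side length in meters
P = {"A": np.array([0.0, 0.0]),
     "B": np.array([L, 0.0]),
     "C": np.array([L / 2, L * np.sqrt(3) / 2])}

def front_direction(user):
    # front = from the user's vertex toward the midpoint of the opposite side
    others = [p for k, p in P.items() if k != user]
    return (others[0] + others[1]) / 2 - P[user]

def turn_angle(speaker, listener):
    # signed angle (degrees) from the speaker's front direction to the
    # direction of the listener; positive = counterclockwise (to the left)
    f = front_direction(speaker)
    d = P[listener] - P[speaker]
    a = np.arctan2(d[1], d[0]) - np.arctan2(f[1], f[0])
    return np.degrees((a + np.pi) % (2 * np.pi) - np.pi)

print(turn_angle("A", "B"))   # about -30: user A turns 30 deg to the right
print(turn_angle("A", "C"))   # about +30: user A turns 30 deg to the left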
[0124]
To reproduce this virtual positional relationship, the transfer characteristics of speech were measured at an actual place. FIG. 11 is a top view of the environment in which the transfer characteristics were measured: a small conference room with a rectangular floor 10 m wide and 3.9 m deep. As can be seen from FIG. 11, however, the upper-left portion of the rectangle is slightly recessed inward.
[0125]
In the small meeting room, a loudspeaker 50 and a microphone array 52 for measuring the transfer characteristics of voice are arranged. As the loudspeaker 50, a speaker capable of reproducing a sound approximating human speech (for example, a Yamaha MSP-3) is used. As the microphone array 52, the same array as the microphone array 14 described above is used; different reference numerals are attached only to distinguish its use in the sound reproduction system 10 from its use in measuring the transfer characteristics of speech.
[0126]
As can also be seen from FIG. 11, the microphone array 52 is placed at the center of the lower wall of the small meeting room. Taking the front direction of the microphone array 52 as the upward direction, the loudspeaker 50 is placed at a position rotated 30° to the left, such that when the front of the loudspeaker 50 faces the microphone array 52, the distance from the front of the loudspeaker 50 to the center of the microphone array 52 is 2 m. The loudspeaker 50 is then rotated through one full turn (360°) in 15° increments at that position. At every 15° step, a sweep sound is output from the loudspeaker 50 as the stimulus, and the impulse response detected by each microphone m (m = 1, 2, ..., M) of the microphone array 52 is recorded as the transfer characteristic H_ang[m]. In this embodiment, as described above, M = 70. Here, ang is an angle that simulates the directivity of the sound source, expressed with respect to the front direction of the users A, B, and C described above; in this embodiment, the loudspeaker 50 is rotated counterclockwise in 15° increments. As the sweep sound, a signal of up to 24 kHz created using the Time Stretched Pulse (TSP) method was used. The reverberation time of this small meeting room is about 0.6 seconds.
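As an editorial illustration of this kind of measurement, here is a minimal sketch assuming a TSP stimulus and FFT-based deconvolution; the sampling rate, stretch parameter, and the synthetic "room" are assumptions, and the response recovered for rotation angle ang at microphone m would be stored as H_ang[m].

import numpy as np

fs = 48000                    # sampling rate; fs/2 = 24 kHz as in the text
N = 1 << 15                   # TSP length in samples (assumed)
m_stretch = N // 4            # stretch parameter (assumed)
k = np.arange(N // 2 + 1)
spec = np.exp(-1j * 4.0 * np.pi * m_stretch * (k / N) ** 2)
tsp = np.fft.irfft(spec, N)   # time-stretched pulse stimulus

def impulse_response(recorded):
    """Recover an impulse response from a recording of the TSP by
    regularized matched-filter deconvolution."""
    n_fft = len(recorded) + N
    R = np.fft.rfft(recorded, n_fft)
    S = np.fft.rfft(tsp, n_fft)
    return np.fft.irfft(R * np.conj(S) / (np.abs(S) ** 2 + 1e-12), n_fft)

# toy check: a synthetic "room" with a direct path and one reflection
h_true = np.zeros(2048)
h_true[40] = 1.0
h_true[400] = 0.4
h_est = impulse_response(np.convolve(tsp, h_true))[:2048]
print(np.max(np.abs(h_est - h_true)))   # should print a very small number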
[0127]
The reason the loudspeaker 50 is rotated in steps of 15° is that the angular resolution of human hearing is approximately 20°.
[0128]
That is, in the situation shown in FIG. 11, the loudspeaker 50 is regarded as the speaker, and the listener is assumed to be positioned so that the listener's head (ear height) is at the center of the inside of the microphone array 52; the transfer characteristics are measured under this assumption. Therefore, to detect the transfer characteristics H_ang[m] for all cases of the virtual positional relationship shown in FIG. 10, it would be necessary to measure H_ang[m] with the positions of the loudspeaker 50 and the microphone array 52 exchanged, or with the loudspeaker 50 moved to the position shown by the dotted line (a position rotated 30° to the right of the front direction of the microphone array 52), or with the arrangement exchanged at that dotted-line position. In this embodiment, however, for simplicity, the transfer characteristics H_ang[m] measured only with the loudspeaker 50 and the microphone array 52 at the positions shown by solid lines in FIG. 11 are used in every case.
[0129]
FIG. 12 shows, as a dotted line, the waveform of an impulse response detected by one microphone of the microphone array 52 (referred to here as the original impulse response, to distinguish it from the attenuated impulse response described later). The original impulse response includes both early reflections and late reflections. As mentioned above, the small conference room shown in FIG. 11 has a non-negligible reverberation time, so the response takes time to decay; to reproduce it exactly, the length of the inverse filter would exceed 2048 points, making real-time processing impossible. Therefore, in this embodiment, a Hanning window is applied so that the length of the inverse filter does not exceed 2048 points. The impulse response attenuated by the Hanning window is shown as a solid line in FIG. 12. The Hanning window is centered on the direct sound of the impulse response recorded by each microphone. As can be seen from FIG. 12, the attenuated impulse response retains most of the early reflections and excludes the late reflections. Even when the transfer characteristic H_ang[m] based on the attenuated impulse response is used, the positional relationship between the speaker and the listener can be reproduced almost accurately, as if the user were speaking in the small meeting room shown in FIG. 11.
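A minimal sketch of this windowing step follows (an editorial illustration; it assumes the direct sound is simply the largest-magnitude sample, which the source does not state):

import numpy as np

def truncate_with_hanning(ir, length=2048):
    """Attenuate late reflections by applying a Hanning window of the
    given length, centered on the direct sound (taken here as the
    largest-magnitude sample), so the inverse filter stays short."""
    center = int(np.argmax(np.abs(ir)))      # assumed direct-sound index
    win = np.hanning(length)
    start = center - length // 2
    out = np.zeros(length)
    for i in range(length):
        j = start + i
        if 0 <= j < len(ir):
            out[i] = ir[j] * win[i]
    return out

# usage on the toy response from the previous sketch:
# h_short = truncate_with_hanning(h_est, 2048)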
[0130]
Although not shown, each of the computers 18, 26, 34 stores data (transfer characteristic data) corresponding to the transfer characteristics H_ang[m] in memory (hard disk or RAM). Each computer reads out the transfer characteristic data corresponding to the angle ang indicated by the angle data transmitted from the other computers 18, 26, 34, and reproduces the audio signal using a voice filter that takes the corresponding transfer characteristic H_ang[m] into account. A voice having directivity is thereby reproduced.
[0131]
Here, a concrete description is given. Let S be the acoustic signal (in this embodiment, the audio signal corresponding to the voice produced by the user) recorded by the single microphone 22 (30, 38). Let G_inv[s, i] be the inverse filter for secondary sound source speaker s (s = 1, 2, ..., N) and control point i (i = 1, 2, ..., M) in the BoSC reproduction system. The arrangement of the control points i is congruent with the microphone array 52, so m = i holds. The secondary sound source speakers s are the loudspeakers 230, and in this embodiment N = 24.
[0132]
As shown in FIG. 13A, if θ is the direction in which the listener is seen from the speaker, and α is the direction in which the speaker is facing, then the direction (angle) of the speaker's face relative to the listener is α − θ. When the speaker and the listener shown in FIG. 13A are replaced by the loudspeaker 50 and the microphone array 52 described above, the situation is as shown in FIG. 13B. Therefore, when speech including the speech direction is reproduced using the transfer characteristic H_ang[m] for the angle ang = α − θ, the output signal R(s) from the secondary sound source s in the BoSC reproduction system is expressed as shown below, where V[s] is the voice filter that takes the transfer characteristic H_ang[m] into account.
[0133]
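The formula of this paragraph appears only as an image in the source and is lost in this machine translation. From the surrounding definitions it presumably has the following form; this reconstruction is an editorial assumption, not the original expression, with $\ast$ denoting convolution:

$$R(s) \;=\; \sum_{i=1}^{M} G_{\mathrm{inv}}[s,i] \ast H_{\mathrm{ang}}[i] \ast S \;=\; V[s] \ast S, \qquad \mathrm{ang} = \alpha - \theta.$$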
[0134]
That is, the computers 18, 26, 34 store data (voice filter data or transfer characteristic data) corresponding to the voice filter V[s] or the transfer characteristic H_ang[m] for each angle in internal memory such as RAM or a hard disk, and convolve the received voice data with the voice filter V[s] corresponding to the angle indicated by the angle data received from the other computers 18, 26, 34. However, as described above, since the transfer characteristic H_ang[m] is measured in 15° steps, the voice filter V[s] is also available only in 15° steps. Therefore, when selecting the voice filter V[s] according to the angle indicated by the angle data, the closest of 0°, 15°, ..., 330°, 345° is chosen. If the angle indicated by the angle data lies exactly halfway between two adjacent grid angles, such as 7.5° or 22.5°, the voice filter V[s] corresponding to one of the two angles is selected according to a predetermined rule: for example, selecting the one closer to the previously used angle, selecting the smaller (or larger) angle, or selecting randomly. Whichever rule is adopted, since the resulting error is smaller than the angle that human hearing can discriminate, as described above, no inconvenience arises.
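As an editorial sketch of this selection and convolution step, assuming the per-angle filters are kept in a dictionary keyed by the 15° grid angles; the tie-breaking shown implements the "previous angle, else smaller angle" variant, and all names are hypothetical:

import numpy as np

STEP = 15
ANGLES = list(range(0, 360, STEP))               # 0, 15, ..., 345

def pick_filter_angle(ang, prev=None):
    """Quantize a received angle (degrees) to the 15-degree grid.
    Exact midpoints (7.5, 22.5, ...) are broken toward the previously
    used angle if given, otherwise toward the smaller grid angle."""
    ang = ang % 360.0
    lo = (int(ang // STEP) * STEP) % 360
    hi = (lo + STEP) % 360
    d_lo = abs(ang - lo)
    if abs(d_lo - STEP / 2) < 1e-9:              # exact midpoint
        if prev in (lo, hi):
            return prev
        return lo                                 # "smaller angle" rule
    return lo if d_lo < STEP / 2 else hi

def speaker_outputs(voice, V):
    """Convolve the 1-ch voice signal with each loudspeaker's filter.
    V is an (n_speakers, filter_len) array; returns (n_speakers, out_len)."""
    n = voice.shape[0] + V.shape[1] - 1
    S = np.fft.rfft(voice, n)
    return np.fft.irfft(np.fft.rfft(V, n, axis=1) * S, n, axis=1)

# usage sketch: filters[ang] is a (24, 2048) array per grid angle
# ang = pick_filter_angle(received_angle, prev=last_angle)
# outputs = speaker_outputs(received_voice, filters[ang])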
[0135]
As described above, in this embodiment, the voice filter V[s] incorporating the transfer characteristic H_ang[m] is generated from the impulse responses measured in the small conference room shown in FIG. 11. Users of the speaker array systems 20, 28, 36 can therefore obtain a sense of presence as if they were talking in that small meeting room, positioned at the vertices of an equilateral triangle with 2 m sides.
[0136]
Therefore, if the impulse responses are measured at another place, a sense of presence can be obtained as if the users were speaking at that place. For example, if impulse responses are measured and a voice filter generated in the audience area of an orchestra venue where the microphone array 14 is placed, one can obtain the sense of presence of conversing while listening to a live orchestra at that venue.
[0137]
Here, in order to subjectively evaluate the reproduction of the speaker's face angle in the voice, the following experiment was conducted. As the stimulus (stimulus sound) output from the loudspeaker 50, a recording of a man in his thirties saying a general greeting (here, "Hello") was used. The subjects were ten Japanese people in their twenties and thirties, five women and five men.
[0138]
Also, in this experiment, voices were reproduced in two environments described later, namely the actual environment (hereinafter "real environment") and the sound field reproduction system using the speaker array system 20 (or 28 or 36) (hereinafter "reproduction environment"). In both environments, the angle was varied counterclockwise from 0° to 90° in 15° steps. The 0° position corresponds to the front of the loudspeaker 50 (the speaker's face) facing the microphone array 52 (the listener, i.e., the subject). This angle range makes it possible to determine whether the listener can acoustically perceive whom the speaker is addressing in the assumed three-party relationship (virtual positional relationship).
[0139]
As mentioned above, in this example, subjective evaluation was performed in two environments. The first is a subjective evaluation of the case where the voice is reproduced using the rotating loudspeaker 50 in the real environment. The second is a subjective evaluation of the case where the voice is reproduced in the reproduction environment, changing the angle within the above-mentioned range using the voice filter V[s] described above.
[0140]
First, in the experiment for the first subjective evaluation, the loudspeaker 50 was rotated to random angles in the real environment, at the same place and under the same conditions as when the impulse responses were measured. As mentioned above, the loudspeaker 50 used to measure the impulse responses for the voice filter was also used for voice reproduction in the real environment. The subject evaluated at the position where the microphone array 52 had been placed during the impulse response measurement. The subject was allowed to rotate his or her head during the experiment, and adjusted the height by sitting on a chair or the like so that the ears were at the height of the center of the spherical frame (14a in FIG. 2) of the microphone array 52. Furthermore, a curtain was placed between the subject and the loudspeaker 50 so that the subject could not see the loudspeaker.
[0141]
Measurements with a sound pressure level meter showed that the effect of the curtain on the sound field was slight. In addition, since the output power of the loudspeaker 50 was adjusted by a person other than the subject, the volume was unaffected by the angle of the face and by the two environments (real environment and reproduction environment).
[0142]
In the experiment for the second subjective evaluation, the stimulus sound was output using the computer 18 (26 or 34) and the speaker array system 20 (28 or 36), with the voice filter V[s] described above, so that the angle changed from 0° to 90° in 15° increments.
[0143]
The subject was informed of the position of the loudspeaker 50 before being asked the direction of the voice. In the experiment, the direction of the voice was changed by rotating the loudspeaker 50 counterclockwise from 0° to 90° in 15° increments, and then in reverse (clockwise) from 90° to 0° in 15° increments, while the subject listened. For each question, the subject first heard the voice at the 0° position and then heard the voice at the angle under test twice. Since the direction of the voice changes in 15° steps between 0° and 90°, the subject had to select one direction (angle) out of seven. The seven voice directions were tested in random order for each subject, and each subject answered a total of 14 questions across the real and reproduction environments.
[0144]
In each environment, the angular error is defined as follows. In the real environment, it is the absolute error between the angle the loudspeaker 50 was facing and the angle answered. In the reproduction environment, it is the absolute error between the direction (angle) of the reproduced voice and the angle answered. FIG. 14 shows a box plot of the mean angular error over all subjects in each environment. As shown in FIG. 14, the mean angular errors in the real and reproduction environments are 13.7° and 20.8°, respectively. Considering the virtual positional relationship between the three parties shown in FIG. 10 (each user placed at a vertex of an equilateral triangle), the mean angular error in the reproduction environment can be said to be small enough for a listener to perceive who is speaking to whom.
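Stated as a formula (an editorial restatement, not from the source): if $\theta_{\mathrm{pres}}$ is the presented direction and $\theta_{\mathrm{ans}}$ the answered direction, the angular error of one trial is $E = \lvert \theta_{\mathrm{ans}} - \theta_{\mathrm{pres}} \rvert$, and FIG. 14 plots the mean of $E$ over all trials and subjects for each environment.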
[0145]
However, there is a difference of 7.1° between the mean angular errors of the two environments. A two-tailed t-test shows that this difference is statistically significant (p < 0.05). It is therefore harder for subjects to perceive the angle of the speech direction in the reproduction environment than in the real environment. Indeed, most subjects commented that perceiving the angle of the speech direction was more difficult in the reproduction environment, and attributed the difference to the length of the reverberation. In an ordinary room with sound reflections, such as the conference room used in the experiment, the late reflections are considered to contribute significantly to perceiving the direction the speaker is facing.
[0146]
FIG. 15 is a bar graph showing the mean angular error for each angle the speaker faces (here, the angle the loudspeaker 50 faces, or the angle of the speech direction reproduced by the speaker array system 20 (28, 36)). The bars with a grid pattern show the mean angular error in the real environment, and the bars with diagonal lines show the mean angular error in the reproduction environment.
[0147]
As can be seen from FIG. 15, there is a significant difference between the two environments when the angle of the speaker's face is 90°. This is considered to be because some subjects could not perceive that the voice had rotated as far as 90°.
[0148]
FIG. 16 shows a scatter plot of the mean angular error for each subject, that is, the correlation between the real environment and the reproduction environment of each subject's mean angular error. The numbers in the circles identify the individual subjects; solid circles are male subjects and dotted circles are female subjects.
[0149]
FIG. 16 shows that for half of the subjects the difference in perception of the speech direction between the two environments is small, while for the other half the perception of the speech-direction angle is more accurate in the real environment. One of the subjects (a female) whose answers differed little between the two environments clearly perceived the angle of the speech direction rotating from 0° to 90° in the reproduction environment. These results show that there are individual differences in recognizing the angle of the speech direction, depending on the subject's hearing ability. FIG. 16 also shows that the difference between the two environments is especially small for the female subjects.
[0150]
In the subjective evaluation experiment, the output power of the loudspeaker 50 was controlled to keep the loudness (intensity) of the voice constant at every angle. However, when a three-party conversation is actually carried out using the sound reproduction system 10, the loudness (intensity) of the voice naturally changes with the direction (angle) the speaker faces, which is expected to make the speech direction even easier to perceive.
[0151]
According to this embodiment, not only the voice itself but also the direction of the speaker's voice can be reproduced; even from the reproduced speech, it is possible to perceive who is talking to whom. Therefore, conversation can be conducted smoothly.
[0152]
In this embodiment, the voice of a user wearing a headset microphone is reproduced, but the invention need not be limited to this. The sound of an instrument played by the user or the sound of the user's hand claps may also be reproduced. When the user plays a musical instrument, the direction of the instrument must be detected; for example, a gyro sensor is attached to the instrument and its direction is detected from the sensor output. When reproducing the sound of hand claps, a microphone is attached near the user's wrist, and a gyro sensor is provided near the wrist or on the torso to detect the direction of the user's hands or body.
[0153]
Further, in this embodiment, the direction of the user's face is detected from the video shot by the camera, but the invention need not be limited to this. For example, a gyro sensor may be mounted on the user's head (on the headset microphone), and the orientation of the user's face detected from its output.
[0154]
Also, in this embodiment, the transfer characteristics of voice are detected by installing a loudspeaker and a microphone array at a certain place and measuring the impulse responses, and the detected transfer characteristics are reflected in the voice filter; however, the invention need not be limited to this. For example, the transfer characteristics for each angle ang can also be calculated by simulation using the mirror image method. In that case, a reflectance is set on each virtual wall of the assumed environment, thereby generating reflected sound.
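As an editorial illustration of the mirror image method, a minimal first-order sketch for a rectangular room follows; the room height, reflectance, and all names are assumptions, and a practical simulation would add higher-order images and source directivity:

import numpy as np

def image_source_ir(src, mic, room, beta, fs=48000, c=343.0, n=4096):
    """First-order mirror-image impulse response in a rectangular room.
    src, mic: (x, y, z) positions; room: (Lx, Ly, Lz) dimensions;
    beta: reflection coefficient of every wall (0..1)."""
    src, mic = np.asarray(src, float), np.asarray(mic, float)
    ir = np.zeros(n)
    sources = [(src, 1.0)]                      # direct sound
    for axis in range(3):                       # mirror across each wall pair
        for wall in (0.0, room[axis]):
            img = src.copy()
            img[axis] = 2 * wall - img[axis]
            sources.append((img, beta))
    for pos, gain in sources:
        d = np.linalg.norm(pos - mic)
        t = int(round(d / c * fs))              # propagation delay in samples
        if t < n:
            ir[t] += gain / max(d, 1e-3)        # 1/r spherical spreading
    return ir

# usage sketch, loosely matching the 10 m x 3.9 m room of FIG. 11
# (the 2.5 m height and beta are assumed):
# ir = image_source_ir(src=(2.0, 1.0, 1.2), mic=(2.0, 3.0, 1.2),
#                      room=(10.0, 3.9, 2.5), beta=0.7)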
[0155]
Furthermore, in this embodiment, only the case where the users are located at the vertices of an equilateral triangle is shown as the virtual positional relationship, but the invention need not be limited to this. If a large set of transfer characteristics is prepared by measuring or calculating impulse responses for various distances and various angles of the loudspeaker with respect to the front direction of the microphone array, sound can be reproduced for a variety of positional relationships between users.
[0156]
Furthermore, in this embodiment, the sound field data detected by the microphone array is also reproduced; however, the sound field data need not be reproduced.
[0157]
Also, although this embodiment reproduces a conversation between three parties, a conversation between any other number of two or more parties can also be reproduced. For example, in a conversation between four parties, the users may be placed, as the virtual positional relationship, at the vertices of a square whose sides have a predetermined length; in a conversation between five parties, at the vertices of a regular pentagon whose sides have a predetermined length; and likewise for other numbers. Alternatively, the actual positional relationship may be expressed as a polygon, with each user arranged at a vertex. In any case, voice filters are prepared that take into account transfer characteristics obtained by measurement or calculation. In this embodiment, the numbers of microphones and loudspeakers used in the microphone array and the speaker array systems are reduced in consideration of the data transmission rate as well as the present performance of the server and computers; if that performance and transmission speed improve, it is considered that the sound field data and audio data could be reproduced in real time without reducing those numbers.
[0158]
10 ... sound field sharing system, 12 ... server, 14 ... microphone array, 18, 26, 34 ... computer, 20, 28, 36 ... speaker array system, 22, 30, 38 ... microphone, 24, 32, 40 ... camera