JP2015194557

DESCRIPTION JP2015194557
An object of the present invention is to suppress a decrease in the accuracy of estimating the direction of a speaker even when voice is collected while the user is holding the device. According to an embodiment, an electronic device includes an acceleration sensor, speech direction estimation processing means, and control means. The acceleration sensor detects acceleration. The speech direction estimation processing means estimates the direction of the speaker using the phase difference of the voice input to the microphones. The control means requests the speech direction estimation processing means to initialize data related to the process of estimating the direction of the speaker according to the acceleration detected by the acceleration sensor. [Selected figure] Figure 3
Electronic device and control method of electronic device
[0001]
Embodiments of the present invention relate to techniques for estimating the direction of a
speaker.
[0002]
There has been developed an electronic device for estimating the direction of a speaker based on
the phase difference for each frequency component of speech input to a plurality of
microphones.
[0003]
JP 2006-254226 A
[0004]
If voice is collected while the user holds the electronic device, the accuracy of estimating the
direction of the speaker may be reduced.
[0005]
An object of the present invention is to provide an electronic device and a control method of the electronic device which suppress the decrease in the accuracy of estimating the direction of the speaker even when voice is collected while the user is holding the device.
[0006]
According to the embodiment, the electronic device comprises an acceleration sensor, speech direction estimation processing means, and control means.
The acceleration sensor detects an acceleration.
The speech direction estimation processing means estimates the direction of the speaker using
the phase difference of the voice input to the microphone.
The control means requests the speech direction estimation processing means to initialize data
related to the process of estimating the direction of the speaker according to the acceleration
detected by the acceleration sensor.
[0007]
A perspective view showing an example of the external appearance of the electronic device of the embodiment.
A block diagram showing the configuration of the electronic device of the embodiment.
A functional block diagram of the recording application.
A view showing sound source directions and the arrival time differences observed in the acoustic signals.
A view showing the relationship between a frame and the frame shift amount.
A view showing the procedure of the FFT process and the short-time Fourier transform data.
A functional block diagram of the speech direction estimation unit.
A functional block diagram showing the internal configuration of the two-dimensional data conversion unit and the figure detection unit.
A view showing the procedure of the phase difference calculation.
A view showing the procedure of the coordinate value calculation.
A functional block diagram showing the internal configuration of the sound source information generation unit.
A view for explaining direction estimation.
A view showing the relationship between θ and ΔT.
A view showing an example of a screen displayed by the user interface display processing unit.
A flowchart showing the procedure for initializing data relating to speaker identification.
[0008]
Embodiments will be described below with reference to the drawings.
[0009]
First, the configuration of the electronic device of the present embodiment will be described with reference to FIG. 1 and FIG. 2.
This electronic device can be realized as a portable terminal, for example a tablet personal computer, a laptop or notebook personal computer, or a PDA. Hereinafter, it is assumed that the electronic device is realized as a tablet personal computer 10 (hereinafter referred to as the computer 10).
[0010]
FIG. 1 is a view showing the appearance of the computer 10. The computer 10 comprises a
computer main body 11 and a touch screen display 17. The computer main body 11 has a thin
box-shaped housing. The touch screen display 17 is disposed on the surface of the computer
main body 11. The touch screen display 17 includes a flat panel display (for example, a liquid
crystal display (LCD)) and a touch panel. The touch panel is provided to cover the screen of the
LCD. The touch panel is configured to detect the position on the touch screen display 17 touched
by the user's finger or pen.
[0011]
FIG. 2 is a block diagram showing the system configuration of the computer 10. As shown in FIG. 2, the computer 10 comprises a touch screen display 17, a CPU 101, a system controller 102, a main memory 103, a graphics controller 104, a BIOS-ROM 105, a non-volatile memory 106, an embedded controller (EC) 108, microphones 109A and 109B, an acceleration sensor 110, and the like.
[0012]
The CPU 101 is a processor that controls the operation of various modules in the computer 10.
The CPU 101 executes various software loaded from the non-volatile memory 106, which is a
storage device, to the main memory 103, which is a volatile memory. The software includes an
operating system (OS) 200 and various application programs. The various application programs
include a recording application 300.
[0013]
The CPU 101 also executes a basic input / output system (BIOS) stored in the BIOS-ROM 105.
The BIOS is a program for hardware control.
[0014]
The system controller 102 is a device that connects between the local bus of the CPU 101 and
various components. The system controller 102 also incorporates a memory controller that
controls access to the main memory 103. The system controller 102 also has a function of
executing communication with the graphics controller 104 via a PCI Express standard serial bus
or the like.
[0015]
The graphics controller 104 is a display controller that controls the LCD 17A used as a display
monitor of the computer 10. The display signal generated by the graphics controller 104 is sent
to the LCD 17A. The LCD 17A displays a screen image based on the display signal. A touch panel
17B is disposed on the LCD 17A. The touch panel 17B is a capacitive pointing device for
performing input on the screen of the LCD 17A. The touch position on the screen where the
finger is touched, the movement of the touch position, and the like are detected by the touch
panel 17B.
[0016]
The EC 108 is a one-chip microcomputer including an embedded controller for power
management. The EC 108 has a function of powering on or off the computer 10 in accordance
with the operation of the power button by the user.
[0017]
The acceleration sensor 110 detects accelerations applied to the electronic device 10 in the x, y, and z axis directions. The orientation of the electronic device 10 can be detected from the accelerations in the x, y, and z axis directions.
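As a concrete illustration, the following minimal Python sketch derives a device tilt from 3-axis acceleration readings by treating the measured vector as gravity. The patent does not specify how orientation is computed from the sensor output, so the pitch/roll formulas and the function name here are assumptions made for illustration only.

```python
import math

def tilt_from_acceleration(ax, ay, az):
    """Estimate device tilt (pitch, roll in degrees) from the gravity
    components reported by a 3-axis acceleration sensor.
    Assumes the device is roughly static, so the measured acceleration is
    dominated by gravity; an illustrative sketch, not the computation
    used in the embodiment."""
    pitch = math.degrees(math.atan2(-ax, math.sqrt(ay * ay + az * az)))
    roll = math.degrees(math.atan2(ay, az))
    return pitch, roll

# Example: device lying flat (gravity along +z) -> pitch = 0, roll = 0
print(tilt_from_acceleration(0.0, 0.0, 9.8))
```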
[0018]
FIG. 3 is a functional block diagram of the recording application 300. As shown in FIG. 3, the recording application 300 includes a frequency decomposition unit 301, a voice section detection unit 302, a speech direction estimation unit 303, a speaker clustering unit 304, a user interface display processing unit 305, a recording processing unit 306, a control unit 307, and the like.
[0019]
The recording processing unit 306 performs a recording process by performing compression
processing and the like on the audio data input from the microphone 109A and the microphone
109B and storing the audio data in the storage device 106.
[0020]
The control unit 307 can control the operation of each unit of the recording application 300.
[0021]
[Basic concept of sound source estimation based on the phase difference for each frequency component] The microphones 109A and 109B are two microphones arranged at a predetermined distance from each other in a medium such as air, and are means for converting the medium vibrations (sound waves) at two different points into electric signals (acoustic signals).
Hereinafter, when the microphones 109A and 109B are handled collectively, they are referred to as a microphone pair.
[0022]
The acoustic signal input unit 2 digitizes the two acoustic signals from the microphones 109A and 109B by periodically A/D converting them at a predetermined sampling frequency Fr, and generates amplitude data in time series.
[0023]
Assuming that the sound source is positioned sufficiently far away compared to the distance between the microphones, the wave front 401 of the sound wave emitted from the sound source 400 and reaching the microphone pair becomes almost a plane, as shown in FIG. 4A.
When this plane wave is observed at two different points using the microphone 109A and the microphone 109B, an arrival time difference ΔT corresponding to the direction R of the sound source 400 with respect to the line segment 402 connecting the microphone 109A and the microphone 109B (this is called the baseline) should be observed in the acoustic signals converted by the microphone pair.
When the sound source is sufficiently far away, this arrival time difference ΔT is 0 when the sound source 400 lies on a plane perpendicular to the baseline 402, and this direction is defined as the front direction of the microphone pair.
[0024]
[Frequency Decomposition Unit] The fast Fourier transform (FFT) is a general method of decomposing amplitude data into frequency components. The Cooley-Tukey FFT algorithm is known as a representative algorithm.
[0025]
As shown in FIG. 5, the frequency decomposition unit 301 extracts N consecutive pieces of amplitude data from the amplitude data 410 supplied by the acoustic signal input unit 2 as a frame (T-th frame 411) and performs a fast Fourier transform on it; this is repeated while the extraction position is shifted by the frame shift amount 413 (T+1-th frame 412).
[0026]
The amplitude data constituting the frame is subjected to windowing 601 as shown in FIG. 6A, and then the fast Fourier transform 602 is performed.
As a result, the short-time Fourier transform data of the input frame is generated in the real part buffer R[N] and the imaginary part buffer I[N] (603). An example of the window function 605 (a Hamming window or a Hanning window) is shown in FIG. 6(B).
[0027]
The short-time Fourier transform data generated here is data obtained by decomposing the amplitude data of the frame into N/2 frequency components. For the k-th frequency component fk, the numerical values of the real part R[k] and the imaginary part I[k] in the buffer 603 represent a point Pk on the complex coordinate system 604, as shown in FIG. 6(C). The square of the distance of Pk from the origin O is the power Po(fk) of the frequency component, and the signed rotation angle θ {θ: −π < θ ≤ π [radian]} of Pk from the real axis is the phase Ph(fk) of the frequency component.
[0028]
When the sampling frequency is Fr [Hz] and the frame length is N [samples], k takes integer values from 0 to (N/2)−1; k = 0 represents 0 [Hz] (the direct-current component) and k = (N/2)−1 represents Fr/2 [Hz] (the highest frequency component). This interval is divided equally by the frequency resolution Δf = (Fr/2)/((N/2)−1) [Hz], and the frequency at each k is expressed by fk = k·Δf.
[0029]
Note that, as described above, the frequency decomposition unit 301 performs this process continuously at a predetermined interval (frame shift amount Fs) to generate, in time series, frequency decomposition data sets consisting of a power value and a phase value for each frequency of the input amplitude data.
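The following minimal Python sketch illustrates this short-time frequency decomposition (framing, windowing, FFT, and per-component power and phase). The function name and the choice of a Hanning window are illustrative assumptions; the text above only requires that each frame be windowed and transformed as described.

```python
import numpy as np

def frequency_decomposition(amplitude, N, frame_shift):
    """Decompose time-series amplitude data into frequency decomposition
    data sets: for each frame, the power Po(fk) and phase Ph(fk) of the
    N/2 frequency components described in paragraphs [0026]-[0028].
    amplitude   : 1-D array of sampled amplitude data
    N           : frame length in samples
    frame_shift : frame shift amount Fs in samples"""
    window = np.hanning(N)                      # windowing 601 (Hanning window)
    datasets = []
    for start in range(0, len(amplitude) - N + 1, frame_shift):
        frame = amplitude[start:start + N] * window
        spectrum = np.fft.fft(frame)[: N // 2]  # real part R[k] + j * imaginary part I[k]
        power = np.abs(spectrum) ** 2           # Po(fk): squared distance of Pk from the origin
        phase = np.angle(spectrum)              # Ph(fk): signed angle in (-pi, pi]
        datasets.append((power, phase))
    return datasets
```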
[0030]
[Voice Section Detection Unit] The voice section detection unit 302 detects a voice section based
on the result of the frequency decomposition unit 301.
[0031]
[Speech Direction Estimation Unit] The speech direction estimation unit 303 estimates the speech direction of the voice section based on the detection result of the voice section detection unit 302.
FIG. 7 is a functional block diagram of the speech direction estimation unit 303. As shown in FIG. 7, the speech direction estimation unit 303 includes a two-dimensional data conversion unit 701, a figure detection unit 702, a sound source information generation unit 703, and an output unit 704.
[0032]
(Two-Dimensional Data Conversion Unit and Figure Detection Unit) As shown in FIG. 8, the two-dimensional data conversion unit 701 includes a phase difference calculation unit 801 and a coordinate value determination unit 802.
The figure detection unit 702 includes a voting unit 811 and a straight line detection unit 812.
[0033]
[Phase difference calculation unit] The phase difference calculation unit 801 compares the two frequency-resolved data sets a and b at the same time obtained by the frequency decomposition unit 301, and generates phase difference data between a and b by calculating the difference between the phase values of corresponding frequency components. For example, as shown in FIG. 9, the phase difference ΔPh(fk) of a certain frequency component fk is calculated as the difference between the phase value Ph1(fk) at the microphone 109A and the phase value Ph2(fk) at the microphone 109B, taken modulo 2π so as to fall within {ΔPh(fk): −π < ΔPh(fk) ≤ π}.
[0034]
[Coordinate value determination unit] The coordinate value determination unit 802 is a means for determining, based on the phase difference data obtained by the phase difference calculation unit 801, coordinate values that allow each frequency component to be treated as a point on a two-dimensional XY coordinate system. The X coordinate value x(fk) and the Y coordinate value y(fk) corresponding to the phase difference ΔPh(fk) of a certain frequency component fk are determined by the equations shown in FIG. 10: the X coordinate value is the phase difference ΔPh(fk), and the Y coordinate value is the frequency component number k.
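As a concrete illustration, the sketch below computes the wrapped per-component phase differences between the two microphones and arranges them as (x, y) = (ΔPh(fk), k) points. It assumes the phase arrays come from the frequency decomposition sketched earlier; the function name is illustrative.

```python
import numpy as np

def phase_difference_points(phase_a, phase_b):
    """Build the two-dimensional data of [0033]-[0034]: for each frequency
    component k, x = dPh(fk) = Ph1(fk) - Ph2(fk) wrapped into (-pi, pi],
    and y = the frequency component number k.
    phase_a, phase_b : per-component phase arrays for the same frame
                       from microphones 109A and 109B
    Returns an (N/2, 2) array of (x, y) points."""
    dph = np.angle(np.exp(1j * (phase_a - phase_b)))  # wrap into (-pi, pi]
    k = np.arange(len(dph))
    return np.stack([dph, k], axis=1)
```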
[0035]
[Voting unit] The voting unit 811 is a means for applying a linear Hough transform to each frequency component to which (x, y) coordinates have been given by the coordinate value determination unit 802, and voting its locus into a Hough voting space by a predetermined method.
[0036]
[Line Detection Unit] The line detection unit 812 is a means for analyzing a vote distribution on
the Hough voting space generated by the voting unit 811 to detect an effective line.
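To make the voting and line-detection steps concrete, here is a minimal straight-line Hough transform over the (x, y) points, using the standard ρ = x·cosθ + y·sinθ parametrization and returning the peak cell. The bin counts, resolutions, and the single-peak selection are illustrative assumptions; the actual voting unit 811 and straight line detection unit 812 may weight votes and detect multiple lines by their own predetermined methods.

```python
import numpy as np

def hough_detect_line(points, theta_bins=181, rho_resolution=0.05):
    """Vote each (x, y) point into a Hough voting space and return the
    (theta, rho) of the strongest straight line, standing in for the voting
    unit 811 and the straight line detection unit 812.
    points : (M, 2) array, e.g. the output of phase_difference_points()."""
    points = np.asarray(points, dtype=float)
    thetas = np.linspace(-np.pi / 2, np.pi / 2, theta_bins)
    rho_max = float(np.abs(points).sum(axis=1).max()) + 1e-9
    n_rho = int(2 * rho_max / rho_resolution) + 1
    acc = np.zeros((theta_bins, n_rho), dtype=int)      # Hough voting space
    for x, y in points:
        rhos = x * np.cos(thetas) + y * np.sin(thetas)  # locus of (x, y) in Hough space
        idx = ((rhos + rho_max) / rho_resolution).astype(int)
        acc[np.arange(theta_bins), idx] += 1            # one vote per cell on the locus
    t_i, r_i = np.unravel_index(acc.argmax(), acc.shape)
    return thetas[t_i], r_i * rho_resolution - rho_max
```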
[0037]
[Sound Source Information Generation Unit] As shown in FIG. 11, the sound source information generation unit 703 includes a direction estimation unit 1111, a sound source component estimation unit 1112, a sound source sound re-synthesizing unit 1113, a time series tracking unit 1114, a duration evaluation unit 1115, an in-phase unit 1116, an adaptive array processing unit 1117, and a voice recognition unit 1118.
[0038]
[Direction Estimating Unit] The direction estimating unit 1111 receives the straight line detection
result by the straight line detecting unit 812 described above, that is, the θ value for each
straight line group, and calculates the existence range of the sound source corresponding to each
straight line group.
At this time, the number of detected straight line groups is the number of sound sources (all
candidates).
If the distance to the sound source is sufficiently large compared with the baseline of the microphone pair, the existence range of the sound source is a conical surface that makes a fixed angle with the baseline of the microphone pair.
This will be described with reference to FIG. 12.
[0039]
The arrival time difference ΔT between the microphone 109A and the microphone 109B can vary within the range of ±ΔTmax. As shown in FIG. 12A, when the sound is incident from the front, ΔT is 0 and the azimuth angle φ of the sound source is 0° with reference to the front. As shown in FIG. 12B, when the sound is incident directly from the right, that is, from the direction of the microphone 109B, ΔT is equal to +ΔTmax, and the azimuth angle φ of the sound source is +90°, taking clockwise as positive with reference to the front. Similarly, as shown in FIG. 12C, when the sound is incident directly from the left, that is, from the direction of the microphone 109A, ΔT is equal to −ΔTmax, and the azimuth angle φ is −90°. Thus, ΔT is defined as positive when the sound is incident from the right and negative when the sound is incident from the left.
[0040]
Based on the above, the general situation shown in FIG. 12(D) is considered. Let the position of the microphone 109A be A and the position of the microphone 109B be B, and assume that the voice is incident from the direction of the line segment PA; the triangle PAB is then a right triangle with a right angle at the vertex P. With O as the midpoint between the microphones and the line segment OC as the front direction of the microphone pair, the azimuth angle φ is defined as the angle measured from the OC direction (azimuth 0°), taking the counterclockwise direction as positive. Since the triangle QOB is similar to the triangle PAB, the absolute value of the azimuth angle φ is equal to ∠OBQ, that is, ∠ABP, and its sign corresponds to the sign of ΔT. ∠ABP can be calculated as sin−1 of the ratio of PA to AB. Here, when the length of the line segment PA is represented by the corresponding ΔT, the length of the line segment AB corresponds to ΔTmax. Thus, including the sign, the azimuth can be calculated as φ = sin−1(ΔT/ΔTmax). The existence range of the sound source is then estimated as a conical surface 1200 that opens at (90−φ)° with the point O as its vertex and the baseline AB as its axis. The sound source is somewhere on this conical surface 1200.
[0041]
As shown in FIG. 13, ΔTmax is the value obtained by dividing the distance L [m] between the microphones by the sound velocity Vs [m/sec]. It is known that the sound velocity Vs can be approximated as a function of the air temperature t [°C.]. Now, suppose that the straight line 1300 is detected by the straight line detection unit 812 with a Hough inclination θ. Since the straight line 1300 is inclined to the right, θ is a negative value. At y = k (frequency fk), the phase difference ΔPh indicated by the straight line 1300 can be obtained as k·tan(−θ), a function of k and θ. The arrival time difference ΔT [sec] is then the time obtained by multiplying one period (1/fk) [sec] of the frequency fk by the ratio of the phase difference ΔPh(θ, k) to 2π. Since θ is a signed quantity, ΔT is also a signed quantity. That is, when the sound is incident from the right in FIG. 12D (the phase difference ΔPh has a positive value), θ has a negative value, and when the sound is incident from the left in FIG. 12D (the phase difference ΔPh has a negative value), θ has a positive value; therefore the sign of θ is inverted. In the actual calculation, the calculation may be performed at k = 1 (the frequency immediately above the DC component k = 0).
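Putting paragraphs [0040] and [0041] together, the short sketch below converts a detected line inclination θ into an azimuth estimate: ΔPh = k·tan(−θ) at component k, ΔT = (1/fk)·ΔPh/2π, and φ = sin−1(ΔT/ΔTmax) with ΔTmax = L/Vs. The speed-of-sound approximation Vs ≈ 331.5 + 0.6·t is a standard formula supplied here as an assumption, since the text only states that Vs can be approximated from the air temperature.

```python
import math

def estimate_azimuth(theta, mic_distance_m, delta_f_hz, air_temp_c=20.0, k=1):
    """Estimate the sound source azimuth phi [deg] from a detected line
    inclination theta, following paragraphs [0040]-[0041].
    mic_distance_m : distance L between the microphones [m]
    delta_f_hz     : frequency resolution (so fk = k * delta_f_hz)
    k              : frequency component number used for the calculation."""
    vs = 331.5 + 0.6 * air_temp_c            # sound velocity [m/s]; standard approximation (assumption)
    dt_max = mic_distance_m / vs             # maximum arrival time difference [s]
    fk = k * delta_f_hz                      # frequency of component k [Hz]
    dph = k * math.tan(-theta)               # phase difference indicated by the line at y = k
    dt = (1.0 / fk) * dph / (2.0 * math.pi)  # arrival time difference [s]
    ratio = max(-1.0, min(1.0, dt / dt_max)) # clamp against numerical overshoot
    return math.degrees(math.asin(ratio))    # phi = asin(dT / dTmax), signed
```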
[0042]
[Sound source component estimation unit] The sound source component estimation unit 1112 evaluates the distance between the (x, y) coordinate value of each frequency component given by the coordinate value determination unit 802 and the straight line detected by the straight line detection unit 812, detects points located in the vicinity of the straight line as frequency components belonging to that straight line (that is, to that sound source), and estimates the frequency components of each sound source based on the detection result.
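A minimal sketch of this nearest-line assignment is shown below, using the point-to-line distance in the ρ-θ form produced by the Hough sketch above. The distance threshold and the single-nearest-line rule are assumptions made for illustration.

```python
import numpy as np

def assign_components(points, lines, max_distance):
    """Assign each (x, y) frequency-component point to the nearest detected
    straight line (i.e. sound source) if it lies within max_distance of it.
    points : (M, 2) array of (dPh, k) points
    lines  : list of (theta, rho) pairs with rho = x*cos(theta) + y*sin(theta)
    Returns, per line, the indices of the frequency components assigned to it."""
    assignments = [[] for _ in lines]
    for i, (x, y) in enumerate(np.asarray(points, dtype=float)):
        dists = [abs(x * np.cos(t) + y * np.sin(t) - rho) for t, rho in lines]
        j = int(np.argmin(dists))
        if dists[j] <= max_distance:
            assignments[j].append(i)   # point i lies near line j -> component of source j
    return assignments
```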
[0043]
[Sound source sound re-synthesizing unit] The sound source sound re-synthesizing unit 1113 performs inverse FFT processing on the frequency components at the same acquisition time constituting each sound source sound, thereby re-synthesizing the sound source sound (amplitude data).
As illustrated in FIG. 5, one frame overlaps the next frame with a time difference of the frame shift amount. In a period where a plurality of frames overlap, the amplitude data of all the overlapping frames can therefore be averaged to form the final amplitude data. Such processing makes it possible to separate and extract the source sound as amplitude data.
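The sketch below shows one way to implement this overlap-and-average re-synthesis. It assumes full-length complex spectra per frame (in practice the N/2 stored components would first be mirrored by conjugate symmetry) and ignores the analysis window for simplicity; both simplifications are assumptions of this illustration.

```python
import numpy as np

def resynthesize(frame_spectra, frame_shift):
    """Re-synthesize amplitude data from per-frame spectra by inverse FFT,
    averaging the overlapping portions of adjacent frames ([0043]).
    frame_spectra : list of length-N complex spectra, one per frame
    frame_shift   : frame shift amount in samples."""
    N = len(frame_spectra[0])
    total = frame_shift * (len(frame_spectra) - 1) + N
    acc = np.zeros(total)
    counts = np.zeros(total)
    for i, spectrum in enumerate(frame_spectra):
        frame = np.fft.ifft(spectrum).real     # inverse FFT back to amplitude data
        start = i * frame_shift
        acc[start:start + N] += frame
        counts[start:start + N] += 1
    return acc / np.maximum(counts, 1)         # average all overlapping frames
```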
[0044]
[Time Series Tracking Unit] The straight line detection unit 812 obtains a straight line group at every Hough vote by the voting unit 811. Hough voting is performed collectively on m consecutive (m ≥ 1) FFT results. As a result, straight line groups are obtained in time series with a period of m frames (this will be referred to as the figure detection period). Further, since the θ of a straight line group corresponds one to one to the sound source direction φ calculated by the direction estimation unit 1111, for a stable sound source, whether stationary or moving, the trajectory of θ (or φ) on the time axis should be continuous. On the other hand, depending on how the threshold is set, the straight line groups detected by the straight line detection unit 812 may include straight line groups corresponding to background noise (referred to as noise straight line groups). However, the trajectory of θ (or φ) of such a noise straight line group on the time axis can be expected to be discontinuous or short.
[0045]
The time-series tracking unit 1114 is a means for obtaining a locus on the time axis of φ by
dividing φ thus obtained for each figure detection cycle into continuous groups on the time axis.
[0046]
[Duration evaluation unit] The duration evaluation unit 1115 calculates the duration of each trajectory from the start time and end time of the trajectory data output by the time series tracking unit 1114; trajectories whose duration reaches a predetermined threshold are recognized as trajectory data based on sound source sound, and the others are recognized as trajectory data based on noise.
Trajectory data based on sound source sound is called sound source stream information. The sound source stream information includes the start time Ts and the end time Te of the sound source sound and time-series trajectory data of θ, ρ, and φ representing the sound source direction. Although the number of straight line groups obtained by the figure detection unit 702 gives the number of sound sources, noise sources are also included in it; the number of pieces of sound source stream information obtained by the duration evaluation unit 1115 gives the number of reliable sound sources excluding those based on noise.
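A minimal sketch of this duration-threshold split is given below; the trajectory representation (a dict with start and end times) is an assumption for illustration.

```python
def evaluate_duration(trajectories, min_duration):
    """Split tracked trajectories into sound source streams and noise by
    comparing their duration (end - start) with a threshold ([0046]).
    trajectories : iterable of dicts with at least 'start' and 'end' keys."""
    streams, noise = [], []
    for t in trajectories:
        (streams if t['end'] - t['start'] >= min_duration else noise).append(t)
    return streams, noise
```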
[0047]
[In-phase unit] The in-phase unit 1116 obtains the temporal transition of the sound source direction φ of a stream by referring to the sound source stream information from the time series tracking unit 1114, calculates the intermediate value φmid = (φmax + φmin)/2 from the maximum value φmax and the minimum value φmin of φ, and determines the width φw = φmax − φmid. Then, the time-series data of the two frequency-resolved data sets a and b from which the sound source stream information originates is extracted from a predetermined time before the start time Ts of the stream to a predetermined time after the end time Te, and the phase differences are brought into phase by correcting them so as to cancel the arrival time difference calculated back from the intermediate value φmid.
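The φmid/φw computation itself is simple; the sketch below states it explicitly, with the stream represented merely as a sequence of φ values (an assumption of this illustration).

```python
def stream_direction_window(phi_series):
    """Compute the centre direction phi_mid and half-width phi_w of a stream
    from its time series of phi values, as used by the in-phase unit 1116."""
    phi_max, phi_min = max(phi_series), min(phi_series)
    phi_mid = (phi_max + phi_min) / 2.0   # phi_mid = (phi_max + phi_min) / 2
    phi_w = phi_max - phi_mid             # phi_w   = phi_max - phi_mid
    return phi_mid, phi_w
```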
[0048]
Alternatively, the time-series data of the two frequency-resolved data sets a and b can always be brought into phase by using the sound source direction φ at each time, obtained by the direction estimation unit 1111, as φmid. Whether to refer to the sound source stream information or to φ at each time is determined by the operation mode, and this operation mode can be set and changed as a parameter.
[0049]
[Adaptive array processing unit] The adaptive array processing unit 1117 applies adaptive array processing to the extracted, in-phase time-series data of the two frequency-resolved data sets a and b, with the center directivity directed to the front (0°) and the tracking range set to ±φw plus a predetermined margin, and thereby separates and extracts the time-series data of the frequency components of the sound source sound of the stream with high accuracy. Although the method differs, this functions in the same manner as the sound source component estimation unit 1112 in that time-series data of frequency components is separated and extracted. Therefore, the sound source sound re-synthesis unit 1113 can also re-synthesize the amplitude data of the sound source sound from the time-series data of the frequency components obtained by the adaptive array processing unit 1117.
[0050]
As the adaptive array processing, it is possible to apply a method that clearly separates and extracts speech within a set directivity range, for example a Griffiths-Jim type generalized sidelobe canceller, which is known as a construction method of a beamformer with main and sub beams, as described in Amada et al., "Microphone array technology for speech recognition", Toshiba Review, Vol. 59, No. 9, 2004.
[0051]
Usually, when adaptive array processing is used, the tracking range is set in advance and only the voice coming from that direction is waited for, so waiting for voice from all directions required preparing a large number of adaptive arrays with different tracking ranges.
In this embodiment, on the other hand, adaptive arrays need only be operated in the number corresponding to the number of sound sources after the number of sound sources and their directions have actually been obtained, and the tracking range can be set narrowly according to the direction of each sound source, so that speech can be separated and extracted efficiently and with good quality.
[0052]
At this time, by bringing the time-series data of the two frequency-resolved data sets a and b into phase in advance, sound from any direction can be processed simply by setting the tracking range of the adaptive array processing to the front only.
[0053]
[Voice recognition unit] The voice recognition unit 1118 analyzes and collates the time-series data of the frequency components of the sound source extracted by the sound source component estimation unit 1112 or the adaptive array processing unit 1117 to obtain the symbolic content of the stream, that is, it extracts symbols (strings) representing the linguistic meaning, the type of sound source, and the distinction of speakers.
[0054]
In addition, the functional blocks from the direction estimation unit 1111 to the voice recognition unit 1118 can exchange information with one another as needed through connections not shown in FIG. 11.
[0055]
The output unit 704 is a means for outputting, as the sound source information of the sound source information generation unit 703, information including at least one of: the number of sound sources, obtained as the number of straight line groups by the figure detection unit 702; the spatial existence range of each sound source, that is, of each generation source of the acoustic signals (the angle φ determining the conical surface), estimated by the direction estimation unit 1111; the component configuration of the voice emitted from each sound source (time-series data of power and phase for each frequency component), estimated by the sound source component estimation unit 1112; the separated voice for each sound source (time-series data of amplitude values), re-synthesized by the sound source sound re-synthesis unit 1113; the number of sound sources excluding those determined to be noise sources by the time series tracking unit 1114 and the duration evaluation unit 1115; the temporal existence period of the voice emitted by each sound source, determined by the time series tracking unit 1114 and the duration evaluation unit 1115; the separated voice for each sound source (time-series data of amplitude values), determined by the in-phase unit 1116 and the adaptive array processing unit 1117; and the symbolic content of the voice of each sound source, obtained by the voice recognition unit 1118.
[0056]
[Speaker Clustering Unit] The speaker clustering unit 304 generates the speaker identification
information 310 for each time based on the temporal existence period of the voice generated by
each sound source output from the output unit 704.
The speaker identification information 310 includes information that associates the speech start time with the speaker.
[0057]
[User Interface Display Processing Unit] The user interface display processing unit 305 is a means for presenting the various settings necessary for the above-described acoustic signal processing to the user, accepting setting input from the user, and saving the settings to and reading them from an external storage device, and also for visualizing various processing results and intermediate results and presenting them to the user, for example (1) display of the frequency components for each microphone, (2) display of the phase difference (or time difference) plot (that is, display of the two-dimensional data), (3) display of the various vote distributions, (4) display of the maximum positions, (5) display of the straight line groups on the plot, (6) display of the frequency components belonging to the straight line groups, and (7) display of the trajectory data, and for allowing the user to select desired data for more detailed visualization.
By doing this, the user can confirm the function of the acoustic signal processing apparatus according to the present embodiment, make adjustments so that it performs a desired operation, and thereafter use the apparatus in the adjusted state.
[0058]
The user interface display processing unit 305 displays, for example, a screen shown in FIG. 14
on the LCD 17A based on the speaker identification information 310.
[0059]
At the top of the LCD 17A, objects 1401, 1402, and 1403 indicating the speakers are shown.
At the bottom of the LCD 17A, objects 1411A, 1411B, 1412, 1413A, and 1413B corresponding to the speaking times of the speakers are displayed.
When there is an utterance, the objects 1411A, 1411B, 1412, 1413A, and 1413B are displayed so as to flow with time from right to left.
The objects 1411A, 1411B, 1412, 1413A, and 1413B are displayed in colors corresponding to the objects 1401, 1402, and 1403, respectively.
[0060]
Speaker identification that uses the phase difference produced by the distance between the microphones loses accuracy when the terminal is moved during recording. The present device therefore uses the accelerations in the x, y, and z axis directions obtained from the acceleration sensor 110 and the inclination of the terminal for the speaker identification, in order to suppress the loss of convenience caused by this degradation in accuracy.
[0061]
The control unit 307 requests the speech direction estimation unit 303 to initialize data related
to the process of estimating the direction of the speaker according to the acceleration detected
by the acceleration sensor.
[0062]
FIG. 15 is a flow chart showing a procedure for initializing data relating to speaker identification.
[0063]
The control unit 307 determines whether the difference between the current tilt of the device 10
obtained from the acceleration sensor 110 and the tilt of the device 10 at the start of the speaker
identification exceeds a threshold (step B11).
When it is determined that the threshold value is exceeded (Yes in step B11), the control unit
307 requests the speech direction estimation unit 303 to initialize data relating to speaker
identification (step B12).
The speech direction estimation unit 303 initializes data relating to speaker identification (step
B13). Then, the speech direction estimation unit 303 performs speaker identification processing
based on the data newly generated by each unit in the speech direction estimation unit 303.
[0064]
If it is determined that the threshold is not exceeded (No in step B11), the control unit 307 determines whether the acceleration values in the x, y, and z directions of the device 10 obtained from the acceleration sensor 110 have become periodic (step B14). When it is determined that the acceleration values have become periodic (Yes in step B14), the control unit 307 requests the recording processing unit 306 to stop the recording process (step B15). Further, the control unit 307 requests the frequency decomposition unit 301, the voice section detection unit 302, the speech direction estimation unit 303, and the speaker clustering unit 304 to stop their processing. The recording processing unit 306 stops the recording process (step B16). The frequency decomposition unit 301, the voice section detection unit 302, the speech direction estimation unit 303, and the speaker clustering unit 304 stop their processing.
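The control-unit procedure of FIG. 15 can be summarized in the following sketch. The component objects, their method names, and the autocorrelation-based periodicity check are all assumptions made for illustration; the text states only that initialization is requested when the tilt change exceeds a threshold and that recording is stopped when the accelerations become periodic.

```python
import numpy as np

def control_step(current_tilt, start_tilt, accel_history,
                 tilt_threshold, estimator, recorder):
    """One pass of the procedure of FIG. 15. 'estimator' and 'recorder' are
    hypothetical stand-ins for the speech direction estimation unit 303 and
    the recording processing unit 306; their method names are assumptions."""
    # Step B11: has the device tilt changed too much since identification started?
    if abs(current_tilt - start_tilt) > tilt_threshold:
        estimator.initialize_speaker_identification_data()   # steps B12-B13
        return
    # Step B14: have the x/y/z acceleration values become periodic (e.g. walking)?
    if looks_periodic(accel_history):
        recorder.stop_recording()                             # steps B15-B16
        estimator.stop()

def looks_periodic(samples, threshold=0.5):
    """Crude periodicity check via normalized autocorrelation; the patent does
    not specify how periodicity is detected, so this is only an assumption."""
    x = np.asarray(samples, dtype=float)
    if len(x) < 2 or x.max() == x.min():
        return False
    x = x - x.mean()
    ac = np.correlate(x, x, mode='full')[len(x):]   # non-zero-lag autocorrelation
    return bool(ac.max() / float(x @ x) > threshold)
```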
[0065]
According to the present embodiment, by requesting the speech direction estimation unit 303 to initialize the data related to the process of estimating the direction of the speaker according to the acceleration detected by the acceleration sensor 110, it is possible to suppress the decrease in the accuracy of estimating the direction of the speaker even when voice is collected while the device is held by the user.
[0066]
Note that since the various processes of the present embodiment can be realized by a computer program, similar effects can easily be obtained simply by installing the computer program into a computer through a computer-readable storage medium storing the computer program and executing it.
[0067]
While certain embodiments of the present invention have been described, these embodiments
have been presented by way of example only, and are not intended to limit the scope of the
invention.
These novel embodiments can be implemented in various other forms, and various omissions,
substitutions, and modifications can be made without departing from the scope of the invention.
These embodiments and modifications thereof are included in the scope and the gist of the
invention, and are included in the invention described in the claims and the equivalent scope
thereof.
[0068]
DESCRIPTION OF SYMBOLS 10: tablet personal computer (electronic device), 101: CPU, 103: main memory, 106: storage device, 108: embedded controller, 109A: microphone, 109B: microphone, 110: acceleration sensor, 200: operating system, 300: recording application, 301: frequency decomposition unit, 302: voice section detection unit, 303: speech direction estimation unit, 304: speaker clustering unit, 305: user interface display processing unit, 306: recording processing unit, 307: control unit