JPH0713586

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JPH0713586
[0001]
BACKGROUND OF THE INVENTION 1. Field of the Invention The present invention relates to a
voice discrimination apparatus that is used as a pre-processing stage in equipment such as video
and audio apparatus and that automatically determines whether or not a continuously input
acoustic signal is voice, and to a sound reproduction apparatus using the voice discrimination
apparatus.
[0002]
2. Description of the Related Art In recent years, functions called "surround" for creating sound
effects have been installed in stereo devices and television receivers (hereinafter referred
to as "televisions"). While these functions are effective for music and similar sources, they reduce
clarity for voice-based sources such as news programs. Therefore, if it can be automatically
determined whether or not the source is voice-based, the sound field and frequency
characteristics can be controlled optimally according to the result.
[0003]
The conventional voice discrimination device relies on the input signal being a stereo signal. That
is, in the case of a source such as music, the signals of the left channel (hereinafter referred to as
L channel) and the right channel (hereinafter referred to as R channel) are independent of each
08-05-2019
1
other, and the correlation between both channels is low. Conversely, in the case of a voice-based
source such as a news program, the sound is localized at the center, so the left signal
(hereinafter referred to as L signal) and the right signal (hereinafter referred to as R signal) are
almost the same signal and the correlation between both channels is high. Therefore, the difference
between the amplitudes of the L signal and the R signal is calculated; when the difference is
small, the signal is determined to be voice, and when the difference is large, it is determined to
be a signal other than voice. Alternatively, the correlation value of the L signal and the R signal
may be calculated, with a large correlation value indicating a voice signal and a small value
indicating a signal other than voice.
[0004]
Such a conventional voice discrimination apparatus is effective for stereo sources, but it has the
problem that it cannot discriminate monaural sources, in which there is no difference between the
L signal and the R signal.
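The conventional check and its limitation can be illustrated with a minimal Python sketch. The function name, the normalization by overall level, and the threshold value are assumptions made for illustration; the text above only states that a small L/R amplitude difference is taken to indicate voice.

```python
def looks_like_voice_by_lr_difference(left, right, diff_threshold=0.1):
    """Conventional stereo-only check (illustrative sketch, hypothetical name)."""
    # Accumulated amplitude difference between the two channels.
    total_diff = sum(abs(l - r) for l, r in zip(left, right))
    # Overall level, used only to make the threshold scale-independent (assumption).
    total_level = sum(abs(l) + abs(r) for l, r in zip(left, right))
    if total_level == 0:
        return False  # no signal at all: arbitrarily reported as "not voice"
    return total_diff / total_level < diff_threshold


# Stereo, music-like input: the channels differ, so the difference is large.
stereo_left = [0.5, -0.4, 0.3, -0.2]
stereo_right = [-0.1, 0.6, -0.5, 0.4]
stereo_result = looks_like_voice_by_lr_difference(stereo_left, stereo_right)

# Monaural input: L and R are identical whatever the content, so the
# difference is always zero and the check always answers "voice".
mono = [0.5, -0.4, 0.3, -0.2]
mono_result = looks_like_voice_by_lr_difference(mono, mono)
```

With identical channels the difference is zero for any content, which is why this kind of check cannot separate voice from music on a monaural source.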
[0005]
The present invention solves the above-mentioned problem. Its object is to provide a voice
discrimination device capable of accurately determining whether or not a signal is voice for both
monaural and stereo signals, and a sound reproduction device that uses this voice discrimination
device to automatically control sound characteristics according to the source.
[0006]
According to a first aspect of the present invention (claim 1), there is provided a voice
discrimination device comprising: a power calculation unit that calculates the acoustic power of
an acoustic signal for each frame of a fixed time; a sound/silence determination unit that
determines whether the frame is sound or silence by comparing the calculated acoustic power value
with a preset threshold; a zero crossing calculation unit that calculates the number of zero
crossings of the waveform of the acoustic signal for each frame; a consonant determination unit
that determines the consonant property of the frame by comparing the calculated number of zero
crossings with a preset threshold; a stationarity determination unit that detects the maximum value
and the minimum value of the power values in a predetermined plurality of consecutive frames and
calculates their difference value; and a voice determination unit that determines the acoustic
signal in the plurality of frames to be voice when the existence ratio of frames determined to be
silent in the plurality of frames, the existence ratio of frames determined to be high in consonant
property, and the difference value are all larger than their respective preset thresholds, and
determines the acoustic signal in the plurality of frames to be non-voice otherwise.
Further, according to the present invention (claim 2), there is provided a voice discrimination
device comprising: a power calculation unit that calculates the acoustic power of an acoustic
signal for each frame of a fixed time; a sound/silence determination unit that determines whether
the frame is sound or silence by comparing the calculated acoustic power value with a preset
threshold; a zero crossing calculation unit that calculates the number of zero crossings of the
waveform of the acoustic signal for each frame; a consonant determination unit that determines the
consonant property of the frame by comparing the calculated number of zero crossings with a preset
threshold; a stationarity determination unit that detects the maximum value and the minimum value
of the power values in a predetermined plurality of consecutive frames and calculates their
difference value; and a voice determination unit that determines the acoustic signal in the
plurality of frames to be voice when the existence ratio of frames determined to be silent in the
plurality of frames, the existence ratio of frames determined to be high in consonant property,
and the difference value are all larger than their respective preset thresholds, determines it to
be non-voice when it is not determined to be voice and the existence ratio of frames determined to
be silent and the difference value are each smaller than thresholds set in advance to be smaller
than the aforementioned thresholds, and determines it to be indeterminate in all other cases.
The apparatus according to claim 3 is a sound reproduction apparatus comprising: the voice
discrimination device according to claim 1, which receives an acoustic signal and performs
voice/non-voice discrimination; and a unit that receives the acoustic signal and the
voice/non-voice discrimination result output by the voice discrimination device at every
predetermined time, and changes the frequency characteristic of the acoustic signal stepwise
toward the optimal frequency characteristic according to the discrimination result.
The apparatus according to claim 4 is a sound reproduction apparatus of the same configuration
that uses the voice discrimination device according to claim 2: it receives the acoustic signal
and the voice/non-voice discrimination result output at every predetermined time, and changes the
frequency characteristic of the acoustic signal stepwise toward the optimal frequency
characteristic according to the discrimination result.
[0007]
In the invention according to claim 1, the power calculation unit calculates the signal power of
each frame of the acoustic signal, and the sound/silence determination unit determines from the
magnitude of the calculated power whether the frame is sound or silence. The zero crossing
calculation unit calculates the number of zero crossings in each frame of the acoustic signal,
and the consonant determination unit determines the consonant property of the frame from the
calculated number of zero crossings.
The stationarity determination unit calculates the difference value between the maximum value and
the minimum value of power over a plurality of consecutive frames. The voice determination unit
determines that the acoustic signal in the plurality of frames is voice when the existence ratio
of silent frames, the existence ratio of consonant frames, and the power difference value in the
plurality of frames are each greater than their respective thresholds.
[0008]
Further, in the invention according to claim 2, the voice determination unit determines that the
acoustic signal in a plurality of frames is voice when the existence ratio of silent frames, the
existence ratio of consonant frames, and the power difference value in the plurality of frames
are each larger than their respective thresholds. When the signal is not determined to be voice
and the existence ratio of silent frames and the power difference value are each smaller than
thresholds set lower than the respective thresholds above, it is determined to be non-voice;
otherwise it is determined to be indeterminate.
[0009]
Further, in the invention according to claim 3 and claim 4, the voice/music discrimination unit
determines whether or not the acoustic signal is voice, and the frequency characteristic control
unit, based on the determination result, switches the frequency characteristic applied to the
input acoustic signal step by step to a frequency characteristic suitable for that signal and
outputs the result.
[0010]
(Embodiment 1) Hereinafter, an embodiment of the speech discrimination apparatus of the
present invention will be described with reference to the drawings.
[0011]
FIG. 1 is a block diagram showing the configuration of this embodiment.
In the figure, 1 is a power calculation unit that calculates the power of the input signal; 2 is a
zero crossing calculation unit that calculates the number of zero crossings of the waveform for
each frame; 3 is a sound/silence determination unit that determines whether the input signal of a
frame is sound or silence by comparing the calculated power with a threshold; 4 is a consonant
determination unit that determines the presence or absence of consonant character in the frame
based on the number of zero crossings for each frame; 5 is a stationarity determination unit that
determines stationarity based on the difference between the maximum value and the minimum value
of power over a fixed plurality of frames; and 6 is a voice determination unit that determines,
for each plurality of frames, whether the signal is voice or non-voice based on the ratio of
frames determined to be silent in the plurality of frames, the difference between the maximum and
minimum power, and the ratio of frames in the plurality of frames whose number of zero crossings
is a predetermined number or more.
[0012]
The interrelationship and operation of the above components will now be described.
Here, the input signal is a signal from a device such as an audio device or a television, and is a
stereo signal.
The L signal and R signal of the input stereo signal are mixed and input to the power calculation
unit 1 as an (L + R) signal. For each frame of a fixed time interval, the power calculation unit 1
calculates the accumulated value or the average value of the amplitude within the frame as the
power value of that frame. The zero crossing calculation unit 2 calculates, for each frame, the
number of times the input waveform crosses the zero amplitude level as the number of zero
crossings Z0. In speech, the number of zero crossings takes a large value, especially in unvoiced
fricative consonants. The consonant determination unit 4 therefore determines that the consonant
property of a frame is high if the number of zero crossings Z0 obtained by the zero crossing
calculation unit 2 satisfies Z0 > Zt. Here, Zt is a threshold set in advance for the consonant
determination; experimental results show that about 40 is a reasonable value when the sampling
frequency is 10 kHz and the frame length is 20 ms. The number of frames determined to be high in
consonant property is accumulated over a fixed number of frames. Let this cumulative value be NZ.
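The zero-crossing step can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, treating a zero sample as non-negative is an assumption, and the default Zt = 40 is the value quoted above for 10 kHz sampling and 20 ms frames (200 samples per frame).

```python
def count_zero_crossings(frame):
    """Z0: number of sign changes between consecutive samples.

    A sample equal to zero is treated as non-negative (assumption).
    """
    crossings = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            crossings += 1
    return crossings


def is_high_consonant(frame, zt=40):
    """Frame is judged highly consonant when Z0 exceeds the threshold Zt."""
    return count_zero_crossings(frame) > zt


def count_consonant_frames(frames, zt=40):
    """NZ: number of highly consonant frames in a block of frames."""
    return sum(1 for frame in frames if is_high_consonant(frame, zt))


# Synthetic 200-sample frames (10 kHz x 20 ms).
fricative_like = [1, -1] * 100                             # rapid sign changes
vowel_like = [1] * 50 + [-1] * 50 + [1] * 50 + [-1] * 50   # slow oscillation
nz = count_consonant_frames([fricative_like, vowel_like, fricative_like])
```

The synthetic frames only stand in for fricative-like and vowel-like waveforms; real speech frames would come from the (L + R) signal described above.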
[0013]
The sound/silence determination unit 3 uses the power value obtained by the power
calculation unit 1 to make a sound/silence determination for each frame. Here, letting the power
value of the current frame be P and the threshold of the sound/silence determination be Pt, the
frame is determined to be silent if P < Pt is satisfied, and the number of frames determined to be
silent is accumulated over a fixed plurality of frames. Let this accumulated number of frames be
Np.
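A minimal Python sketch of the per-frame power and silence decision follows. The function names and the threshold value 0.05 are assumptions for illustration; the average of absolute amplitudes is one of the two power definitions the text allows, and silence is taken here as P < Pt.

```python
def frame_power(frame):
    """P: average absolute amplitude of the frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return sum(abs(sample) for sample in frame) / len(frame)


def is_silent(frame, pt):
    """Frame is judged silent when its power P is below the threshold Pt."""
    return frame_power(frame) < pt


def count_silent_frames(frames, pt):
    """Np: number of silent frames in a block of frames."""
    return sum(1 for frame in frames if is_silent(frame, pt))


loud = [0.5, -0.5, 0.5, -0.5]
quiet = [0.01, -0.01, 0.0, 0.01]
np_count = count_silent_frames([loud, quiet, quiet], pt=0.05)  # Pt is illustrative
```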
[0014]
Here, the threshold value Pt is a value set in advance, but may be determined adaptively
according to the fluctuation of the input level. The above processing is processing in units of one
frame.
[0015]
The following processing is performed with a plurality of frames, F frames, as one unit.
Here, the processing interval F is the minimum unit over which the features of voice can be
confirmed; for continuously uttered speech, it may in practice be set to a value that contains two
or three syllables on average (for example, 1 to 2 seconds). The larger the value of F, the more
accurately speech likeness can be detected, but the longer the determination takes, so F is chosen
as a trade-off between the two.
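As a worked example of the framing arithmetic, the sketch below converts the figures mentioned in the text (10 kHz sampling, 20 ms frames, 1 to 2 second blocks) into sample and frame counts. The helper names and the choice of non-overlapping frames are assumptions; the text does not say whether frames overlap.

```python
def samples_per_frame(sampling_hz, frame_ms):
    """Number of samples in one frame."""
    return int(round(sampling_hz * frame_ms / 1000))


def frames_per_block(block_seconds, frame_ms):
    """F: number of frames in one decision block."""
    return int(round(block_seconds * 1000 / frame_ms))


def split_into_frames(samples, frame_len):
    """Split into consecutive non-overlapping frames, dropping a trailing partial frame."""
    count = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(count)]


frame_len = samples_per_frame(10_000, 20)   # 200 samples
f_short = frames_per_block(1.0, 20)         # 50 frames per 1-second block
f_long = frames_per_block(2.0, 20)          # 100 frames per 2-second block
frames = split_into_frames(list(range(450)), frame_len)
```

Under these assumptions a decision is produced once every 50 to 100 frames, which makes the accuracy versus latency trade-off on F concrete.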
[0016]
From the accumulated number NZ of frames determined to be high in consonant property in this
F-frame interval and the accumulated number Np of frames determined to be silent, the existence
ratio of highly consonant frames in the F-frame interval is given as NZ / F, and the existence ratio
of silent frames in the F-frame interval is given as Np / F.
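The two existence ratios can be computed from per-frame flags as in the following small sketch. The function name and the flag-list representation are assumptions made for illustration.

```python
def existence_ratios(consonant_flags, silent_flags):
    """Return (NZ / F, Np / F) for one block of F frames.

    Each argument holds one boolean per frame in the block.
    """
    f = len(consonant_flags)
    if f == 0 or len(silent_flags) != f:
        raise ValueError("both flag lists must be non-empty and of equal length")
    nz = sum(1 for flag in consonant_flags if flag)
    np_count = sum(1 for flag in silent_flags if flag)
    return nz / f, np_count / f


# A block of F = 10 frames with 2 highly consonant frames and 3 silent frames.
consonant_flags = [True, False, False, False, True, False, False, False, False, False]
silent_flags = [False, True, False, False, False, True, True, False, False, False]
consonant_ratio, silence_ratio = existence_ratios(consonant_flags, silent_flags)
```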
[0017]
In addition, the stationarity determination unit 5 detects the maximum value and the minimum
value of the power over each F-frame interval and calculates their difference value Pd.
Since continuously uttered voice is a repetition of vowels, consonants, and silences, the
change in power, that is, the value of Pd, is naturally large when viewed over a certain time
interval (in this case, F frames). Therefore, speech likeness is judged from the value of Pd.
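The stationarity measure Pd is simply the spread of frame powers within a block, as the short Python sketch below shows. The function name and the sample power values are illustrative assumptions, not measured data.

```python
def power_difference(frame_powers):
    """Pd: maximum frame power minus minimum frame power within one block."""
    if not frame_powers:
        raise ValueError("a block must contain at least one frame power")
    return max(frame_powers) - min(frame_powers)


# Speech alternates vowels, consonants, and pauses, so power swings widely.
speech_like_powers = [0.02, 0.60, 0.05, 0.45, 0.01]
# Steady music keeps a nearly constant level.
music_like_powers = [0.40, 0.42, 0.38, 0.41]

pd_speech = power_difference(speech_like_powers)
pd_music = power_difference(music_like_powers)
```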
[0018]
The voice determination unit 6 uses NZ, Np, and Pd obtained by the consonant determination unit 4,
the sound/silence determination unit 3, and the stationarity determination unit 5, respectively,
and determines that the signal is voice when the conditions on the existence ratio of highly
consonant frames, the existence ratio of silent frames, and the power difference value, that is,
all of the determination formulas shown below, are satisfied.
[0019]
a < (NZ / F) < b
(Np / F) > c
Pd > Pdtv
Here, a, b, c, and Pdtv are threshold values for the respective parameters of the voice/non-voice
determination, and their optimum values are determined experimentally.
a and b are the lower and upper thresholds, respectively, of the existence ratio of highly consonant
frames, c is the threshold of the existence ratio of silent frames, and Pdtv is the threshold for
measuring the degree of change of power. According to the above process, when silent frames
and consonant frames each exist in at least a predetermined amount within the F frames and the
change in power is large, the source is judged highly likely to be voice and is determined to be
voice. If any one of these three conditions is not met, the possibility of voice is judged to be
low and the signal is determined to be non-voice. The determination result is output continuously
from the voice determination unit 6 at a cycle of F frames.
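The three-condition decision of this embodiment can be sketched in Python as below. The function name and all threshold values are assumptions chosen only so the example runs; the text states that the actual thresholds a, b, c, and Pdtv are determined experimentally.

```python
def is_voice(nz, np_count, pd, f, a, b, c, pdtv):
    """Embodiment 1 decision for one block of F frames.

    Voice only if a < NZ/F < b, Np/F > c, and Pd > Pdtv all hold;
    any failed condition yields non-voice.
    """
    consonant_ratio = nz / f
    silence_ratio = np_count / f
    return (a < consonant_ratio < b) and (silence_ratio > c) and (pd > pdtv)


# Illustrative thresholds (assumed, not taken from the source).
thresholds = dict(a=0.05, b=0.5, c=0.1, pdtv=0.3)

# Speech-like block: some consonant frames, some pauses, large power swing.
speech_decision = is_voice(nz=10, np_count=15, pd=0.59, f=50, **thresholds)

# Music-like block: no pauses and nearly constant power.
music_decision = is_voice(nz=5, np_count=0, pd=0.04, f=50, **thresholds)
```

Because none of the three features depends on a difference between the L and R channels, the same decision applies to monaural and stereo input.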
[0020]
As described above, according to the present embodiment, the acoustic signal is determined to be
voice when the existence ratio of consonants, the existence ratio of silence, and the difference
value between the maximum value and the minimum value of the acoustic power are each larger than
predetermined values. This makes it possible to determine whether or not the acoustic signal is
voice regardless of whether the signal is monaural or stereo.
[0021]
(Embodiment 2) An embodiment of the present invention according to claim 2 will be described
below.
The configuration of this embodiment is the same as that shown in the block diagram of FIG. 1.
The present embodiment differs from the first embodiment in the discrimination operation of the
voice determination unit 6. The determination that the acoustic signal is voice is made in the same
way as in Embodiment 1: the signal is determined to be voice when the consonant existence ratio,
the silence existence ratio, and the maximum/minimum difference value of power satisfy
a < 100 × NZ / F < b, (Np / F) > c, and Pd > Pdtv, respectively. In the determination of
non-voice, on the other hand, particularly when non-voice is limited to music, there is almost no
silent section and the change in power is small (stationary). The signal is therefore determined
to be non-voice (music) only when the conditions (Np / F) < d and Pd < Pdtu are satisfied. Here, d
is the threshold of the existence ratio of silent sections for the non-voice determination, Pdtu is
the threshold for measuring the degree of power change for the non-voice determination, and these
are set relative to the thresholds c and Pdtv so that d < c and Pdtu < Pdtv.
[0022]
If neither the voice condition nor the non-voice condition is satisfied, the signal is judged
impossible to classify and the result is indeterminate. Classifying such cases as indeterminate
prevents erroneous determinations, and holding the previous determination result as it is while
the result is indeterminate prevents the voice/non-voice determination from switching back and
forth within a short time.
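The three-way decision and the hold-on-indeterminate behavior can be sketched in Python as follows. The names, the threshold values, and the use of plain ratios (rather than the percentage form that appears in this embodiment's consonant formula) are assumptions for illustration. The state reported before any definite result arrives is also an assumption; here it stays indeterminate.

```python
VOICE, NON_VOICE, INDETERMINATE = "voice", "non-voice", "indeterminate"


def classify_block(consonant_ratio, silence_ratio, pd, a, b, c, pdtv, d, pdtu):
    """Embodiment 2 decision: voice, non-voice (music), or indeterminate."""
    if a < consonant_ratio < b and silence_ratio > c and pd > pdtv:
        return VOICE
    # Stricter thresholds (d < c, Pdtu < Pdtv) are used for non-voice.
    if silence_ratio < d and pd < pdtu:
        return NON_VOICE
    return INDETERMINATE


def hold_previous(results, initial=INDETERMINATE):
    """Replace each indeterminate result with the last definite one."""
    held = []
    current = initial
    for result in results:
        if result != INDETERMINATE:
            current = result
        held.append(current)
    return held


# Illustrative thresholds (assumed), satisfying d < c and Pdtu < Pdtv.
th = dict(a=0.05, b=0.5, c=0.1, pdtv=0.3, d=0.02, pdtu=0.1)

raw = [
    classify_block(0.20, 0.30, 0.59, **th),  # speech-like block
    classify_block(0.20, 0.05, 0.20, **th),  # ambiguous block
    classify_block(0.10, 0.00, 0.04, **th),  # steady, music-like block
    classify_block(0.20, 0.05, 0.20, **th),  # ambiguous block
]
smoothed = hold_previous(raw)
```

The gap between the two threshold pairs acts as a dead band, so blocks near the boundary keep the previous label instead of toggling.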
[0023]
As described above, according to the present embodiment, the acoustic signal is determined to be
voice when the existence ratio of consonants, the existence ratio of silence, and the
maximum/minimum difference value of the acoustic power are each larger than predetermined values.
Therefore, whether or not the signal is voice can be determined regardless of whether the acoustic
signal is monaural or stereo. When the signal is not determined to be voice, it is determined to be
non-voice, such as music, if the existence ratio of silence and the maximum/minimum difference
value are smaller than thresholds set lower than those for voice, and in all other cases it is
determined to be indeterminate.
[0024]
(Embodiment 3) Hereinafter, an embodiment of a sound reproduction apparatus according to the
present invention as claimed in claims 3 and 4 will be described with reference to the drawings.
FIG. 2 is a block diagram showing the configuration of this embodiment. In the figure, reference
numeral 7 denotes a voice/music discrimination unit, which outputs at regular intervals a
determination result as to whether the section is voice or music. Reference numeral 8 denotes a
frequency characteristic control unit, which switches step by step to a frequency characteristic
suitable for voice or music based on the determination result of the voice/music discrimination
unit 7. FIG. 3 shows an example of the frequency characteristics switched by the frequency
characteristic control unit 8.
[0025]
The operation of the above configuration will be described. First, the voice/music
discrimination unit 7 receives the (L + R) signal, makes a determination of voice, music, or
indeterminate every fixed period (F-frame interval), and outputs the result to the frequency
characteristic control unit 8. The operation of the voice/music discrimination unit 7 is the same
as the operation of the voice discrimination device in the first embodiment, so its description
is omitted. Here, non-voice is treated as music. In the frequency characteristic control
unit 8, for example, ten frequency characteristics set in advance as shown in FIG. 3 are prepared,
and control is performed so that the frequency characteristic finally becomes characteristic 1 if
the input signal is a voice source and characteristic 10 if it is a music source.
[0026]
Now, assume that characteristic 5 is set as the initial state of the frequency characteristic.
When a determination result of voice is received from the voice/music discrimination unit 7, the
characteristic is changed to characteristic 4, one step closer to characteristic 1, which is
intended for voice. When a determination result of music is received, the characteristic is moved
one step closer to characteristic 10 and changed to characteristic 6. In the case of an
indeterminate result, the current characteristic 5 is maintained. By repeating this operation
based on the voice/music determination result sent every F frames, the characteristic gradually
approaches one suitable for voice reproduction if, for example, the determination result of voice
continues; finally characteristic 1 is set, and the state is held there until a determination
result of music is next received.
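The stepwise control can be sketched in Python as a small state update. The function name and the result labels are assumptions; the ten characteristics, the initial state of 5, and the one-step moves toward 1 (voice) or 10 (music) follow the description above.

```python
def next_characteristic(current, result, lowest=1, highest=10):
    """Move one step toward 1 for voice, toward 10 for music, else stay put."""
    if result == "voice":
        return max(lowest, current - 1)
    if result == "music":
        return min(highest, current + 1)
    return current  # indeterminate: keep the current characteristic


def run_control(results, initial=5):
    """Apply successive block decisions and return the characteristic after each."""
    history = []
    current = initial
    for result in results:
        current = next_characteristic(current, result)
        history.append(current)
    return history


# Six consecutive voice decisions, an indeterminate one, then music.
history = run_control(["voice"] * 6 + ["indeterminate", "music"])
```

Starting from characteristic 5, repeated voice decisions walk down to 1 and stay there, and a single music decision moves the setting only one step back, which is what avoids an abrupt audible change.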
[0027]
As described above, according to the present embodiment, providing the voice/music discrimination
unit 7, which discriminates whether the source is voice or music, and the frequency characteristic
control unit 8, which gradually approaches the frequency characteristic suitable for the source
based on the discrimination result, allows the frequency characteristic of the device to be
changed automatically to one suitable for the input source, so that an easy-to-hear sound
reproduction device can be realized. In addition, because the characteristic is switched in stages
rather than all at once to the characteristic optimum for voice or music, changes in frequency
characteristic cause no sense of discomfort.
[0028]
The voice / music discrimination device may be any of the voice discrimination devices of the
present invention according to claim 1 or claim 2.
[0029]
As is apparent from the above description, the invention according to claim 1 comprises: a power
calculation unit that calculates the acoustic power of the acoustic signal for each frame of a
fixed time; a sound/silence determination unit that determines whether the frame is sound or
silence by comparing the calculated acoustic power value with a preset threshold; a zero crossing
calculation unit that calculates the number of zero crossings of the waveform of the acoustic
signal for each frame; a consonant determination unit that determines the consonant property of
the frame by comparing the calculated number of zero crossings with a preset threshold; a
stationarity determination unit that detects the maximum value and the minimum value of the power
values in a predetermined plurality of consecutive frames and calculates their difference value;
and a voice determination unit that determines the acoustic signal in the plurality of frames to
be voice if the existence ratio of frames determined to be silent in the plurality of frames, the
existence ratio of frames determined to be high in consonant property, and the difference value
are all larger than their respective preset thresholds, determines it to be non-voice otherwise,
and outputs the determination result for each plurality of frames. With this configuration,
whether or not the acoustic signal is voice can be determined regardless of whether the signal is
monaural or stereo.
The invention according to claim 2 comprises the same power calculation unit, sound/silence
determination unit, zero crossing calculation unit, consonant determination unit, and stationarity
determination unit, together with a voice determination unit that determines the acoustic signal
in the plurality of frames to be voice if the existence ratio of frames determined to be silent,
the existence ratio of frames determined to be high in consonant property, and the difference
value are all larger than their respective preset thresholds, determines it to be non-voice if it
is not determined to be voice and the existence ratio of frames determined to be silent and the
difference value are each smaller than thresholds set in advance to be smaller than the
aforementioned thresholds, determines it to be indeterminate otherwise, and outputs the
determination result for each plurality of frames. With this configuration, it can be determined
whether the acoustic signal is voice, non-voice, or indeterminate, regardless of whether the
signal is monaural or stereo.
The invention according to claim 3 and claim 4 comprises a voice discrimination device that
receives an acoustic signal and performs voice/non-voice discrimination, and receives the acoustic
signal and the voice/non-voice discrimination result output by the voice discrimination device at
every predetermined time, changes the frequency characteristic stepwise to the characteristic
optimum for the acoustic signal according to the discrimination result, and outputs the signal.
With this configuration, regardless of whether the signal is monaural or stereo, it can be
reproduced with a frequency characteristic that automatically corresponds to whether or not it is
voice.