JP2016080750

Patent Translate
Powered by EPO and Google
Notice
This translation is machine-generated. It cannot be guaranteed that it is intelligible, accurate,
complete, reliable or fit for specific purposes. Critical decisions, such as commercially relevant or
financial decisions, should not be based on machine-translation output.
DESCRIPTION JP2016080750
Abstract: A voice from any direction is accurately recognized without the delay associated with
estimating a sound source direction. According to one embodiment, a speech recognition
apparatus (1) includes: a directivity control unit (12) that acquires speech streams from a
plurality of directions; a first speech recognition processing unit (13) that executes speech
recognition on each of the speech streams from the plurality of directions acquired by the
directivity control unit (12); and a sound source direction determination unit (14) that, when the
first speech recognition processing unit (13) obtains a speech recognition result that meets a
predetermined reliability criterion, determines the direction corresponding to the speech stream
for which that result was obtained as the sound source direction. [Selected figure] Figure 1
Speech recognition apparatus, speech recognition method, and speech recognition program
[0001]
The present invention relates to a speech recognition apparatus that recognizes speech from any
direction, a speech recognition method, and a speech recognition program.
[0002]
In recent years, techniques for performing device operations, information acquisition, and
dialogue by voice recognition are becoming widespread.
In particular, when making a device such as a robot execute voice recognition and execute
03-05-2019
1
processing based on the result of voice recognition, it is required to be able to accurately
recognize voice coming from any direction with respect to the device. An apparatus is known
which estimates the sound source direction for such purpose and sets the pointing direction of
the microphone array in the sound source direction.
[0003]
For example, Patent Document 1 describes generating, in addition to a spatial filter having a
blind spot directed toward the sound source direction, a spatial filter having directivity directed
toward the sound source direction, determining the pattern of direction and gain for each filter,
and estimating the sound source direction using both patterns. Patent Document 2 describes
using the MUSIC method to estimate a sound source direction. Patent Document 3 describes
detecting the sound of a speaker clapping their hands as a cue sound in order to set the pointing
direction of a microphone array.
[0004]
Patent Document 1: JP 2012-150237 A
Patent Document 2: JP 2010-121975 A
Patent Document 3: International Publication No. 2011/054510
[0005]
However, with the methods described in Patent Documents 1 and 2, the pointing direction may
be set toward a sound source other than the voice of the device user, such as ambient noise.
In addition, estimating the sound source direction requires observing the audio signal over a
certain period. The method described in Patent Document 3 can prevent the pointing direction
from being set toward ambient noise and the like, but it requires the user to perform an
operation such as clapping. In each case, speech recognition starts only after the sound source
direction has been estimated, so it takes time to obtain the speech recognition result, and the
value perceived by the user is impaired.
[0006]
The present invention has been made in view of the above problems, and an object thereof is to
provide a speech recognition apparatus, a speech recognition method, and a speech recognition
program that can accurately recognize speech from any direction without the delay
accompanying estimation of a sound source direction.
[0007]
A speech recognition apparatus according to the present invention comprises: speech acquisition
means for acquiring speech streams from a plurality of directions; speech recognition processing
means for executing speech recognition on each of the speech streams from the plurality of
directions acquired by the speech acquisition means; and sound source direction determining
means for determining, when the speech recognition processing means obtains a speech
recognition result satisfying a predetermined reliability criterion, the direction corresponding to
the speech stream for which that speech recognition result was obtained as the sound source
direction.
[0008]
A speech recognition method according to the present invention is a speech recognition method
executed by a speech recognition device, comprising: a speech acquisition step of acquiring
speech streams from a plurality of directions; a speech recognition processing step of executing
speech recognition on each of the speech streams from the plurality of directions acquired in the
speech acquisition step; and a sound source direction determining step of determining, when a
speech recognition result satisfying a predetermined reliability criterion is obtained in the speech
recognition processing step, the direction corresponding to the speech stream for which that
speech recognition result was obtained as the sound source direction.
[0009]
A speech recognition program according to the present invention causes a computer to function
as: speech acquisition means for acquiring speech streams from a plurality of directions; speech
recognition processing means for executing speech recognition on each of the speech streams
from the plurality of directions acquired by the speech acquisition means; and sound source
direction determining means for determining, when the speech recognition processing means
obtains a speech recognition result satisfying a predetermined reliability criterion, the direction
corresponding to the speech stream for which that speech recognition result was obtained as the
sound source direction.
[0010]
In the speech recognition apparatus according to the present invention, the speech recognition
processing means obtains speech recognition results for each of the speech streams from a
plurality of directions acquired by the speech acquisition means.
At the same time, when the speech recognition processing means obtains a speech recognition
result satisfying the predetermined reliability criterion, the sound source direction determining
means determines the direction corresponding to the speech stream for which that speech
recognition result was obtained as the sound source direction.
As described above, according to the voice recognition apparatus, it is possible to determine the
sound source direction while performing voice recognition continuously, instead of determining
the sound source direction and then starting the voice recognition.
That is, it is possible to perform speech recognition on speech from any direction without delay
associated with estimation of the sound source direction.
In addition, after the sound source direction is determined, it becomes possible, for example, to
perform more accurate speech recognition on the speech stream from the determined sound
source direction, which improves the speech recognition accuracy. Therefore,
according to the voice recognition apparatus, voice from any direction can be recognized with
high accuracy without the delay accompanying estimation of the sound source direction.
[0011]
The speech recognition device may further comprise second speech recognition processing
means for executing, on the speech stream from the sound source direction determined by the
sound source direction determining means among the speech streams acquired by the speech
acquisition means, speech recognition with higher accuracy than the speech recognition by the
speech recognition processing means.
[0012]
According to the above voice recognition apparatus, when the sound source direction is
determined by the sound source direction determining means, the second voice recognition
processing means can perform more accurate voice recognition on the voice stream from that
sound source direction.
[0013]
In the above speech recognition apparatus, the speech acquisition unit may acquire speech
streams corresponding to the beam direction of each directional beam by setting the directional
beams in a plurality of predetermined directions.
[0014]
According to the voice recognition device, voice streams from each of a plurality of
predetermined directions (fixed directions) can be obtained with high accuracy.
[0015]
In the above speech recognition apparatus, the speech acquisition means may acquire an audio
stream corresponding to the beam direction of each directional beam by setting the directional
beams in a plurality of directions that are candidates for the sound source direction estimated
by a predetermined method.
[0016]
According to the above speech recognition apparatus, by setting directional beams in a plurality
of directions that are candidates for the sound source direction estimated by, for example, the
MUSIC method, audio streams from directions that are highly likely to be the sound source
direction can be acquired preferentially, and the accuracy of speech recognition can be
improved.
[0017]
In the above speech recognition apparatus, the speech recognition processing means may
determine that the speech recognition result satisfies the predetermined reliability criterion
when a predetermined word is included in the speech recognition result.
[0018]
According to the above speech recognition apparatus, it is possible to simply and accurately
determine the reliability of the speech recognition result based on whether or not a
predetermined word is speech-recognized.
[0019]
In the above speech recognition apparatus, the speech recognition processing means may
execute a speech zone detection process for detecting a speech zone, and may execute speech
recognition on the speech zone detected by the speech zone detection process.
[0020]
According to the above speech recognition apparatus, it becomes possible to execute speech
recognition only for the speech segment detected by the speech segment detection process.
As a result, it is possible to prevent the execution of useless speech recognition processing for
sections other than the speech section in the speech stream, and power consumption can be
reduced.
[0021]
According to the present invention, it is possible to accurately recognize speech from any
direction without delay associated with estimation of the sound source direction.
[0022]
It is a block diagram showing functional composition of a speech recognition device concerning
an embodiment of the present invention.
It is a block diagram which shows the hardware constitutions of a speech recognition apparatus.
It is a figure which shows an example of the beam direction set by several microphones.
It is a flowchart which shows the operation of a speech recognition apparatus.
It is a block diagram which shows the module structure of a speech recognition program.
[0023]
Hereinafter, a speech recognition apparatus, a speech recognition method, and a speech
recognition program according to an embodiment of the present invention will be described with
reference to the drawings.
Where possible, the same parts will be denoted by the same reference symbols, without
redundant description.
FIG. 1 is a block diagram showing a functional configuration of the speech recognition apparatus
1 according to the present embodiment.
As shown in FIG. 1, the speech recognition apparatus 1 includes a speech input unit 11, a
directivity control unit 12, a first speech recognition processing unit 13, a sound source direction
determination unit 14, a second speech recognition processing unit 15, and a speech
recognition result output unit 16.
[0024]
The voice recognition device 1 is configured as a device that performs voice recognition of the
user's speech and performs processing according to the voice recognition result.
For example, the voice recognition device 1 is installed in a living room in the home and is
configured as a device that recognizes the user's speech and instructs a home appliance or the
like, by radio wave or the like, to execute processing according to the voice recognition result.
Alternatively, it may be incorporated in the device itself that executes processing according to
the result of speech recognition.
In addition, the speech recognition apparatus 1 may be configured as a speech dialogue
apparatus (for example, a robot or the like) that presents the user with the result of a response to
an inquiry from the user by using text and voice.
[0025]
FIG. 2 is a block diagram showing an example of the hardware configuration of the speech
recognition apparatus 1. As shown in FIG. 2, the speech recognition apparatus 1 has, as its
hardware configuration, for example, a central processing unit (CPU) 10A, a random access
memory (RAM) 10B, a read only memory (ROM) 10C, an input device 10D, a communication
device 10E such as a radio wave module that communicates with an external device, an auxiliary
storage device 10F, and an output device 10G. The input device 10D includes a plurality of
microphones corresponding to the voice input unit 11, and further includes, for example, a
keyboard and a mouse which are input devices. The output device 10G is, for example, a display
that outputs the response result as text, a speaker that outputs the response result as sound, or
the like. Each function of the speech recognition apparatus 1 is realized, for example, by reading
a speech recognition program P described later in the RAM 10B or the like and causing the CPU
10A to execute the speech recognition program P.
[0026]
The speech recognition apparatus 1 does not necessarily have to have all the hardware
configurations described above. For example, when the speech recognition apparatus 1 does not
have the function of outputting the response result as text and speech, the speech recognition
apparatus 1 may not include the output device 10G. Also, the speech recognition apparatus 1
may be configured physically as a single device, or may be configured such that a plurality of
physically separated devices operate in coordination.
[0027]
The voice input unit 11 is a voice input unit that collects sounds around the voice recognition
device 1 and acquires the sounds as signals of a plurality of channels (a plurality of frequency
bands). The voice input unit 11 is configured of, for example, a plurality of microphones.
[0028]
The directivity control unit 12 is an audio acquisition unit that acquires each of audio streams
from a plurality of directions. The directivity control unit 12 executes signal processing that
emphasizes only the voice arriving from a preset beam direction and suppresses the voice
arriving from other directions, using a known method such as a fixed beamformer, for
example. More specifically, the directivity control unit 12 executes the above-described signal
processing on the signals of the plurality of channels obtained from the voice input unit 11 and
thereby generates, for each of the plurality of beam directions, a speech stream in which only
the speech arriving from that beam direction is enhanced and speech arriving from the other
beam directions is suppressed.
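The fixed beamformer mentioned above can be illustrated with a delay-and-sum sketch. This is only one possible instance of such a known method, not necessarily the one used in the embodiment. The microphone coordinates, the speed of sound, the sample rate, the far-field plane-wave model, and the function names `steering_delays` and `delay_and_sum` are all assumptions introduced for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # metres per second (assumed room-temperature value)


def steering_delays(mic_positions, beam_deg, sample_rate):
    """Integer sample delays that align a far-field wave arriving from beam_deg.

    mic_positions is a list of (x, y) coordinates in metres (hypothetical layout).
    """
    ux = math.cos(math.radians(beam_deg))
    uy = math.sin(math.radians(beam_deg))
    # A microphone located further toward the source hears the wavefront earlier,
    # so it has to be delayed more to line up with the other microphones.
    advance = [(x * ux + y * uy) / SPEED_OF_SOUND * sample_rate
               for x, y in mic_positions]
    base = min(advance)
    return [round(a - base) for a in advance]


def delay_and_sum(channels, delays):
    """Delay each channel by its steering delay and average the aligned samples."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for samples, d in zip(channels, delays):
            if 0 <= t - d < n:
                acc += samples[t - d]
        out.append(acc / len(channels))
    return out
```

Running `delay_and_sum` once per beam direction on the same multi-channel input yields one directional stream per direction: sound arriving from the steered direction adds coherently, while sound from other directions is attenuated by the averaging.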
[0029]
The directivity control unit 12 may obtain an audio stream corresponding to the beam direction
of each directional beam by setting the directional beams in a plurality of predetermined
directions. That is, the plurality of beam directions set by the directivity control unit 12 may be
fixed beam directions set in advance. As a result, audio streams from each of a plurality of
predetermined directions (fixed directions) can be accurately obtained. In particular, when it is
not known in advance from which direction the sound comes, it is sufficient to set the beam
direction so that a plurality of beams cover all directions as shown in FIG. In the example of FIG.
3, the beam directions are set in eight directions divided at 45 degree intervals in the horizontal
direction with the speech recognition device 1 as the center. By setting the beam directions in
this manner, the directivity control unit 12 can generate an audio stream that emphasizes the
voice coming from the sound source X located ahead of the beam direction a, which is set from
the speech recognition apparatus 1 toward the upper right in the figure.
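As a small illustration of the eight-direction layout described above, the following sketch generates evenly spaced beam directions and finds the beam closest to a given source bearing. The angle convention (degrees, measured in the horizontal plane) and the function names are assumptions made for this example.

```python
def beam_directions(count=8, start_deg=0.0, span_deg=360.0):
    """Evenly spaced beam directions in degrees; the defaults give 45-degree steps."""
    step = span_deg / count
    return [(start_deg + i * step) % 360.0 for i in range(count)]


def nearest_beam(source_deg, beams):
    """Index of the beam whose direction is angularly closest to the source."""
    def gap(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)
    return min(range(len(beams)), key=lambda i: gap(source_deg, beams[i]))
```

With the defaults, a source at a bearing of 50 degrees falls closest to the 45-degree beam, so the stream for that beam is the one in which the source is emphasized most.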
[0030]
Further, in the case where the candidates for the source (sound source direction) of the voice
input to the voice recognition device 1 are limited to a certain range, the directivity control unit
12 may set the beam directions so that the plurality of beams cover only that range.
For example, it is assumed that the voice recognition apparatus 1 is incorporated in a television
receiver, and the user instructs the television receiver to execute a predetermined operation (for
example, channel change) by speech. In this case, it is assumed that the position of the user is in
front of the screen of the television receiver. That is, the candidates for the sound source
direction are limited to the range of 180 degrees in front of the screen of the television receiver.
Therefore, in this case, the directivity control unit 12 may set the beam direction so as to cover
only the range of 180 degrees in front of the screen of the television receiver. Although the
number of beam directions to be set depends on the number of microphones and the sharpness
of directivity set by signal processing, etc., it is possible to set more beam directions than the
number of microphones.
[0031]
Further, the directivity control unit 12 may acquire an audio stream corresponding to the beam
direction of each directional beam by setting the directional beams in a plurality of directions
that are candidates for the sound source direction estimated by a predetermined method. That
is, the directivity control unit 12 may set, as the plurality of beam directions, a plurality of
candidates for the sound source direction estimated by, for example, the MUSIC method using
the acoustic signal input to the audio input unit 11. Alternatively, the directivity control unit 12
may set, as the plurality of beam directions, a plurality of candidates for the sound source
direction estimated by sound source direction tracking using a method such as a Kalman filter or
a particle filter in addition to the sound source direction estimation by the MUSIC method or the
like. When a plurality of beam directions are set in this
manner, the plurality of beam directions are set and changed depending on the acoustic signal,
unlike the case of using the above-described fixed beamformer. By setting the directional
beams in a plurality of directions that are candidates for the sound source direction estimated
by the MUSIC method or the like in this way, audio streams from directions that are highly likely
to be the sound source direction can be acquired preferentially, and the accuracy of speech
recognition can be improved.
[0032]
The first speech recognition processing unit 13 is a speech recognition processing unit that
executes speech recognition on each of the speech streams from a plurality of directions
acquired by the directivity control unit 12. Hereinafter, each of the audio streams from a plurality
of directions acquired by the directivity control unit 12 is also referred to as a directional audio
stream. A plurality of directional audio streams are continuously acquired by the above-described
processing of the audio input unit 11 and the directivity control unit 12, and are input to the first
speech recognition processing unit 13. For this reason, the first speech recognition processing
unit 13 executes speech recognition continuously for each of the plurality of directional speech
streams.
[0033]
The directional audio stream may contain only noise, not human voice. In addition, the
directional audio stream may include only a section that is substantially silent. Therefore, in
order to prevent the first speech recognition processing unit 13 from falsely recognizing the
noise contained in the directional speech stream as human voice and obtaining an erroneous
speech recognition result, the first speech recognition processing unit 13 may apply a
well-known noise removal process to each directional speech stream prior to performing speech
recognition. In addition, the first speech recognition processing unit 13 may perform a known
speech section detection process on the directional speech stream in order to detect a speech
section (a section containing a human voice) on which speech recognition is to be performed,
and may perform speech recognition on the speech section detected by the speech section
detection process. As a result, it is possible to prevent the execution of useless speech
recognition processing on sections other than the speech section in the directional speech
stream, and power consumption can be reduced.
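The speech section detection step can be sketched with a simple frame-energy detector. The text only refers to a known detection process, so the energy criterion, the frame length, the threshold value, and the function name `detect_speech_sections` are assumptions chosen for illustration; practical detectors are usually more elaborate.

```python
def detect_speech_sections(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample ranges whose frame energy exceeds the threshold.

    Consecutive active frames are merged into one section. Trailing samples that
    do not fill a whole frame are ignored.
    """
    sections = []
    start = None
    for begin in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[begin:begin + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            if start is None:
                start = begin          # a new active section begins here
        elif start is not None:
            sections.append((start, begin))
            start = None
    if start is not None:
        # The stream ended while still inside an active section.
        sections.append((start, begin + frame_len))
    return sections
```

Only the returned ranges would then be passed to the recognizer, which is how the recognition work on silent or noise-only parts of a stream is avoided.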
[0034]
The first speech recognition processing unit 13 determines whether the speech recognition result
obtained by performing speech recognition on each directional speech stream (or on each
directional speech stream after the above-described noise removal process, speech section
detection process, and the like) satisfies a predetermined reliability criterion. As the
reliability used for such determination, for example, a reliability based on the likelihood of the
output hypothesis, which is a well-known index in the field of statistical speech recognition, can
be used.
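A likelihood-based reliability check could look like the following sketch. The text does not specify which likelihood-based measure is used, so the normalization over an N-best list of hypothesis log-likelihoods, the threshold of 0.8, and the function names are assumptions for illustration only.

```python
import math


def top_hypothesis_confidence(log_likelihoods):
    """Share of probability mass held by the best hypothesis in an N-best list.

    Computed as a softmax over the log-likelihoods, shifted by the maximum
    for numerical stability.
    """
    best = max(log_likelihoods)
    weights = [math.exp(ll - best) for ll in log_likelihoods]
    # The best hypothesis has weight exp(0) == 1 after the shift.
    return 1.0 / sum(weights)


def meets_reliability(log_likelihoods, threshold=0.8):
    """True when the best hypothesis clearly dominates its competitors."""
    return top_hypothesis_confidence(log_likelihoods) >= threshold
```

A result whose best hypothesis is far more likely than the alternatives passes the check, while a result with several near-equal hypotheses does not.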
[0035]
When one or more command words to be uttered (voice input) are determined in advance in
order to cause the speech recognition device 1 to recognize the user's speech and execute
predetermined processing, the first speech recognition processing unit 13 may determine that
the speech recognition result satisfies the predetermined reliability criterion when a
predetermined command word is included in the speech recognition result. In this case, the first
speech recognition processing unit 13 may set only the command words as the vocabulary
targeted for speech recognition. The first speech recognition processing unit 13 can then
determine that the speech recognition result meets the predetermined reliability criterion
whenever speech recognition of a directional speech stream yields a speech recognition result
(that is, a result indicating that one of the command words has been recognized). In other words,
the reliability of the speech recognition result can be determined easily and accurately based on
whether or not a command word is speech-recognized.
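The command-word criterion can be sketched as a lookup against a fixed list. The command words below and the function names are hypothetical, and the substring match on normalized text is a simplification assumed for this example rather than the matching method of the embodiment.

```python
# Hypothetical command vocabulary; the source does not list concrete command words.
COMMAND_WORDS = {"lights on", "lights off", "channel up"}


def find_command(recognized_text, command_words=COMMAND_WORDS):
    """Return the command word contained in the recognized text, or None."""
    text = " ".join(recognized_text.lower().split())  # normalize case and spacing
    # Check longer commands first so that a longer phrase wins over a shorter one.
    for word in sorted(command_words, key=len, reverse=True):
        if word in text:
            return word
    return None


def is_reliable(recognized_text, command_words=COMMAND_WORDS):
    """Reliability criterion: the result contains a predetermined command word."""
    return find_command(recognized_text, command_words) is not None
```

For example, `find_command("Please turn the  Lights ON")` returns `"lights on"`, whereas a result that contains no command word is treated as not meeting the criterion.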
[0036]
When the first speech recognition processing unit 13 obtains a speech recognition result
satisfying the predetermined reliability criterion from a directional speech stream, it estimates
that a person's speech has arrived from the direction of arrival of that directional speech stream
(that is, the beam direction corresponding to that directional speech stream), and notifies the
sound source direction determination unit 14 of the beam direction. In addition, the first
speech recognition processing unit 13 outputs the speech recognition result obtained by speech
recognition for each directional speech stream to the speech recognition result output unit 16.
[0037]
The sound source direction determining unit 14 is sound source direction determining means
that, when the first voice recognition processing unit 13 obtains a voice recognition result that
satisfies the predetermined reliability criterion, determines the direction corresponding to the
audio stream for which that voice recognition result was obtained as the sound source direction.
[0038]
For example, when notified by the first speech recognition processing unit 13, as described
above, of the beam direction estimated to be the arrival direction of human speech, the sound
source direction determining unit 14 determines that beam direction as the sound source
direction.
Then, the sound source direction determining unit 14 acquires the directional audio stream
corresponding to the beam direction from the directivity control unit 12, and outputs the
directional audio stream to the second speech recognition processing unit 15. Alternatively, the
sound source direction determining unit 14 may receive the signals of the plurality of channels
obtained from the voice input unit 11 and perform its own signal processing on those signals,
thereby acquiring the directional audio stream corresponding to the beam direction notified from
the first voice recognition processing unit 13.
[0039]
In the initial state, in which no beam direction has yet been notified from the first speech
recognition processing unit 13, the sound source direction determination unit 14 may output to
the second speech recognition processing unit 15 the directional audio stream corresponding to
a preset initial beam direction, or may output no directional audio stream.
[0040]
Further, when a predetermined time set in advance has passed since the sound source direction
determining unit 14 was notified of a beam direction from the first speech recognition processing
unit 13 without any further notification of a beam direction, the sound source direction
determining unit 14 may return to the initial state.
Further, when the sound source direction determination unit 14 has been notified of one beam
direction from the first speech recognition processing unit 13 and is later notified of another
beam direction from the first speech recognition processing unit 13, it may determine the beam
direction notified later as the latest sound source direction, and output the directional voice
stream corresponding to that sound source direction to the second speech recognition
processing unit 15.
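The behaviour of the sound source direction determination described above (initial beam direction, replacement by the most recent notification, and return to the initial state after a period without notifications) can be sketched as a small state holder. The class name, the method names, the 10-second default, and the use of explicit timestamps are assumptions for illustration.

```python
class SourceDirectionTracker:
    """Keeps the most recently notified beam direction, with a timeout."""

    def __init__(self, initial_direction=None, timeout_s=10.0):
        # initial_direction may be None, meaning "output no stream" in the initial state.
        self.initial_direction = initial_direction
        self.timeout_s = timeout_s
        self._direction = initial_direction
        self._last_notified = None

    def notify(self, beam_direction, now):
        """Record a beam direction reported for a reliable recognition result."""
        self._direction = beam_direction
        self._last_notified = now

    def current(self, now):
        """Direction whose stream goes to the second recognizer, or None."""
        if (self._last_notified is not None
                and now - self._last_notified >= self.timeout_s):
            # No notification for the whole timeout period: back to the initial state.
            self._direction = self.initial_direction
            self._last_notified = None
        return self._direction
```

Passing the time in as an argument keeps the sketch deterministic; a real device would more likely read a monotonic clock at each call.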
[0041]
The second speech recognition processing unit 15 is second speech recognition processing
means that executes speech recognition on the speech stream from the sound source direction
determined by the sound source direction determination unit 14 among the speech streams
acquired by the directivity control unit 12. In the speech recognition processing by the second
speech recognition processing unit 15, a larger vocabulary than in the speech recognition by the
first speech recognition processing unit 13 is targeted for speech recognition, and speech
recognition with high accuracy is performed. Thereby, when the sound source direction is
determined by the sound source direction determining unit 14, the second voice recognition
processing unit 15 can perform more accurate voice recognition on the audio stream from that
sound source direction. For example, the voice recognition by the first voice recognition
processing unit 13 may be local-type voice recognition in which the recognition processing is
executed within the voice recognition device 1, and the voice recognition by the second voice
recognition processing unit 15 may be server-type speech recognition using an external server.
The speech recognition process by the second speech recognition unit 15 may be
the same as the speech recognition process by the first speech recognition unit 13.
[0042]
When executing the above-described server-type voice recognition, the second voice recognition
processing unit 15 may perform speech recognition by transmitting the directional voice stream
acquired from the sound source direction determination unit 14 to a server (not shown) having
a speech recognition function, having the server perform speech recognition, and receiving the
speech recognition result (for example, text; the same applies hereinafter) from the server. By
causing a server or the like provided with a high-performance speech recognition engine to
execute speech recognition in this way, the second speech recognition processing unit 15 can
perform more accurate speech recognition on the voice stream from the sound source direction
determined by the sound source direction determination unit 14. Transmission and reception of
data between the second speech recognition processing unit 15 and the above-described server
is performed, for example, through the Internet, a local area network (LAN), or the like by using
the communication function of the communication device 10E described above.
[0043]
The second speech recognition processing unit 15 executes speech recognition on the directional
speech stream acquired from the sound source direction determining unit 14 as described above,
and outputs the resulting speech recognition result to the speech recognition result output
unit 16.
[0044]
The voice recognition result output unit 16 is a voice recognition result output unit that outputs
the voice recognition result acquired from at least one of the first voice recognition processing
unit 13 and the second voice recognition processing unit 15.
The speech recognition result output unit 16 may output only the speech recognition result
acquired from one of the first speech recognition processing unit 13 and the second speech
recognition processing unit 15, or may output text obtained by combining the speech recognition
results acquired from both the first speech recognition processing unit 13 and the second speech
recognition processing unit 15. As a specific method of outputting the speech recognition result,
the speech recognition result output unit 16 may, for example, present the speech recognition
result (text) to the user by outputting it to the output device 10G such as a display, or may
synthesize the voice corresponding to the voice recognition result by a known method and
output the obtained voice to the output device 10G such as a speaker.
[0045]
Further, the voice recognition result output unit 16 may present the user with information
indicating some response result based on the voice recognition result, in addition to outputting
the voice recognition result as text, voice or the like. For example, in the case where the speech
recognition device 1 is configured as a device that executes information search based on the
content of speech from the user, the speech recognition result output unit 16 may present the
search result to the user by text, speech, or the like. Similarly, when the voice recognition device
1 is configured as a device that executes a predetermined device operation (for example, an
operation to turn off the light through a remote controller) based on the content of the speech
from the user, the voice recognition result output unit 16 may present the user the result of the
device operation (for example, information indicating that the light has been turned off) by text,
voice, or the like. In addition, when the speech recognition device 1 is configured as a device that
realizes dialogue with the user, such as chat, by answering the user's questions, the speech
recognition result output unit 16 may present a message responding to the user's question to
the user in text or voice.
[0046]
Subsequently, an example of the processing (speech recognition method) executed by the speech
recognition device 1 will be described with reference to the flowchart shown in FIG. First, voice
from the outside is continuously input to the voice input unit 11, which is configured by a
plurality of microphones and the like. Then, the directivity control unit 12 performs signal
processing on the signals of the plurality of channels obtained from the voice input unit 11 to
acquire an audio stream from each of a plurality of directions (that is, a directional audio
stream corresponding to each of a plurality of beam directions) (step S1, audio acquisition step).
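Step S1 can be illustrated with a minimal sketch. The source does not specify which signal
processing the directivity control unit 12 applies; the example below assumes simple
integer-sample delay-and-sum beamforming, and the names steer, directional_streams, and
delay_table are hypothetical, introduced only for illustration.

```python
# Hypothetical sketch of step S1: integer-sample delay-and-sum beamforming.
# The technique and all names are illustrative assumptions, not the
# source's stated method.

def steer(channels, delays):
    """Advance each microphone channel by its delay (in samples) and average."""
    length = len(channels[0])
    out = []
    for n in range(length):
        acc = 0.0
        for ch, d in zip(channels, delays):
            idx = n + d
            if 0 <= idx < length:  # samples outside the buffer count as silence
                acc += ch[idx]
        out.append(acc / len(channels))
    return out


def directional_streams(channels, delay_table):
    """Return one beamformed stream per beam direction."""
    return {direction: steer(channels, delays)
            for direction, delays in delay_table.items()}


# Two microphones; the second receives the same pulse one sample later,
# as it would for a source located on the first microphone's side.
mic0 = [0.0, 1.0, 0.0, 0.0]
mic1 = [0.0, 0.0, 1.0, 0.0]
delay_table = {"left": [0, 1], "front": [0, 0]}
streams = directional_streams([mic0, mic1], delay_table)
print(streams)
```

In this toy input, the "left" beam aligns the two pulses and keeps the full amplitude, while
the "front" beam smears the pulse over two samples at half amplitude. A practical system would
derive the delays from the microphone geometry and would likely use fractional delays.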
[0047]
Subsequently, speech recognition is performed on each directional speech stream by the first
speech recognition processing unit 13 (step S2, speech recognition processing step). Then, it is
judged by the first speech recognition processing unit 13 whether or not the speech recognition
result for each directional speech stream satisfies a predetermined reliability standard (step S3).
[0048]
If it is determined in step S3 that a speech recognition result satisfying the predetermined
reliability criterion has been obtained (step S3: YES), the first speech recognition processing
unit 13 notifies the sound source direction determination unit 14 of the beam direction
corresponding to the directional audio stream for which that speech recognition result was
obtained. Then, the sound source direction determination unit 14 determines the direction
corresponding to the audio stream for which the speech recognition result was obtained as the
sound source direction (step S4, sound source direction determination step), and outputs the
directional audio stream corresponding to the sound source direction to the second speech
recognition processing unit 15.
[0049]
On the other hand, if it is not determined in step S3 that a speech recognition result
satisfying the predetermined reliability criterion has been obtained (step S3: NO), the first
speech recognition processing unit 13 does not notify the sound source direction determination
unit 14 of a beam direction (a beam direction estimated to be the sound source direction). In
this case, the sound source direction determination unit 14 outputs, to the second speech
recognition processing unit 15, the directional audio stream corresponding to the initial beam
direction (or to the beam direction notified from the first speech recognition processing unit 13
within a certain period in the past).
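The two branches of step S3 can be sketched as follows. This is a minimal illustration under
stated assumptions: the reliability score, the threshold value, the length of the "certain
period" (hold_seconds), and the rule of picking the highest-scoring stream when several satisfy
the criterion are all hypothetical choices that the source does not specify. The class and
field names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass
class Result:
    direction: str      # beam direction of the directional audio stream
    text: str           # first-pass recognition result
    reliability: float  # hypothetical reliability score in [0, 1]


class DirectionDeterminer:
    """Illustrative stand-in for the sound source direction determination unit 14."""

    def __init__(self, initial_direction, threshold=0.8, hold_seconds=5.0):
        self.initial_direction = initial_direction
        self.threshold = threshold        # assumed reliability criterion
        self.hold_seconds = hold_seconds  # assumed "certain period in the past"
        self.last_direction = None
        self.last_time = None

    def update(self, results, now):
        reliable = [r for r in results if r.reliability >= self.threshold]
        if reliable:
            # Step S3: YES, so step S4 adopts the notified beam direction.
            # Taking the highest score among several is an assumption.
            best = max(reliable, key=lambda r: r.reliability)
            self.last_direction, self.last_time = best.direction, now
            return best.direction
        # Step S3: NO, so reuse a recently notified direction if one exists.
        if self.last_time is not None and now - self.last_time <= self.hold_seconds:
            return self.last_direction
        return self.initial_direction


determiner = DirectionDeterminer("front")
first = determiner.update(
    [Result("left", "turn off the light", 0.9), Result("front", "", 0.2)], now=0.0)
second = determiner.update([Result("left", "", 0.1)], now=3.0)
third = determiner.update([Result("left", "", 0.1)], now=10.0)
print(first, second, third)
```

The returned direction selects which directional audio stream is passed on to the second speech
recognition processing unit 15.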
[0050]
Subsequently, the second speech recognition processing unit 15 executes speech recognition on
the audio stream from the set sound source direction (step S5, second speech recognition
processing step).
Here, when it is determined in step S3 that a speech recognition result satisfying the
predetermined reliability criterion has been obtained (step S3: YES), the "set sound source
direction" is the sound source direction determined by the sound source direction determination
unit 14 in step S4. When it is not so determined (step S3: NO), the "set sound source direction"
is the initial beam direction (or the beam direction notified from the first speech recognition
processing unit 13 within a fixed period in the past).
[0051]
Subsequently, the speech recognition result output unit 16 outputs at least one of the speech
recognition result for each directional audio stream obtained by the first speech recognition
processing unit 13 and the speech recognition result obtained by the second speech recognition
processing unit 15 (step S6). Here, the speech recognition result may be presented to the user as
it is, as text or voice, or information indicating a response based on the speech recognition
result (for example, a search result, the result of a device operation, or a message answering
the user's question) may be presented to the user as text, voice, or the like.
[0052]
In the speech recognition apparatus 1 described above, the first speech recognition processing
unit 13 obtains a speech recognition result for each of the audio streams from the plurality of
directions acquired by the directivity control unit 12. In addition, when the first speech
recognition processing unit 13 obtains a speech recognition result satisfying the predetermined
reliability criterion, the sound source direction determination unit 14 determines the direction
corresponding to the audio stream for which that speech recognition result was obtained as the
sound source direction. Thus, according to the speech recognition apparatus 1, the sound source
direction can be determined while speech recognition is executed continuously, rather than
speech recognition being started only after the sound source direction has been determined. That
is, speech from any direction can be recognized without the delay associated with estimating the
sound source direction.
[0053]
More specifically, the first speech recognition processing unit 13 executes speech recognition
continuously, regardless of whether the sound source direction has been determined. Therefore,
when the user makes an utterance to cause the speech recognition apparatus 1 to execute some
process (for example, the above-mentioned device operation), at least the first speech
recognition processing unit 13 executes speech recognition immediately. Consequently, when that
speech recognition succeeds, the speech recognition apparatus 1 can appropriately execute
processing based on the speech recognition result without the delay associated with estimating
the sound source direction.
[0054]
Furthermore, after the sound source direction is determined, the second speech recognition
processing unit 15 executes more accurate speech recognition on the audio stream from the
determined sound source direction, which makes it possible to improve the speech recognition
accuracy. Therefore, according to the speech recognition apparatus 1, speech from any direction
can be recognized with high accuracy without the delay associated with estimating the sound
source direction.
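The interplay of the two recognition passes (steps S2 to S5) can be sketched as one processing
cycle. This is only an illustration: the recognizers are stand-in callables, strings stand in
for audio, and the threshold and the tie-breaking rule are assumptions not given in the source.
The function name recognize_cycle is hypothetical.

```python
def recognize_cycle(streams, first_pass, second_pass, direction, threshold=0.8):
    """Run steps S2 to S5 once; first_pass and second_pass are hypothetical callables."""
    # S2: first-pass recognition runs on every directional stream, every cycle,
    # so a result exists even before any sound source direction is determined.
    first = {d: first_pass(s) for d, s in streams.items()}  # d -> (text, score)
    # S3/S4: adopt the direction of a sufficiently reliable result, if any;
    # otherwise keep the direction passed in (initial or recently notified).
    reliable = {d: r for d, r in first.items() if r[1] >= threshold}
    if reliable:
        direction = max(reliable, key=lambda d: reliable[d][1])
    # S5: the more accurate second pass runs only on the selected stream.
    second = second_pass(streams[direction])
    return direction, first, second


# Stand-ins: strings play the role of audio, and upper-casing plays the role
# of a "more accurate" second recognizer. Both are purely illustrative.
def toy_first_pass(stream):
    return (stream, 0.9 if stream else 0.1)


def toy_second_pass(stream):
    return stream.upper()


streams = {"front": "", "left": "lights off"}
direction, first, second = recognize_cycle(
    streams, toy_first_pass, toy_second_pass, direction="front")
print(direction, first["left"], second)
```

Because the first pass never waits for a direction estimate, its result for the "left" stream
is available in the same cycle in which the direction is determined; the second pass then
refines the recognition on that one stream.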
[0055]
Subsequently, a speech recognition program for causing a computer to execute the series of
processes performed by the speech recognition apparatus 1 described above will be described. The
voice recognition program P1 is stored in a program storage area formed in a recording medium
that is inserted into a computer and accessed, or that is provided in the computer.
[0056]
As shown in FIG. 5, the voice recognition program P1 comprises a voice input module P11, a
directivity control module P12, a first voice recognition processing module P13, a sound source
direction determination module P14, a second voice recognition processing module P15, and a
voice recognition result output module P16. The functions realized by executing the voice input
module P11, the directivity control module P12, the first voice recognition processing module
P13, the sound source direction determination module P14, the second voice recognition
processing module P15, and the voice recognition result output module P16 are the same as the
functions of the voice input unit 11, the directivity control unit 12, the first speech
recognition processing unit 13, the sound source direction determination unit 14, the second
speech recognition processing unit 15, and the speech recognition result output unit 16 of the
speech recognition apparatus 1 described above, respectively.
[0057]
The voice recognition program P1 may be configured such that a part or all of the voice
recognition program P1 is transmitted via a transmission medium such as a communication line,
received by another device, and recorded (including installation). Also, each module of the speech
recognition program P1 may be installed on any of a plurality of computers instead of one
computer. In that case, the above-described series of processes of the speech recognition
apparatus 1 are performed by a computer system using the plurality of computers.
[0058]
Although a preferred embodiment and modifications of the present invention have been described
above, the present invention is not limited to the above-described embodiment, and various
modifications are possible without departing from the gist of the invention.
[0059]
DESCRIPTION OF SYMBOLS 1 ... speech recognition apparatus, 11 ... voice input unit, 12 ...
directivity control unit, 13 ... first speech recognition processing unit, 14 ... sound source
direction determination unit, 15 ... second speech recognition processing unit, 16 ... speech
recognition result output unit, P1 ... voice recognition program, P11 ... voice input module, P12 ...
directivity control module, P13 ... first voice recognition processing module, P14 ... sound source
direction determination module, P15 ... second voice recognition processing module, P16 ... voice
recognition result output module.